📦 GOETL

Modern ETL for LLM Dataset Preparation

Extract, transform, and load your data efficiently for Large Language Model training and analytics


Extract text from .pdf and .txt files easily

Clean, tokenize, and chunk text for LLM-friendly datasets

Output to JSONL, CSV, or directly to databases

Generate semantic graphs from code directories

Run as a web service for programmatic or UI-driven ETL

Intuitive React frontend for easy job configuration
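
The chunking step listed above can be pictured as a sliding window over tokens. The sketch below is not GOETL's implementation: the function name chunkTokens is hypothetical, tokens are assumed to be whitespace-separated words, and the size and overlap parameters are assumed to count tokens, which the page does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkTokens is a hypothetical helper (not part of GOETL). It splits tokens
// into windows of at most size tokens; consecutive windows share overlap tokens.
func chunkTokens(tokens []string, size, overlap int) ([][]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("need size > 0 and 0 <= overlap < size, got size=%d overlap=%d", size, overlap)
	}
	step := size - overlap // how far the window advances each time
	var chunks [][]string
	for start := 0; start < len(tokens); start += step {
		end := start + size
		if end > len(tokens) {
			end = len(tokens)
		}
		chunks = append(chunks, tokens[start:end])
		if end == len(tokens) {
			break // the last window reached the end; avoid a redundant tail chunk
		}
	}
	return chunks, nil
}

func main() {
	// Assumption: whitespace tokenization stands in for the real tokenizer.
	tokens := strings.Fields("one two three four five six seven")
	chunks, err := chunkTokens(tokens, 3, 1)
	if err != nil {
		panic(err)
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %s\n", i, strings.Join(c, " "))
	}
}
```

Repeating a few tokens between neighbouring chunks keeps text that straddles a chunk boundary visible in both chunks, at the cost of some duplicated tokens in the dataset.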

High-performance ETL engine and REST API

User-friendly web UI

Serves static UI and reverse-proxies API requests

Containerized and orchestratable

docker build -t anuragsingh086/goetl:latest .
docker run -p 8080:8080 -v "$(pwd)/samples:/data" anuragsingh086/goetl:latest

go run ./cmd/main.go -input samples/demo.pdf -output output/data.jsonl -format jsonl

POST /api/etl

{
  "input": "/data/demo.pdf",
  "output": "output/data.jsonl",
  "chunksize": 200,
  "overlap": 20,
  "format": "jsonl"
}

Learn how to use GOETL effectively

Complete API documentation

Step-by-step guides for common tasks

Real-world examples and use cases

We welcome contributions from the community!

Help us improve by reporting bugs

Submit your code contributions

Show your support for the project