📦 GOETL

Modern ETL for LLM Dataset Preparation

Extract, transform, and load your data efficiently for Large Language Model (LLM) training and analytics.

🚀 Features

Extract

Extract text from .pdf and .txt files easily

Transform

Clean, tokenize, and chunk text for LLM-friendly datasets

Load

Output to JSONL, CSV, or directly to databases

Semantic Analysis

Generate semantic graphs from code directories

REST API

Run as a web service for programmatic or UI-driven ETL

Web UI

Intuitive React frontend for easy job configuration
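The Transform step chunks text into overlapping windows. A minimal sketch of that idea in Go is shown here. It is not GOETL's actual implementation: the helper name `chunkWords` is hypothetical, and it assumes chunk size and overlap are counted in whitespace-separated words, which the project may define differently.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWords is a hypothetical helper (not part of GOETL's API).
// It splits text on whitespace and groups the words into chunks of at most
// `size` words, where each chunk repeats the last `overlap` words of the
// previous one. Counting in words is an assumption made for illustration.
func chunkWords(text string, size, overlap int) ([]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("invalid size=%d overlap=%d", size, overlap)
	}
	words := strings.Fields(text)
	step := size - overlap // how far the window advances each iteration
	var chunks []string
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break // the last window reached the end of the text
		}
	}
	return chunks, nil
}

func main() {
	chunks, err := chunkWords("a b c d e f g", 4, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

With a size of 4 and an overlap of 1, the seven-word input yields two chunks, "a b c d" and "d e f g". The shared word keeps some context across the chunk boundary.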

🏗️ Architecture


Go Backend

High-performance ETL engine and REST API

React Frontend

User-friendly web UI

Caddy

Serves static UI and reverse-proxies API requests

Docker & Kubernetes

Containerized and orchestratable

⚡ Quick Start

Build and run with Docker:

docker build -t anuragsingh086/goetl:latest .
docker run -p 8080:8080 -v "$(pwd)/samples:/data" anuragsingh086/goetl:latest

Web UI: http://localhost:8080
API: http://localhost:8080/api/etl

Run from the command line:

go run ./cmd/main.go -input samples/demo.pdf -output output/data.jsonl -format jsonl

Or submit a job through the REST API:

POST /api/etl

{
  "input": "/data/demo.pdf",
  "output": "output/data.jsonl",
  "chunksize": 200,
  "overlap": 20,
  "format": "jsonl"
}

📚 Documentation

💡 Contribute

We welcome contributions from the community!