Ingestion
INDW ingests raw documents from Hugging Face, S3, local disk, and web crawls and normalizes them to JSONL under raw/<source>/data.jsonl. The ingest subsystem handles download, format conversion, and incremental resume.
Source layout
Each source directory name becomes the source identifier used in mixture weights, balance caps, and checkpoint tracking.
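As an illustration of this layout, the following sketch lists the sources found under a raw/ directory. It is not part of INDW: `discover_sources` is a hypothetical helper, and it assumes only what is stated above, namely that each source lives at raw/<source>/data.jsonl and that the directory name is the source identifier.

```python
from pathlib import Path


def discover_sources(raw_root):
    """Map each source identifier to its data.jsonl path.

    The identifier is the directory name under raw_root, matching the
    raw/<source>/data.jsonl layout. Directories without a data.jsonl
    file and stray files are ignored.
    """
    sources = {}
    for child in sorted(Path(raw_root).iterdir()):
        data_file = child / "data.jsonl"
        if child.is_dir() and data_file.is_file():
            sources[child.name] = data_file
    return sources
```

Calling `discover_sources("raw")` returns a dict keyed by directory name, which is the same identifier you would then use in mixture weights and balance caps.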
Hugging Face datasets
DatasetDownloader streams Hugging Face datasets directly to JSONL:
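The DatasetDownloader interface is not reproduced here, so the sketch below shows the underlying idea with the public `datasets` library instead. `write_jsonl` and `stream_hf_to_jsonl` are hypothetical helper names, and the `text_field` parameter is an assumption about which column holds the document body.

```python
import json
from pathlib import Path


def write_jsonl(records, out_path, text_field="text"):
    """Write an iterable of dict records to JSONL, one object per line.

    Records without a non-empty string in text_field are skipped.
    Returns the number of records written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            text = record.get(text_field)
            if not isinstance(text, str) or not text:
                continue
            fh.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            written += 1
    return written


def stream_hf_to_jsonl(dataset_name, out_path, split="train", text_field="text"):
    """Stream a Hugging Face dataset split to JSONL record by record."""
    # Imported lazily so write_jsonl works without the library installed.
    from datasets import load_dataset  # requires `pip install datasets`

    # streaming=True yields records lazily instead of downloading the
    # whole dataset before iteration starts.
    stream = load_dataset(dataset_name, split=split, streaming=True)
    return write_jsonl(stream, out_path, text_field=text_field)
```

Because the records are consumed lazily, memory use stays roughly constant regardless of dataset size.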
Install with fast HF transfer:
FastDatasetPipeline
For end-to-end ingest + merge workflows:
FastDatasetPipeline runs three phases:
- Download: stream sources to work/raw/
- Merge: quality merge to work/filtered.jsonl
- Export: optional token bin export
Incremental ingestion
Resume partial downloads with incremental sources:
run_incremental_stage tracks per-source download progress and skips completed shards.
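The skip-completed-shards behavior can be sketched as follows. This is not the run_incremental_stage implementation: `ShardProgress`, `run_incremental`, the JSON state file, and the `fetch` callback are all hypothetical, chosen only to show one way per-source progress could be persisted between runs.

```python
import json
from pathlib import Path


class ShardProgress:
    """Track completed shards per source in a small JSON state file."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        if self.state_path.exists():
            self.done = json.loads(self.state_path.read_text(encoding="utf-8"))
        else:
            self.done = {}

    def is_done(self, source, shard):
        return shard in self.done.get(source, [])

    def mark_done(self, source, shard):
        shards = self.done.setdefault(source, [])
        if shard not in shards:
            shards.append(shard)
        # Write to a temp file and rename it into place so a crash
        # cannot leave a half-written state file behind.
        tmp = self.state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.done), encoding="utf-8")
        tmp.replace(self.state_path)


def run_incremental(source, shards, fetch, progress):
    """Fetch only shards not yet recorded as complete; return those fetched."""
    fetched = []
    for shard in shards:
        if progress.is_done(source, shard):
            continue
        fetch(source, shard)
        # Recorded only after fetch returns, so a failed shard is retried.
        progress.mark_done(source, shard)
        fetched.append(shard)
    return fetched
```

A shard is marked complete only after its fetch returns, so an interrupted run resumes at the first unfinished shard.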
Local disk ingestion
Place pre-formatted JSONL files directly:
Convert other formats with a preprocessing script:
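A preprocessing script for CSV input might look like the sketch below. `csv_to_jsonl` and its column parameters are hypothetical; the only part taken from this page is the output shape, a JSONL file whose objects carry a required text field and an optional url field.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path, text_column="text", url_column=None):
    """Convert a CSV file to JSONL with a text field and an optional url field.

    Returns the number of documents written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows with no document body
            doc = {"text": text}
            if url_column and row.get(url_column):
                doc["url"] = row[url_column]
            dst.write(json.dumps(doc, ensure_ascii=False) + "\n")
            written += 1
    return written
```

For example, `csv_to_jsonl("articles.csv", "raw/articles/data.jsonl", text_column="body", url_column="link")` would produce a file ready to place under the source directory, assuming raw/articles/ already exists.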
Document schema
| Field | Required | Description |
|---|---|---|
| text | Yes | Document body |
| url | No | Source URL for provenance and licensing |
| meta | No | Arbitrary metadata dict |
| id | No | Stable document identifier |
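To check files against this schema before ingestion, a validator along these lines may help. It is a sketch rather than INDW's own validation: the function names are hypothetical, and the type rules (text is a non-empty string, url is a string, id is a string or integer) are assumptions, since the table specifies only which fields are required.

```python
import json


def validate_document(doc):
    """Return a list of schema problems; an empty list means the document passes."""
    if not isinstance(doc, dict):
        return ["document is not a JSON object"]
    problems = []
    text = doc.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required and must be a non-empty string")
    # The optional fields are checked only when present.
    if "url" in doc and not isinstance(doc["url"], str):
        problems.append("url must be a string")
    if "meta" in doc and not isinstance(doc["meta"], dict):
        problems.append("meta must be a dict")
    if "id" in doc and not isinstance(doc["id"], (str, int)):
        problems.append("id must be a string or integer")
    return problems


def validate_line(line):
    """Parse one JSONL line and validate it against the schema."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    return validate_document(doc)
```

Running `validate_line` over each line of a data.jsonl file and collecting the non-empty results gives a quick report of malformed records.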
Increase write_buffer_bytes for high-throughput downloads. The default is 8 MB; raising it to 64 MB reduces syscall overhead on fast networks.
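The effect of a larger write buffer can be illustrated with plain Python file I/O. How INDW applies write_buffer_bytes internally is not shown on this page, so `copy_stream` below is a hypothetical stand-in, and it assumes 8 MB means 8 * 1024 * 1024 bytes.

```python
def copy_stream(chunks, out_path, write_buffer_bytes=8 * 1024 * 1024):
    """Write an iterable of byte chunks through a buffer of the given size.

    Returns the total number of bytes written.
    """
    total = 0
    # `buffering` sets the size of Python's in-process write buffer.
    # Small writes accumulate there and reach the OS when the buffer
    # fills, on an explicit flush, or when the file closes.
    with open(out_path, "wb", buffering=write_buffer_bytes) as fh:
        for chunk in chunks:
            fh.write(chunk)
            total += len(chunk)
    return total
```

With many small chunks, a larger buffer means fewer write system calls, at the cost of holding more unwritten data in memory.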
Content hashing
indw.ingest.hash computes content hashes at ingest time for dedup pre-checks. Hashes are SHA-256 of normalized text content.
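A minimal sketch of such a hash follows. The exact normalization used by indw.ingest.hash is not documented here, so the steps below (Unicode NFC, then collapsing whitespace runs to single spaces) are assumptions, and both function names are hypothetical.

```python
import hashlib
import unicodedata


def normalize_text(text):
    """Assumed normalization: NFC, then collapse whitespace runs to single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def content_hash(text):
    """Return the SHA-256 hex digest of the normalized text, encoded as UTF-8."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
```

Under this normalization, documents that differ only in whitespace or Unicode composition produce the same hash, which is what makes it usable as a cheap dedup pre-check.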