Storage
INDW storage layer: corpus registry, atomic I/O, columnar formats, and JSONL export.
The storage subsystem (indw.store) manages corpus metadata, atomic file I/O, checkpoint persistence, and training-ready export formats. Work directories are the operational source of truth for every merge run.
Corpus registry
CorpusRegistry tracks corpus identity, source manifests, and metadata within a work directory.
The registry persists corpus manifests used by FastDatasetPipeline for multi-phase ingest-merge workflows.
Input format
INDW reads JSONL from raw/<source_name>/data.jsonl. Each line is a single JSON object.
Required field: text. Additional fields are preserved through the pipeline and written to output.
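The input contract above can be illustrated with a small reader. This is a minimal sketch using only the standard library, not INDW's own loader: the helper name iter_documents and the extra fields source and url in the sample records are hypothetical, chosen only to show that a record needs text and may carry additional fields.

```python
import json
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, requiring a string `text` field."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            raise ValueError(f"line {lineno}: missing required string field 'text'")
        yield record


# Hypothetical sample records; `source` and `url` are illustrative extra fields.
sample = [
    '{"text": "First document.", "source": "web", "url": "https://example.com/a"}',
    '{"text": "Second document."}',
]
docs = list(iter_documents(sample))
```

In practice the same function could be passed an open file handle for raw/<source_name>/data.jsonl, since file objects iterate line by line.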
Output format
Filtered output is JSONL with pipeline annotations appended to each record.
Output is written atomically via indw.store.io.atomic, so a partial write never corrupts the output file.
Atomic I/O
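Atomic writes use the temp-file-then-rename pattern to guarantee crash safety: data is written to a temporary file and then renamed over the target, so a crash leaves either the previous file or the complete new one.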
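The temp-file-then-rename pattern can be sketched with the standard library as follows. This is an illustration of the general technique, not the code in indw.store.io.atomic; the function name atomic_write_text is hypothetical. The temporary file is created in the target's directory so that the rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write_text(path: str, data: str) -> None:
    """Write `data` to `path` so readers see the old file or the full new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # On any failure, remove the temp file and leave the target untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A caller would use it as atomic_write_text("out/filtered.jsonl", payload), where the path is only an example. A stricter variant might also fsync the containing directory after the rename; that step is omitted here for brevity.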
JSON codec
indw.store.io.json_codec uses orjson for high-throughput serialization. The codec handles numpy types, datetime objects, and nested structures in document metadata.
Columnar storage
For analytics and corpus evaluation, INDW supports columnar formats via indw.store.io.columnar. Use columnar storage for intermediate artifacts that benefit from column-oriented access patterns.
Export utilities
| Function | Purpose |
|---|---|
| export_token_bins_fast | Export token-binned training shards |
| build_pretrain_dataloader | Memmap-backed PyTorch dataloader |
| build_val_dataloader | Validation split dataloader |
Work directory layout
Manifest
indw.store.corpus.manifest generates corpus manifests with document counts, source breakdowns, quality score distributions, and an output hash for reproducibility tracking.
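To make the manifest contents concrete, here is a rough sketch of how a document count, a source breakdown, and an output hash could be derived from JSONL output. It does not reproduce indw.store.corpus.manifest: the function build_manifest, the manifest key names, the source field, and the choice of SHA-256 are all assumptions made for illustration, and quality score distributions are left out.

```python
import hashlib
import json
from collections import Counter


def build_manifest(jsonl_bytes: bytes) -> dict:
    """Summarize JSONL output: record count, per-source counts, content hash."""
    sources: Counter = Counter()
    count = 0
    for line in jsonl_bytes.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        count += 1
        # `source` is a hypothetical field name; records without it are "unknown".
        sources[record.get("source", "unknown")] += 1
    return {
        "document_count": count,
        "sources": dict(sources),
        # Hash the exact output bytes so identical runs give identical digests.
        "output_sha256": hashlib.sha256(jsonl_bytes).hexdigest(),
    }


# Hypothetical output used only to exercise the function.
output = (
    b'{"text": "a", "source": "web"}\n'
    b'{"text": "b", "source": "books"}\n'
    b'{"text": "c", "source": "web"}\n'
)
manifest = build_manifest(output)
```

Hashing the raw bytes rather than the parsed records means any change in field order or whitespace alters the digest, which is usually what reproducibility checks want.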
Keep work directories for production runs. They contain the resolved config, checkpoints for resume, and audit artifacts needed by indw audit --kind pipeline.