Outputs
Configure INDW output paths, JSONL format, token bin export, and work directory artifacts.
INDW produces training-ready JSONL output, optional token bin shards, and operational artifacts in the work directory.
Primary output
The merge command writes filtered JSONL to the specified output path:
Each output line is a JSON object with the cleaned document text and pipeline annotations.
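As a minimal sketch of consuming this output, the snippet below streams a JSONL file one line at a time, so a large output never has to fit in memory. The `iter_jsonl` helper is a hypothetical name chosen for illustration and is not part of INDW.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
```

Because the helper is a generator, downstream code can filter or count records without loading the whole file.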
Output document schema
| Field | Type | Description |
|---|---|---|
| text | string | Cleaned document body |
| url | string | Source URL (if present in input) |
| quality_score | float | Composite quality score (0–1) |
| doc_tier | int | PCI admission tier (0 = accepted) |
| doc_content_hash | string | SHA-256 of normalized content |
| domain | string | Content domain classification |
| meta | object | Preserved input metadata |
Example:
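As a minimal sketch, a record following the schema above can be parsed and checked in Python. The field names and types come from the table; the record values are invented for illustration and do not come from a real run. Treating only `text` as required is an assumption, and `schema_errors` is a hypothetical helper, not an INDW API.

```python
import json

# Field names and types are taken from the schema table above.
# Assumption: only "text" is required; other fields may be absent.
SCHEMA = {
    "text": str,
    "url": str,
    "quality_score": float,
    "doc_tier": int,
    "doc_content_hash": str,
    "domain": str,
    "meta": dict,
}


def schema_errors(record):
    """Return a list of human-readable schema problems for one record."""
    errors = []
    if "text" not in record:
        errors.append("missing field: text")
    for field, expected in SCHEMA.items():
        if field not in record:
            continue
        value = record[field]
        # A JSON integer such as 1 is acceptable where a float is expected.
        ok = isinstance(value, (int, float)) if expected is float else isinstance(value, expected)
        if isinstance(value, bool) and expected in (int, float):
            ok = False  # bool is a subclass of int in Python
        if not ok:
            errors.append(f"{field}: expected {expected.__name__}, got {type(value).__name__}")
    score = record.get("quality_score")
    if isinstance(score, (int, float)) and not isinstance(score, bool) and not 0 <= score <= 1:
        errors.append("quality_score: outside the range 0 to 1")
    return errors


# Hypothetical record; every value is invented for illustration.
line = json.dumps({
    "text": "Example document body.",
    "url": "https://example.com/page",
    "quality_score": 0.87,
    "doc_tier": 0,
    "doc_content_hash": "0" * 64,
    "domain": "example",
    "meta": {},
})
print(schema_errors(json.loads(line)))  # prints []
```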
Output hash
The merge result includes sorted_output_hash, a deterministic hash of the output content that does not depend on the order in which documents are processed within batches:
Use this hash to verify reproducibility across runs, backends, and worker counts.
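INDW's exact hashing procedure is not described in this section. To illustrate how a content hash can be made independent of line order, the sketch below hashes each line, sorts the digests, and hashes the sorted sequence. This is an assumed construction, so its value will not match `sorted_output_hash`, and `order_independent_hash` is a hypothetical name.

```python
import hashlib


def order_independent_hash(lines):
    """Hash an iterable of text lines so that line order does not matter.

    Illustrative construction only; not INDW's actual algorithm.
    """
    # Hash every line on its own, then sort the raw digests.
    digests = sorted(hashlib.sha256(line.encode("utf-8")).digest() for line in lines)
    outer = hashlib.sha256()
    for digest in digests:
        outer.update(digest)
    return outer.hexdigest()


run_a = ['{"text": "a"}', '{"text": "b"}', '{"text": "c"}']
run_b = list(reversed(run_a))  # same content, different order
print(order_independent_hash(run_a) == order_independent_hash(run_b))  # prints True
```

Sorting digests rather than raw lines keeps the comparison cost fixed per record, whatever the document length.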
Append mode
Append new documents to existing output without clearing prior results:
Token bin export
Export token-binned training shards for direct dataloader consumption:
Build PyTorch dataloaders from memmap streams:
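The on-disk layout of INDW token bins is not specified in this section. Assuming, purely for illustration, that a shard is a flat array of native-endian unsigned 16-bit token IDs, the sketch below memory-maps a shard and yields fixed-length windows. It uses only the standard library so that it runs without dependencies; the same windows could back a PyTorch `Dataset`. The names `iter_token_windows` and `write_demo_shard` are hypothetical.

```python
import mmap
import os
import tempfile
from array import array


def iter_token_windows(path, seq_len):
    """Yield lists of seq_len token ids from a flat uint16 shard (assumed layout)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm)
            tokens = view.cast("H")  # reinterpret bytes as unsigned 16-bit ints
            try:
                # Drop any trailing tokens that do not fill a whole window.
                for i in range(len(tokens) // seq_len):
                    yield tokens[i * seq_len:(i + 1) * seq_len].tolist()
            finally:
                # Release the views so the mmap can be closed cleanly.
                tokens.release()
                view.release()


def write_demo_shard(path, token_ids):
    """Write a hypothetical shard: a flat array of uint16 token ids."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "demo.bin")
    write_demo_shard(shard, range(10))
    windows = list(iter_token_windows(shard, seq_len=4))
    print(windows)  # prints [[0, 1, 2, 3], [4, 5, 6, 7]]
```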
Work directory artifacts
| Artifact | Path | Purpose |
|---|---|---|
| Checkpoint | work/checkpoint.json | Per-source line offsets |
| Resolved config | work/resolved_config.yaml | Pinned quality profile |
| Merge lock | work/merge.lock | Concurrency guard |
| Exact dedup index | work/dedup/exact_index.bin | Hash index state |
| Fuzzy LSH | work/dedup/fuzzy_lsh.pkl | MinHash state |
| Stage profile | work/audit/stage_profile.json | Per-stage timing |
| Reject log | work/audit/reject_log.jsonl | Rejection reasons |
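As one way to use these artifacts, the sketch below tallies rejection reasons from the reject log. The record layout of `reject_log.jsonl` is not documented in this section, so the `reason` field name is an assumption; pass a different `field` if the log uses another key. `count_reject_reasons` is a hypothetical helper.

```python
import json
from collections import Counter


def count_reject_reasons(path, field="reason"):
    """Count values of `field` across a JSONL reject log.

    The default field name "reason" is an assumption about the log format.
    """
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            counts[record.get(field, "<missing>")] += 1
    return counts
```

For example, `count_reject_reasons("work/audit/reject_log.jsonl").most_common(5)` would list the five most frequent values, provided the log contains that field.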
Licensing provenance
Include licensing metadata in output when needed for compliance:
This adds license classification and provenance fields to each output document.
Compression
Install the compress extra for ZSTD-compressed output streams:
Write output JSONL to local SSD or high-throughput network storage. Because the apply coordinator serializes writes in document order, slow storage becomes the bottleneck at scale.