Quickstart
Run a complete INDW merge pipeline on a small raw corpus and verify deterministic output, all in under five minutes.
1. Prepare raw input
INDW expects JSONL files under raw/<source_name>/data.jsonl. Each line is a JSON object with at least a text field.
Example document:
json
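The same layout can also be produced programmatically. The following is a minimal sketch, not part of INDW: the helper name `write_source`, the source name `example_source`, and the extra `url` field are illustrative assumptions. Only the `<source_name>/data.jsonl` layout and the required `text` field come from this guide.

```python
import json
import tempfile
from pathlib import Path


def write_source(raw_dir, source_name, docs):
    """Write docs as JSONL to <raw_dir>/<source_name>/data.jsonl (hypothetical helper)."""
    path = Path(raw_dir) / source_name / "data.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for doc in docs:
            # Each line must be a JSON object with at least a "text" field.
            if not isinstance(doc, dict) or "text" not in doc:
                raise ValueError("each document needs a 'text' field")
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = Path(tmp) / "raw"
        docs = [{"text": "Hello, world.", "url": "https://example.com/a"}]
        print(write_source(raw_dir, "example_source", docs).read_text(encoding="utf-8"))
```

Validating the `text` field at write time is a design choice of this sketch; it surfaces malformed records before the pipeline ever reads them.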
2. Run merge
bash
| Flag | Purpose |
|---|---|
| raw_dir | Directory containing */data.jsonl source files |
| out_path | Output JSONL path |
| --work-dir | Checkpoints, resolved config, audit artifacts |
| --workers | Parallel worker count |
| --fresh | Clear prior checkpoints and start clean |
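When the merge step is launched from a script, the arguments in the table can be assembled as a list. The sketch below is hypothetical: `build_merge_args` is not an INDW function, the positional order (raw_dir, then out_path) is assumed from the table order, and the executable name and subcommand are deliberately left out because they are not shown here.

```python
def build_merge_args(raw_dir, out_path, work_dir=None, workers=None, fresh=False):
    """Assemble merge arguments from the table above (argument order is an assumption)."""
    args = [str(raw_dir), str(out_path)]
    if work_dir is not None:
        args += ["--work-dir", str(work_dir)]
    if workers is not None:
        if workers < 1:
            raise ValueError("workers must be a positive integer")
        args += ["--workers", str(workers)]
    if fresh:
        # Only pass --fresh when prior checkpoints should be discarded.
        args.append("--fresh")
    return args


if __name__ == "__main__":
    print(build_merge_args("raw", "out/merged.jsonl", work_dir="work", workers=4, fresh=True))
    # prints ['raw', 'out/merged.jsonl', '--work-dir', 'work', '--workers', '4', '--fresh']
```

The returned list is meant to be appended to whatever command invokes the merge in your installation, for example before handing it to `subprocess.run`.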
3. Inspect output
bash
Surviving documents retain source metadata and gain pipeline annotations (quality_score, doc_tier, dedup hashes).
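For a quick look at those annotations, the output file can be summarized with a few lines of Python. This is a sketch under assumptions: `summarize_output` is a hypothetical helper, `quality_score` is assumed to be numeric, and the sample `doc_tier` values in the demo are invented for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_output(out_path):
    """Count documents per doc_tier and average quality_score in an output JSONL file."""
    tiers = Counter()
    scores = []
    total = 0
    with open(out_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            doc = json.loads(line)
            total += 1
            tiers[doc.get("doc_tier")] += 1
            score = doc.get("quality_score")
            if isinstance(score, (int, float)):
                scores.append(score)
    mean = sum(scores) / len(scores) if scores else None
    return {"documents": total, "tiers": dict(tiers), "mean_quality_score": mean}


if __name__ == "__main__":
    # Demo with invented records; real output comes from the merge step.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "merged.jsonl"
        sample.write_text(
            '{"text": "a", "quality_score": 0.5, "doc_tier": "high"}\n'
            '{"text": "b", "quality_score": 1.0, "doc_tier": "low"}\n',
            encoding="utf-8",
        )
        print(summarize_output(sample))
```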
4. Validate determinism
bash
The parity suite confirms that local and multiprocess backends produce identical output hashes for the same configuration.
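Independently of the parity suite, two output files can be compared by hand. The sketch below is not the suite itself; it assumes that identical output means byte-identical files (including document order), and the helper names `file_sha256` and `outputs_match` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_a, path_b):
    """True when both files have the same SHA-256 digest."""
    return file_sha256(path_a) == file_sha256(path_b)


if __name__ == "__main__":
    # Demo with two identical stand-in files; in practice, compare the
    # outputs of two runs made with the same configuration.
    with tempfile.TemporaryDirectory() as tmp:
        a, b = Path(tmp) / "run_a.jsonl", Path(tmp) / "run_b.jsonl"
        a.write_bytes(b'{"text": "same"}\n')
        b.write_bytes(b'{"text": "same"}\n')
        print(outputs_match(a, b))
```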
Python API
python
Custom quality profile
Load a YAML profile instead of defaults:
python
Resume interrupted runs
Omit --fresh to resume from the last checkpoint. INDW tracks per-source line offsets in the work directory.
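To illustrate the idea of offset-based resumption, here is a minimal sketch. It is not INDW's implementation: the checkpoint file format (a JSON map from source name to next line number), the function names, and the write-after-every-line policy are all assumptions chosen for clarity.

```python
import json
import tempfile
from pathlib import Path


def load_offsets(checkpoint_path):
    """Read the per-source offsets, or return an empty map if no checkpoint exists."""
    path = Path(checkpoint_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def process_source(source_name, data_path, checkpoint_path, handle_line, fresh=False):
    """Process lines after the saved offset; fresh=True ignores prior checkpoints."""
    offsets = {} if fresh else load_offsets(checkpoint_path)
    start = offsets.get(source_name, 0)
    processed = 0
    with open(data_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if line_number < start:
                continue  # already handled in an earlier run
            handle_line(line.rstrip("\n"))
            # Record progress only after the line was handled successfully.
            offsets[source_name] = line_number + 1
            Path(checkpoint_path).write_text(json.dumps(offsets), encoding="utf-8")
            processed += 1
    return processed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.jsonl"
        data.write_text('{"text": "a"}\n{"text": "b"}\n', encoding="utf-8")
        checkpoint = Path(tmp) / "offsets.json"
        print(process_source("example_source", data, checkpoint, print))
        # A second call finds nothing left to do.
        print(process_source("example_source", data, checkpoint, print))
```

Writing the checkpoint after every line keeps the sketch simple; a real pipeline would more likely batch these writes.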