Quickstart

Run a complete merge pipeline on a small raw corpus and verify deterministic output in under five minutes.

1. Prepare raw input

INDW expects JSONL files under raw/<source_name>/data.jsonl. Each line is a JSON object with at least a text field.

  • my-pipeline
    • raw
      • web-crawl
        • data.jsonl
    • pipeline.yaml

Example document:

json
{"text": "The transformer architecture uses self-attention to model sequence dependencies...", "url": "https://example.com/doc1"}

2. Run merge

bash
indw merge ./raw ./out/filtered.jsonl --work-dir ./work --workers 2 --fresh

| Flag | Purpose |
| --- | --- |
| raw_dir | Directory containing */data.jsonl source files |
| out_path | Output JSONL path |
| --work-dir | Checkpoints, resolved config, audit artifacts |
| --workers | Parallel worker count |
| --fresh | Clear prior checkpoints and start clean |

3. Inspect output

bash
wc -l ./out/filtered.jsonl
head -n 1 ./out/filtered.jsonl | python -m json.tool

Surviving documents retain source metadata and gain pipeline annotations (quality_score, doc_tier, dedup hashes).
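
For a quick overview beyond wc and head, the output can be tallied in Python. This is a sketch under assumptions: summarize_output is a hypothetical helper, not an INDW function, and the value type of doc_tier is not documented here, so values are converted to strings before counting.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_output(path):
    """Return (document count, Counter of doc_tier values, Counter of field names)."""
    count = 0
    tiers = Counter()
    fields = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            doc = json.loads(line)
            count += 1
            fields.update(doc.keys())
            # doc_tier is named in the text above; its value type is an assumption,
            # so it is stringified, and absent values are counted as "<missing>".
            tiers[str(doc.get("doc_tier", "<missing>"))] += 1
    return count, tiers, fields
```

The field counter shows at a glance whether every surviving document carries annotations such as quality_score.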

4. Validate determinism

bash
indw validate

The parity suite confirms that local and multiprocess backends produce identical output hashes for the same configuration.
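
A manual spot check of the same property is to run the merge twice into separate output files and compare their hashes. The sketch below is not what indw validate does internally. It assumes byte-identical output is the criterion, and the helper names and any output paths you pass in are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_a, path_b):
    """True when both files have identical contents, judged by their SHA-256 digests."""
    return file_sha256(path_a) == file_sha256(path_b)
```

For example, outputs_match("./out/run_a.jsonl", "./out/run_b.jsonl") should return True for two runs with the same configuration if the output is deterministic at the byte level.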

Python API

python
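from pathlib import Path
from indw.filter.spec.quality import QualityPipelineConfig
from indw.schedule import merge_with_quality

# Same arguments as the CLI command in step 2, with the default quality profile.
merge_with_quality(
    Path("./raw"),                  # directory containing */data.jsonl source files
    Path("./out/filtered.jsonl"),   # output JSONL path
    quality_config=QualityPipelineConfig(),
    work_dir=Path("./work"),        # checkpoints, resolved config, audit artifacts
    workers=2,                      # parallel worker count
    fresh=True,                     # clear prior checkpoints and start clean
)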

Custom quality profile

Load a YAML profile instead of defaults:

python
import yaml  # PyYAML
from pathlib import Path
from indw.filter.spec.quality import QualityPipelineConfig
from indw.schedule import merge_with_quality

# Parse the YAML profile into a dict, then build the config from it.
profile_path = Path("configs/filtering/quality_fast_first.yaml")
cfg = QualityPipelineConfig.from_dict(yaml.safe_load(profile_path.read_text()))

merge_with_quality(
    Path("./raw"),
    Path("./out/filtered.jsonl"),
    quality_config=cfg,
    work_dir=Path("./work"),
    fresh=True,
)

Resume interrupted runs

Omit --fresh to resume from the last checkpoint. INDW tracks per-source line offsets in the work directory.
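
To illustrate the general idea of resuming from per-source line offsets, here is a minimal standalone sketch. It does not reproduce INDW's checkpoint format or code: the JSON offsets file, the function names, and the per-line checkpoint write are all assumptions chosen for clarity.

```python
import json
from pathlib import Path


def load_offsets(checkpoint_path):
    """Read {source_name: last_processed_line} from a JSON file (hypothetical format)."""
    path = Path(checkpoint_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def process_source(source_name, data_path, checkpoint_path, handle_line, fresh=False):
    """Feed unprocessed lines to handle_line and return the last processed line number."""
    checkpoint = Path(checkpoint_path)
    if fresh:
        # Mirrors the idea behind --fresh: discard earlier progress and start clean.
        checkpoint.unlink(missing_ok=True)
    offsets = load_offsets(checkpoint)
    done = offsets.get(source_name, 0)
    with Path(data_path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno <= done:
                continue  # already handled in an earlier run
            handle_line(line)
            # Record progress only after the line succeeded. Writing on every line
            # is slow; a real pipeline would likely batch these updates.
            offsets[source_name] = lineno
            checkpoint.write_text(json.dumps(offsets), encoding="utf-8")
    return offsets.get(source_name, 0)
```

If handle_line raises partway through, the next call without fresh=True skips the lines already recorded and continues from the first unprocessed one.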

© 2026 Instant Developers. All rights reserved.