Overview

InstantAI Data Workflow (INDW) is an open-source data pipeline that ingests raw documents, applies multi-stage filtering and semantic cleaning, deduplicates at scale, and emits deterministic JSONL suitable for large-scale model training.

Hash-stable output: 100%

What INDW does

INDW transforms heterogeneous raw corpora into training-ready datasets. The same configuration and input corpus produce the same acceptance decisions and output hash whether you run on a laptop, a multi-core server, or a Dask cluster.
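One way to check this claim independently is to hash the emitted JSONL from two runs and compare the digests. The sketch below is not part of INDW; the helper names `file_sha256` and `outputs_match` are hypothetical, and SHA-256 is an assumption, since the hash INDW itself uses is not stated here.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(first: Path, second: Path) -> bool:
    """True when two output files have the same SHA-256 digest."""
    return file_sha256(first) == file_sha256(second)
```

Comparing, for example, the output of a run with one worker against a run with several workers should then return `True` if the outputs are byte-identical.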

Core capabilities

Stage0 Fast Rejection

Rejects 60–80% of low-value documents before expensive processing using lightweight heuristics and PCI admission gates.
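To make "lightweight heuristics" concrete, here is a minimal sketch of a cheap pre-filter that runs before any expensive processing. It is an illustration only: the `Stage0Thresholds` class, the `reject_reason` function, and every threshold value are hypothetical and do not reflect INDW's actual rules or its PCI admission gates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage0Thresholds:
    # Hypothetical limits chosen for illustration only.
    min_chars: int = 200
    max_chars: int = 500_000
    min_alpha_ratio: float = 0.6
    max_repeated_line_ratio: float = 0.3


def reject_reason(text: str, limits: Stage0Thresholds = Stage0Thresholds()) -> Optional[str]:
    """Return a short rejection reason, or None when the document passes."""
    length = len(text)
    if length < limits.min_chars:
        return "too_short"
    if length > limits.max_chars:
        return "too_long"

    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / length < limits.min_alpha_ratio:
        return "low_alpha_ratio"

    # Share of non-empty lines that duplicate an earlier line.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines:
        repeated = len(lines) - len(set(lines))
        if repeated / len(lines) > limits.max_repeated_line_ratio:
            return "repeated_lines"
    return None
```

Each check is a single pass over the text, so the cost stays far below that of semantic cleaning or fuzzy deduplication.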

Deterministic Merge

Applies results in strict sequence order and emits hash-verifiable JSONL output that is stable across worker counts and backends.
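The idea behind a sequence-ordered apply can be sketched as follows: workers may finish in any order, but results are buffered and released only when the next expected sequence number arrives, so the output order never depends on scheduling. This is a generic illustration, not INDW's scheduler; `apply_in_sequence` is a hypothetical name, and the sketch assumes sequence numbers are unique and contiguous.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def apply_in_sequence(results: Iterable[Tuple[int, str]], start: int = 0) -> Iterator[str]:
    """Yield payloads in sequence order regardless of arrival order.

    Assumes sequence numbers are unique and contiguous from `start`.
    """
    pending: List[Tuple[int, str]] = []  # min-heap keyed by sequence number
    next_seq = start
    for seq, payload in results:
        heapq.heappush(pending, (seq, payload))
        # Release every buffered result that is now next in line.
        while pending and pending[0][0] == next_seq:
            yield heapq.heappop(pending)[1]
            next_seq += 1
    if pending:
        # Input ended with a gap: something before the buffered items never arrived.
        raise ValueError(f"missing sequence number {next_seq}")
```

Because the emitted order is a function of the sequence numbers alone, the same inputs yield the same output file whether one worker or many produced them.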

Distributed Execution

Scale from local multiprocess execution to a Dask cluster with a single backend flag; no pipeline logic changes are required.

Quality Profiles

YAML-driven QualityPipelineConfig with adaptive calibration, balance caps, and observability hooks.

Pipeline stages

Stage	Purpose
Ingest	Download from Hugging Face, S3, or local disk; normalize to JSONL
Stage0 Filter	Fast rejection: length, language, PII, toxicity, licensing, exact doc dedup
Semantic Clean	ACIM artifact discovery, HTML processing, boilerplate removal
Knowledge Extract	Section routing, entity extraction, schema normalization
Deduplication	Exact hash, MinHash fuzzy, semantic FAISS-based
Scheduler	Merge graph, admission tiers, ordered apply, execution backends
Store	Corpus registry, atomic I/O, JSONL export
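The MinHash fuzzy strategy named in the Deduplication row can be illustrated with a small standard-library sketch: each document is reduced to a fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity of the documents' shingle sets. The function names, the 5-character shingles, and the 64 hash functions are assumptions made for this example, not INDW's implementation or defaults.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams over whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def _hash64(item: str, seed: int) -> int:
    """64-bit BLAKE2b hash of item, salted with the seed."""
    salt = seed.to_bytes(8, "little")
    raw = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(raw, "big")


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash64(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of signature positions where both documents agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. At scale, signatures are typically bucketed with locality-sensitive hashing rather than compared pairwise.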

When to use INDW

INDW targets operators building production training corpora who need predictable behavior at scale. Use it when you require reproducible acceptance decisions, multi-source mixture control, and backend-agnostic scaling from laptop to cluster.

Reproducibility first:

Run indw validate after any pipeline change. The parity suite verifies that the local and multiprocess backends produce matching output and that the output hash is identical for workers=1 and workers=N.

Install and run

pip install indw
indw doctor
indw merge ./raw ./out/filtered.jsonl --work-dir ./work --workers 2 --fresh