Overview
InstantAI Data Workflow — open-source corpus pipelines for petabyte-scale AI training data preparation.
InstantAI Data Workflow (INDW) is an open-source data pipeline that ingests raw documents, applies multi-stage filtering and semantic cleaning, deduplicates at scale, and emits deterministic JSONL suitable for large-scale model training.
What INDW does
INDW transforms heterogeneous raw corpora into training-ready datasets. The same configuration and input corpus produce the same acceptance decisions and output hash whether you run on a laptop, a multi-core server, or a Dask cluster.
Core capabilities
Stage0 Fast Rejection
Uses lightweight heuristics and PCI admission gates to reject 60–80% of low-value documents before any expensive processing runs.
Deterministic Merge
Applies results in strict sequence order, producing hash-verifiable JSONL output that stays stable across worker counts and backends.
Distributed Execution
Scales from a single machine to multiprocess or Dask execution with a single backend flag; pipeline logic stays unchanged.
Quality Profiles
YAML-driven QualityPipelineConfig with adaptive calibration, balance caps, and observability hooks.
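The Deterministic Merge idea can be illustrated with a small sketch. This is not INDW's implementation: the `process` and `run` functions, the toy rejection rule, and the use of a thread pool are all assumptions made for illustration. It shows only the general technique of tagging each document with a sequence number, letting workers finish in any order, and then applying results strictly by sequence so the output hash does not depend on the worker count.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def process(seq, doc):
    """Hypothetical per-document stage: returns (sequence number, record or None)."""
    text = doc["text"].strip()
    if len(text) < 5:  # toy rejection rule, not one of INDW's heuristics
        return seq, None
    return seq, {"id": doc["id"], "text": text}


def run(docs, workers):
    """Process docs in parallel, then apply results strictly in sequence order."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, seq, doc) for seq, doc in enumerate(docs)]
        for fut in as_completed(futures):  # completion order is nondeterministic
            seq, record = fut.result()
            results[seq] = record

    digest = hashlib.sha256()
    lines = []
    for seq in sorted(results):  # ordered apply: only the sequence number matters
        record = results[seq]
        if record is None:
            continue
        # sort_keys gives a canonical serialization, so equal records hash equally
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        lines.append(line)
        digest.update(line.encode("utf-8") + b"\n")
    return lines, digest.hexdigest()
```

Calling `run(docs, 1)` and `run(docs, 4)` on the same input should return the same digest, because neither the accepted set nor the write order depends on which worker finished first.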
Pipeline stages
| Stage | Purpose |
|---|---|
| Ingest | Download from Hugging Face, S3, or local disk; normalize to JSONL |
| Stage0 Filter | Fast rejection: length, language, PII, toxicity, licensing, exact doc dedup |
| Semantic Clean | ACIM artifact discovery, HTML processing, boilerplate removal |
| Knowledge Extract | Section routing, entity extraction, schema normalization |
| Deduplication | Exact-hash, fuzzy (MinHash), and semantic (FAISS-based) deduplication |
| Scheduler | Merge graph, admission tiers, ordered apply, execution backends |
| Store | Corpus registry, atomic I/O, JSONL export |
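As a rough illustration of what a fast-rejection stage with exact document dedup might look like, the sketch below applies a length check and a content-hash check in a single pass. The function names, the `min_chars` threshold, the normalization steps, and the decision strings are hypothetical and do not reflect INDW's actual Stage0 rules, which also cover language, PII, toxicity, and licensing.

```python
import hashlib
import unicodedata


def content_key(text):
    """Hash of Unicode-, case-, and whitespace-normalized text (exact-dedup key)."""
    normalized = " ".join(unicodedata.normalize("NFKC", text).lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def stage0(docs, min_chars=20):
    """Yield (doc, decision) pairs; decision is 'accept' or a rejection reason."""
    seen = set()
    for doc in docs:
        text = doc.get("text", "")
        if len(text.strip()) < min_chars:  # cheapest check runs first
            yield doc, "reject:too_short"
            continue
        key = content_key(text)
        if key in seen:  # an identical normalized document was already accepted
            yield doc, "reject:exact_duplicate"
            continue
        seen.add(key)
        yield doc, "accept"
```

Ordering the checks from cheapest to most expensive means most rejected documents never reach the hashing step, which is the general motivation for placing a fast filter ahead of heavier stages.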
When to use INDW
INDW targets operators building production training corpora who need predictable behavior at scale. Use it when you require reproducible acceptance decisions, multi-source mixture control, and backend-agnostic scaling from laptop to cluster.
Run indw validate after any pipeline change. The parity suite verifies that the local and multiprocess backends produce matching output and that the output hash is identical for workers=1 and workers=N.
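For readers who want to spot-check parity by hand, a comparison of exported files can be sketched as follows. This is an independent illustration, not the parity suite itself: `jsonl_digest`, `check_parity`, and the run labels are hypothetical names, and the sketch assumes each run has already written its JSONL export to disk.

```python
import hashlib
from pathlib import Path


def jsonl_digest(path):
    """SHA-256 over the file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_parity(outputs):
    """Return the labels of runs whose digest differs from the first run's digest."""
    digests = {label: jsonl_digest(path) for label, path in outputs.items()}
    baseline = next(iter(digests.values()))
    return [label for label, value in digests.items() if value != baseline]
```

An empty list from `check_parity({"workers=1": path_a, "workers=4": path_b})` means the two exports are byte-identical; any returned label identifies a run that diverged from the baseline.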