
Overview


InstantAI Data Workflow — open-source corpus pipelines for petabyte-scale AI training data preparation.

InstantAI Data Workflow (INDW) is an open-source data pipeline that ingests raw documents, applies multi-stage filtering and semantic cleaning, deduplicates at scale, and emits deterministic JSONL suitable for large-scale model training.

7 pipeline stages
4 execution backends
3 dedup strategies
100% hash-stable output

What INDW does

INDW transforms heterogeneous raw corpora into training-ready datasets. The same configuration and input corpus produce the same acceptance decisions and output hash whether you run on a laptop, a multi-core server, or a Dask cluster.
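One way to picture the hash-stability claim is a sketch that serializes accepted records to canonical JSONL and hashes the bytes. This is not INDW's implementation: the helper names `canonical_jsonl` and `output_hash`, the record fields, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json


def canonical_jsonl(records):
    """Serialize records to JSONL bytes with a fixed key order and separators."""
    lines = [
        json.dumps(rec, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        for rec in records
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def output_hash(records):
    """Return the SHA-256 hex digest of the canonical JSONL bytes."""
    return hashlib.sha256(canonical_jsonl(records)).hexdigest()


# Two hypothetical runs that accept the same documents in the same order,
# but build their dicts with different key insertion order.
run_a = [{"id": "doc-1", "text": "hello"}, {"id": "doc-2", "text": "world"}]
run_b = [{"text": "hello", "id": "doc-1"}, {"text": "world", "id": "doc-2"}]

print(output_hash(run_a) == output_hash(run_b))
```

Sorting keys and fixing separators removes incidental differences between runs, so the hash changes only when the accepted records or their order change.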

Core capabilities

Pipeline stages

| Stage | Purpose |
| --- | --- |
| Ingest | Download from Hugging Face, S3, or local disk; normalize to JSONL |
| Stage0 Filter | Fast rejection: length, language, PII, toxicity, licensing, exact doc dedup |
| Semantic Clean | ACIM artifact discovery, HTML processing, boilerplate removal |
| Knowledge Extract | Section routing, entity extraction, schema normalization |
| Deduplication | Exact hash, MinHash fuzzy, semantic FAISS-based |
| Scheduler | Merge graph, admission tiers, ordered apply, execution backends |
| Store | Corpus registry, atomic I/O, JSONL export |
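To make two of the dedup strategies named in the table concrete, the following is a minimal standard-library sketch of exact-hash dedup and a MinHash similarity estimate. It does not reflect INDW's internals: the function names, the whitespace and case normalization, the 3-word shingles, and the 64 seeded hash functions are all illustrative assumptions.

```python
import hashlib


def exact_key(text):
    """Content hash for exact dedup; case is folded and whitespace collapsed first."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_exact(docs):
    """Keep the first occurrence of each distinct content hash, preserving order."""
    seen, kept = set(), []
    for doc in docs:
        key = exact_key(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Set of k-word shingles used as the document's feature set."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(features, num_perm=64):
    """One minimum per seeded hash function; the seeds stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # blake2b accepts a salt of up to 16 bytes
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for f in features
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
print(dedup_exact(["Hello  world", "hello world", "other"]))
print(estimated_jaccard(a, b))
```

Exact hashing only catches documents that are identical after normalization, while the MinHash estimate lets near-duplicates be compared through short signatures instead of full shingle sets.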

When to use INDW

INDW targets operators building production training corpora who need predictable behavior at scale. Use it when you require reproducible acceptance decisions, multi-source mixture control, and backend-agnostic scaling from laptop to cluster.

Reproducibility first:

Run indw validate after any pipeline change. The parity suite verifies that the local and multiprocess backends produce matching output, and that the output hash is identical between workers=1 and workers=N.
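The idea behind a workers=1 vs workers=N parity check can be sketched in a few lines. This is not the INDW parity suite: the `accept` filter and `run` helper are hypothetical, and a thread pool is used here as a simple stand-in for a multiprocess backend.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def accept(doc):
    """Hypothetical pure filter: keep documents with at least three words."""
    return len(doc["text"].split()) >= 3


def run(docs, workers):
    """Apply the filter with the given worker count and hash the accepted output."""
    if workers == 1:
        decisions = [accept(d) for d in docs]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in input order, whatever the
            # order in which workers finish.
            decisions = list(pool.map(accept, docs))
    kept = [d for d, ok in zip(docs, decisions) if ok]
    payload = "".join(json.dumps(d, sort_keys=True) + "\n" for d in kept)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


docs = [{"id": i, "text": "word " * (i % 5)} for i in range(100)]
print(run(docs, workers=1) == run(docs, workers=4))
```

Parity holds in this sketch because each decision depends only on its own document and results are reassembled in input order; a filter that reads shared mutable state or emits in completion order would likely break it.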

Install and run

```bash
pip install indw
indw doctor
indw merge ./raw ./out/filtered.jsonl --work-dir ./work --workers 2 --fresh
```