Datasets
Configure source registries, mixture weights, and multi-source dataset definitions for INDW pipelines.
An INDW dataset is a collection of JSONL sources, each stored at raw/<name>/data.jsonl. Source registries, mixture weights, and domain maps control how multiple sources are interleaved and balanced during merge.
Source directory convention
The directory name (for example, wiki-en or common-crawl) is the source identifier used in checkpoints, mixture weights, and balance enforcement.
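The convention can be illustrated with a short sketch that walks a raw/ directory and maps each directory name to its data.jsonl file. The helper discover_sources is a hypothetical name used for illustration only, not an INDW function; the temporary layout exists just to make the example runnable.

```python
import tempfile
from pathlib import Path


def discover_sources(raw_root: Path) -> dict[str, Path]:
    """Map each source identifier (the directory name) to its data.jsonl file."""
    sources = {}
    for child in sorted(raw_root.iterdir()):
        data_file = child / "data.jsonl"
        # Only directories that follow raw/<name>/data.jsonl count as sources.
        if child.is_dir() and data_file.is_file():
            sources[child.name] = data_file
    return sources


# Build a throwaway layout that follows the convention, then scan it.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    for name in ("wiki-en", "common-crawl"):
        (raw / name).mkdir(parents=True)
        (raw / name / "data.jsonl").write_text('{"text": "example"}\n', encoding="utf-8")
    source_names = list(discover_sources(raw))

print(source_names)
```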
Source registry
Domain classification maps sources to content domains for balance caps:
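As a minimal sketch, assuming a plain mapping from source identifier to domain: SOURCE_DOMAINS, the domain labels, and sources_by_domain are hypothetical names chosen for illustration and do not reflect INDW's actual registry format.

```python
from collections import defaultdict

# Hypothetical source -> domain assignments (illustrative, not INDW's schema).
SOURCE_DOMAINS = {
    "wiki-en": "encyclopedic",
    "common-crawl": "web",
    "crawl-2024-q1": "web",
}


def sources_by_domain(source_domains: dict[str, str]) -> dict[str, list[str]]:
    """Group source identifiers under the domain they are assigned to."""
    grouped = defaultdict(list)
    for source, domain in source_domains.items():
        grouped[domain].append(source)
    # Sort each group so the result is stable regardless of input order.
    return {domain: sorted(names) for domain, names in grouped.items()}


grouped = sources_by_domain(SOURCE_DOMAINS)
print(grouped)
```

Grouping by domain first makes it straightforward to apply one cap per domain rather than one per source.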
load_source_registry resolves source metadata at merge bootstrap. Domain assignment feeds into BalanceConfig.domain_caps.
Mixture weights
Control interleaving frequency with per-source weights:
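One way to picture weighted interleaving is a smooth weighted round-robin, sketched below. This is an assumption about the general technique, not INDW's merge implementation; the interleave function, the weight values, and the sample records are all hypothetical.

```python
def interleave(sources: dict[str, list], weights: dict[str, float]):
    """Yield (source, item) pairs, visiting sources in proportion to their weight."""
    iterators = {name: iter(items) for name, items in sources.items()}
    credit = {name: 0.0 for name in iterators}
    while iterators:
        total = sum(weights[name] for name in iterators)
        # Every active source earns credit equal to its weight each round.
        for name in iterators:
            credit[name] += weights[name]
        # The source with the most credit is picked and pays back the total.
        name = max(iterators, key=lambda n: credit[n])
        credit[name] -= total
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # Exhausted sources drop out; the rest keep their relative weights.
            del iterators[name]
            del credit[name]


sources = {
    "wiki-en": ["w1", "w2"],
    "common-crawl": ["c1", "c2", "c3", "c4", "c5", "c6"],
}
weights = {"wiki-en": 1, "common-crawl": 3}
order = [name for name, _ in interleave(sources, weights)]
print(order)
```

With a 1:3 weighting, roughly one wiki-en record is emitted for every three common-crawl records until a source runs out.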
Weights are loaded from the orchestration config or derived from the balance domain caps.
Hugging Face source definition
Pass the source definition to FastDatasetPipeline.run() or DatasetDownloader.fetch_all().
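The exact shape of a source definition is not shown here, so the following is only a sketch of what such a definition might carry. HFSourceDefinition and every field name in it (name, repo_id, split, text_field) are assumptions for illustration, not INDW's schema, and the sketch deliberately does not call FastDatasetPipeline or DatasetDownloader.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HFSourceDefinition:
    """Hypothetical shape of a Hugging Face source entry (not INDW's schema)."""

    name: str                 # source identifier, also the raw/<name>/ directory
    repo_id: str              # dataset repository on the Hugging Face Hub
    split: str = "train"      # assumed default split
    text_field: str = "text"  # assumed column holding the document text

    def output_path(self) -> str:
        """Where the downloaded records would land under the raw/ convention."""
        return f"raw/{self.name}/data.jsonl"


wiki = HFSourceDefinition(name="wiki-en", repo_id="wikimedia/wikipedia")
print(asdict(wiki), wiki.output_path())
```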
Source filtering
Process a subset of sources in a merge run:
Multilingual sources
Enable multilingual detection and per-language balance caps:
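The sketch below illustrates the two ideas under stated assumptions: that each source declares a single language, and that caps are fractions of a total document budget. SOURCE_LANGUAGES, the wiki-de source, is_multilingual, and per_language_limits are hypothetical names and do not reproduce INDW's detection logic.

```python
# Hypothetical source -> language declarations (illustrative values only).
SOURCE_LANGUAGES = {
    "wiki-en": "en",
    "wiki-de": "de",
    "common-crawl": "en",
}


def is_multilingual(source_languages: dict[str, str]) -> bool:
    """A corpus counts as multilingual when its sources span several languages."""
    return len(set(source_languages.values())) > 1


def per_language_limits(language_caps: dict[str, float], total_documents: int) -> dict[str, int]:
    """Convert fractional per-language caps into absolute document limits."""
    return {lang: int(cap * total_documents) for lang, cap in language_caps.items()}


multilingual = is_multilingual(SOURCE_LANGUAGES)
limits = per_language_limits({"en": 0.75, "de": 0.25}, total_documents=1000)
print(multilingual, limits)
```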
corpus_has_multilingual_sources detects multilingual corpora and adjusts language gate behavior.
Corpus registry
Track dataset identity across pipeline phases:
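A manifest of this kind could look like the sketch below, which writes and reloads a small JSON file. The write_manifest helper, the field names, the corpus identifier, and the counts are placeholder assumptions made up for the example; they are not INDW's manifest format or real statistics.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(path: Path, corpus_id: str, sources: dict[str, dict]) -> dict:
    """Persist a manifest linking sources to document counts and quality stats."""
    manifest = {
        "corpus_id": corpus_id,
        "sources": sources,
        "total_documents": sum(entry["documents"] for entry in sources.values()),
    }
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    return manifest


# Placeholder numbers, chosen only to make the example concrete.
example_sources = {
    "wiki-en": {"documents": 120, "rejected": 5},
    "common-crawl": {"documents": 300, "rejected": 40},
}

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    write_manifest(manifest_path, "example-corpus", example_sources)
    # A later pipeline phase can reload the manifest to confirm corpus identity.
    reloaded = json.loads(manifest_path.read_text(encoding="utf-8"))

print(reloaded["total_documents"])
```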
The registry persists manifests linking sources, document counts, and quality statistics.
Use descriptive source names that encode content type and language (e.g. wiki-en, crawl-2024-q1). Names appear in reject logs, audit reports, and checkpoint state.