Configuration
Configure INDW pipelines with YAML quality profiles, environment variables, and Python dataclasses.
INDW pipelines are configured through QualityPipelineConfig — a dataclass loaded from YAML quality profiles, environment variables, or Python code. Configuration controls every stage from Stage0 gates through dedup and balance caps.
Configuration sources
| Source | Precedence | Use case |
|---|---|---|
| CLI flags | Highest | --backend, --workers, --fresh |
| Environment variables | Medium | Cluster deployment, CI |
| YAML quality profile | Base | Reproducible pipeline definitions |
| Python dataclass defaults | Lowest | Fallback when no profile loaded |
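The precedence order in the table can be read as a layered merge, applied from lowest to highest. The sketch below is illustrative only and is not the INDW implementation: `deep_merge`, `resolve_config`, and every key in the sample dictionaries are hypothetical, and it assumes each source has already been parsed into a plain dictionary.

```python
from copy import deepcopy
from typing import Any, Dict, Mapping, Optional


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged key by key; any other value is replaced whole.
    """
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(
    defaults: Mapping[str, Any],
    profile: Optional[Mapping[str, Any]] = None,
    env: Optional[Mapping[str, Any]] = None,
    cli: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Apply layers from lowest to highest precedence: defaults, profile, env, CLI."""
    resolved = deepcopy(dict(defaults))
    for layer in (profile, env, cli):
        if layer:
            resolved = deep_merge(resolved, layer)
    return resolved


# Hypothetical keys and values, for illustration only.
defaults = {"backend": "local", "workers": 1, "dedup": {"mode": "exact", "threshold": 0.9}}
profile = {"workers": 4, "dedup": {"mode": "fuzzy"}}
env = {"backend": "dask"}
cli = {"workers": 8}

resolved = resolve_config(defaults, profile, env, cli)
print(resolved)
```

In this sketch the CLI value for `workers` overrides the profile value, the environment sets `backend`, and `dedup.threshold` falls through from the defaults because no higher layer sets it.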
YAML quality profiles
Profiles live under configs/filtering/ in the repository. Load one at runtime:
python
A minimal profile:
yaml
Environment variables
| Variable | Values | Purpose |
|---|---|---|
| INSTANT_PIPELINE_BACKEND | local, thread, multiprocess, dask | Execution backend |
| INSTANT_PIPELINE_EXECUTOR | Same as above | Legacy alias for backend |
| INSTANT_DASK_SCHEDULER | tcp://host:8786 | Dask scheduler address |
| DASK_SCHEDULER_ADDRESS | Same as above | Standard Dask env var |
| INSTANT_MERGE_HW_PROBE | 0 or 1 | Hardware tuning probe |
bash
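As a rough illustration of how the variables in the table might be read, the sketch below resolves a backend and a scheduler address from an environment mapping. It is not INDW's own code. The helper names are hypothetical, and three behaviours are assumptions rather than documented facts: the primary variable is checked before its legacy alias, `INSTANT_DASK_SCHEDULER` is checked before `DASK_SCHEDULER_ADDRESS`, and the fallback backend is `local`.

```python
import os
from typing import Mapping, Optional

# Values listed for INSTANT_PIPELINE_BACKEND in the table above.
VALID_BACKENDS = ("local", "thread", "multiprocess", "dask")


def backend_from_env(env: Mapping[str, str], default: str = "local") -> str:
    """Pick the execution backend from an environment mapping.

    Assumption: INSTANT_PIPELINE_BACKEND wins over the legacy
    INSTANT_PIPELINE_EXECUTOR alias when both are set.
    """
    raw = env.get("INSTANT_PIPELINE_BACKEND") or env.get("INSTANT_PIPELINE_EXECUTOR") or default
    backend = raw.strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {raw!r}; expected one of {VALID_BACKENDS}")
    return backend


def dask_scheduler_from_env(env: Mapping[str, str]) -> Optional[str]:
    """Return the Dask scheduler address, or None when neither variable is set.

    Assumption: the INSTANT_ variable is consulted before the standard Dask one.
    """
    return env.get("INSTANT_DASK_SCHEDULER") or env.get("DASK_SCHEDULER_ADDRESS") or None


# Typical call site: read from the real process environment.
active_backend = backend_from_env(os.environ) if "INSTANT_PIPELINE_BACKEND" in os.environ else "local"
```

Passing the mapping in explicitly, rather than reading `os.environ` inside the helpers, keeps them easy to exercise in CI with a plain dictionary.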
Top-level config sections
| Section | Controls |
|---|---|
| cleaning | Semantic cleaning, artifact discovery, HTML processing |
| thresholds | Quality gates, score floors, signal-based rejection |
| dedup | Exact, fuzzy, semantic dedup modes and thresholds |
| balance | Domain and language mixture caps |
| adaptive_calibration | Warmup reservoir and percentile anchoring |
| synthetic_defense | Synthetic content and span repetition detection |
| semantic_selection | Section-aware content selection |
| curriculum | Stage-weighted curriculum mixing |
| orchestration | LCI routing and mixture orchestration |
| observability | Reject logs, stage profiles, audit events |
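One practical use of the section list is catching misspelled top-level keys before a run starts. The check below is a hypothetical helper, not part of INDW, and it assumes the table above is the complete set of top-level sections, which this page does not state explicitly.

```python
from typing import List, Mapping

# Section names copied from the table above.
KNOWN_SECTIONS = frozenset({
    "cleaning",
    "thresholds",
    "dedup",
    "balance",
    "adaptive_calibration",
    "synthetic_defense",
    "semantic_selection",
    "curriculum",
    "orchestration",
    "observability",
})


def unknown_sections(profile: Mapping[str, object]) -> List[str]:
    """Return top-level keys of a parsed profile that are not known sections."""
    return sorted(set(profile) - KNOWN_SECTIONS)


# A parsed profile with one misspelled section ("dedupe" instead of "dedup").
# The nested keys are placeholders, not real INDW options.
parsed_profile = {"thresholds": {}, "dedupe": {}, "balance": {}}
problems = unknown_sections(parsed_profile)
print(problems)
```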
Config pinning:
Merge runs snapshot the resolved configuration into the work directory. Changing YAML mid-run does not affect an in-progress merge — restart with --fresh to pick up changes.
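The pinning behaviour can be pictured as "write the resolved configuration once, then keep reading the snapshot". The sketch below illustrates that idea under stated assumptions: the snapshot file name, the JSON format, and the `pinned_config` helper are all hypothetical, and the real `--fresh` flag may do more than replace the snapshot.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

SNAPSHOT_NAME = "resolved_config.json"  # hypothetical file name


def pinned_config(work_dir: Path, current: Dict[str, Any], fresh: bool = False) -> Dict[str, Any]:
    """Return the configuration a run should use.

    If a snapshot already exists in `work_dir` and `fresh` is False, the
    snapshot wins and `current` is ignored. Otherwise `current` is written
    as the new snapshot and returned.
    """
    snapshot = work_dir / SNAPSHOT_NAME
    if snapshot.exists() and not fresh:
        return json.loads(snapshot.read_text(encoding="utf-8"))
    work_dir.mkdir(parents=True, exist_ok=True)
    snapshot.write_text(json.dumps(current, indent=2, sort_keys=True), encoding="utf-8")
    return current


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "merge_run"
    first = pinned_config(work, {"workers": 4})                    # first start: snapshot written
    resumed = pinned_config(work, {"workers": 16})                 # YAML edited mid-run: ignored
    restarted = pinned_config(work, {"workers": 16}, fresh=True)   # fresh restart: edit picked up
```

Here `resumed` still carries the original value even though the input changed, while `restarted` reflects the edit.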
PipelineConfigContext
For programmatic access to the active quality spec:
python