
Configuration


Configure INDW pipelines with YAML quality profiles, environment variables, and Python dataclasses.

INDW pipelines are configured through QualityPipelineConfig, a dataclass that can be populated from a YAML quality profile, environment variables, or Python code. The configuration controls every stage, from the Stage0 gates through dedup and balance caps.

Configuration sources

| Source | Precedence | Use case |
| --- | --- | --- |
| CLI flags | Highest | --backend, --workers, --fresh |
| Environment variables | Medium | Cluster deployment, CI |
| YAML quality profile | Base | Reproducible pipeline definitions |
| Python dataclass defaults | Lowest | Fallback when no profile loaded |
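
The precedence order above can be read as a layered lookup: the first source that sets a value wins. The sketch below illustrates that idea only. `resolve_setting` is a hypothetical helper, not part of INDW, and the keys and values in each layer are made up for the example.

```python
from typing import Any, Mapping


def resolve_setting(key: str, *layers: Mapping[str, Any], default: Any = None) -> Any:
    """Return the value for `key` from the first layer that sets it.

    Hypothetical helper for illustration; layers are passed highest
    precedence first.
    """
    for layer in layers:
        if layer.get(key) is not None:
            return layer[key]
    return default


# Illustrative values only; the real keys each source accepts may differ.
dataclass_defaults = {"backend": "local", "workers": 1}
yaml_profile = {"workers": 4}
env_vars = {"backend": "dask"}
cli_flags = {"workers": 8}

# Highest precedence first: CLI flags, environment, YAML profile, defaults.
layers = (cli_flags, env_vars, yaml_profile, dataclass_defaults)
backend = resolve_setting("backend", *layers)
workers = resolve_setting("workers", *layers)
print(backend, workers)  # prints: dask 8
```

Here the environment supplies the backend because no CLI flag sets it, while the CLI value for workers overrides both the profile and the default.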

YAML quality profiles

Profiles live under configs/filtering/ in the repository. Load one at runtime:

python
import yaml
from pathlib import Path
from indw.filter.spec.quality import QualityPipelineConfig
 
# Parse the YAML profile into a plain dict, then build the typed config from it.
raw = yaml.safe_load(Path("configs/filtering/quality_fast_first.yaml").read_text())
cfg = QualityPipelineConfig.from_dict(raw)

A minimal profile:

yaml
meta:
  id: my-pipeline
  purpose: english web crawl, high throughput
 
cleaning:
  enabled: true
  semantic_cleaning: true
  artifact_discovery: true
 
thresholds:
  min_chars: 150
  max_chars: 100000
  min_alpha_ratio: 0.45
  max_boilerplate_score: 0.40
 
dedup:
  exact: true
  fuzzy: true
  fuzzy_threshold: 0.90
  semantic: false
 
balance:
  enabled: true
  domain_caps:
    web: 0.55
    code: 0.20
 
language:
  enabled: true
  fast_detector_only: true
 
pii:
  enabled: true
  mode: structural_only
 
toxicity:
  enabled: true
  mode: rules_only
 
licensing:
  enabled: true
  reject_proprietary: true
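
Before handing a profile to the pipeline, it can help to sanity-check the numeric fields. The following is a minimal sketch that works on the parsed profile as a plain dict. `check_profile` is a hypothetical helper, and the rules it applies (ratios, scores, and caps lie in [0, 1]; min_chars does not exceed max_chars) are assumptions about sensible values, not documented INDW validation.

```python
from typing import Any, List, Mapping


def check_profile(profile: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper. The bounds checked here are assumptions, not
    rules taken from INDW itself.
    """
    problems: List[str] = []
    thresholds = profile.get("thresholds", {})

    # Length bounds should not be inverted.
    if thresholds.get("min_chars", 0) > thresholds.get("max_chars", float("inf")):
        problems.append("thresholds.min_chars exceeds thresholds.max_chars")

    # Collect every field assumed to be a fraction between 0 and 1.
    fractions = {
        "thresholds.min_alpha_ratio": thresholds.get("min_alpha_ratio"),
        "thresholds.max_boilerplate_score": thresholds.get("max_boilerplate_score"),
        "dedup.fuzzy_threshold": profile.get("dedup", {}).get("fuzzy_threshold"),
    }
    for domain, cap in profile.get("balance", {}).get("domain_caps", {}).items():
        fractions[f"balance.domain_caps.{domain}"] = cap

    for name, value in fractions.items():
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{name} is outside [0, 1]")
    return problems


# The numeric fields from the minimal profile above, as a parsed dict.
minimal = {
    "thresholds": {"min_chars": 150, "max_chars": 100000,
                   "min_alpha_ratio": 0.45, "max_boilerplate_score": 0.40},
    "dedup": {"fuzzy_threshold": 0.90},
    "balance": {"domain_caps": {"web": 0.55, "code": 0.20}},
}
print(check_profile(minimal))  # prints: []
```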

Environment variables

| Variable | Values | Purpose |
| --- | --- | --- |
| INSTANT_PIPELINE_BACKEND | local, thread, multiprocess, dask | Execution backend |
| INSTANT_PIPELINE_EXECUTOR | Same as above | Legacy alias for backend |
| INSTANT_DASK_SCHEDULER | tcp://host:8786 | Dask scheduler address |
| DASK_SCHEDULER_ADDRESS | Same as above | Standard Dask env var |
| INSTANT_MERGE_HW_PROBE | 0 or 1 | Hardware tuning probe |

bash
export INSTANT_PIPELINE_BACKEND=dask
export INSTANT_DASK_SCHEDULER=tcp://scheduler:8786
indw merge ./raw ./out/filtered.jsonl --workers 8
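
To make the alias relationship concrete, here is a rough sketch of how the backend and scheduler variables could be read. The function names are hypothetical, and several behaviors are assumptions rather than documented INDW behavior: that the primary variable wins over its alias, that the fallback backend is local, and that values are case-insensitive.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("local", "thread", "multiprocess", "dask")


def backend_from_env(env: Mapping[str, str] = os.environ, default: str = "local") -> str:
    """Read the execution backend, falling back to the legacy alias.

    Assumption: INSTANT_PIPELINE_BACKEND takes priority over
    INSTANT_PIPELINE_EXECUTOR when both are set.
    """
    raw = env.get("INSTANT_PIPELINE_BACKEND") or env.get("INSTANT_PIPELINE_EXECUTOR") or default
    name = raw.strip().lower()
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {raw!r}; expected one of {VALID_BACKENDS}")
    return name


def scheduler_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Read the Dask scheduler address, or None if neither variable is set.

    Assumption: INSTANT_DASK_SCHEDULER takes priority over DASK_SCHEDULER_ADDRESS.
    """
    return env.get("INSTANT_DASK_SCHEDULER") or env.get("DASK_SCHEDULER_ADDRESS")


# Passing a plain dict instead of os.environ keeps the example side-effect free.
example_env = {"INSTANT_PIPELINE_EXECUTOR": "dask",
               "DASK_SCHEDULER_ADDRESS": "tcp://scheduler:8786"}
print(backend_from_env(example_env))    # prints: dask
print(scheduler_from_env(example_env))  # prints: tcp://scheduler:8786
```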

Top-level config sections

| Section | Controls |
| --- | --- |
| cleaning | Semantic cleaning, artifact discovery, HTML processing |
| thresholds | Quality gates, score floors, signal-based rejection |
| dedup | Exact, fuzzy, semantic dedup modes and thresholds |
| balance | Domain and language mixture caps |
| adaptive_calibration | Warmup reservoir and percentile anchoring |
| synthetic_defense | Synthetic content and span repetition detection |
| semantic_selection | Section-aware content selection |
| curriculum | Stage-weighted curriculum mixing |
| orchestration | LCI routing and mixture orchestration |
| observability | Reject logs, stage profiles, audit events |

Config pinning:

Merge runs snapshot the resolved configuration into the work directory. Changing the YAML mid-run does not affect an in-progress merge; restart with --fresh to pick up changes.
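
The pinning behavior can be pictured with a small sketch: write the resolved configuration once, reuse it on resume, and overwrite it only on a fresh start. This is an illustration of the idea, not INDW's implementation. `pin_config`, the snapshot file name, and the JSON format are all assumptions.

```python
import json
import tempfile
from pathlib import Path

SNAPSHOT_NAME = "resolved_config.json"  # hypothetical file name


def pin_config(work_dir: Path, resolved: dict, fresh: bool = False) -> dict:
    """Return the config the run should use, pinning it on first use.

    Hypothetical helper: an existing snapshot wins unless `fresh` is set.
    """
    snapshot = work_dir / SNAPSHOT_NAME
    if snapshot.exists() and not fresh:
        # Resuming: ignore the newly resolved config and reuse the pinned one.
        return json.loads(snapshot.read_text())
    work_dir.mkdir(parents=True, exist_ok=True)
    snapshot.write_text(json.dumps(resolved, indent=2, sort_keys=True))
    return resolved


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    first = pin_config(work, {"thresholds": {"min_chars": 150}})
    # The profile changed mid-run, but the resumed run still sees the snapshot.
    resumed = pin_config(work, {"thresholds": {"min_chars": 300}})
    # A fresh start replaces the snapshot with the new configuration.
    restarted = pin_config(work, {"thresholds": {"min_chars": 300}}, fresh=True)

print(resumed["thresholds"]["min_chars"], restarted["thresholds"]["min_chars"])  # prints: 150 300
```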

PipelineConfigContext

For programmatic access to the active quality spec:

python
from indw.config.resolve import PipelineConfigContext
 
ctx = PipelineConfigContext.resolve(quality_spec="data/filtering/quality_fast_first")
cfg = ctx.quality
gate = ctx.with_quality(cfg).build_gate()

© 2026 Instant Developers. All rights reserved.