Instant DevelopersConfigurationOutputs

Outputs

|||

Configure INDW output paths, JSONL format, token bin export, and work directory artifacts.

INDW produces training-ready JSONL output, optional token bin shards, and operational artifacts in the work directory.

Primary output

The merge command writes filtered JSONL to the specified output path:

bash
indw merge ./raw ./out/filtered.jsonl --work-dir ./work

Each output line is a JSON object with the cleaned document text and pipeline annotations.

Output document schema

FieldTypeDescription
textstringCleaned document body
urlstringSource URL (if present in input)
quality_scorefloatComposite quality score (0–1)
doc_tierintPCI admission tier (0 = accepted)
doc_content_hashstringSHA-256 of normalized content
domainstringContent domain classification
metaobjectPreserved input metadata

Example:

json
{
  "text": "Cleaned document content...",
  "url": "https://example.com/article",
  "quality_score": 0.84,
  "doc_tier": 0,
  "doc_content_hash": "e3b0c44298fc1c14...",
  "domain": "web"
}

Output hash

The merge result includes sorted_output_hash — a deterministic hash of the output content independent of processing order within batches:

python
result = merge_with_quality("./raw", "./out/filtered.jsonl", work_dir="./work")
print(result["sorted_output_hash"])

Use this hash to verify reproducibility across runs, backends, and worker counts.

Append mode

Append new documents to existing output without clearing prior results:

python
merge_with_quality(
    "./raw", "./out/filtered.jsonl",
    append=True,
    resume=True,
    work_dir="./work",
)

Token bin export

Export token-binned training shards for direct dataloader consumption:

python
from indw.store.export.fast_export import export_token_bins_fast
 
export_token_bins_fast(
    "./out/filtered.jsonl",
    "./out/token_bins",
    tokenizer_path="./tokenizer",
)

Build PyTorch dataloaders from memmap streams:

python
from indw.store.export.memmap_stream import build_pretrain_dataloader
 
loader = build_pretrain_dataloader("./out/token_bins", batch_size=32)

Work directory artifacts

ArtifactPathPurpose
Checkpointwork/checkpoint.jsonPer-source line offsets
Resolved configwork/resolved_config.yamlPinned quality profile
Merge lockwork/merge.lockConcurrency guard
Exact dedup indexwork/dedup/exact_index.binHash index state
Fuzzy LSHwork/dedup/fuzzy_lsh.pklMinHash state
Stage profilework/audit/stage_profile.jsonPer-stage timing
Reject logwork/audit/reject_log.jsonlRejection reasons

Licensing provenance

Include licensing metadata in output when needed for compliance:

yaml
licensing:
  enabled: true
  include_provenance_in_jsonl: true

This adds license classification and provenance fields to each output document.

Compression

Install the compress extra for ZSTD-compressed output streams:

bash
pip install "indw[compress]"
Output path on fast storage:

Write output JSONL to local SSD or high-throughput network storage. The apply coordinator serializes writes in document order — slow storage becomes the bottleneck at scale.

© 2026 Instant Developers. All rights reserved.