Data Workflow is now open source

For the past two years, we've been building and refining a distributed data pipeline that handles semantic cleaning, deduplication, and large-scale data ingestion. Today, we are opening it up to developers everywhere.

Why open source?

The best developer tools are built in the open. We are releasing the Data Workflow framework under the Apache 2.0 license so that developers can process datasets locally, build their own custom pipeline stages, and contribute back to the ecosystem.

What's included in this release?

The Core Execution Graph: Our directed acyclic graph (DAG) execution engine.
Stage0 Filtering: High-performance data admission tiers for preliminary cleaning.
Semantic Deduplication: Built-in algorithms for identifying and merging duplicate records at scale.
Developer API: An SDK for writing custom data manipulation stages in Python or TypeScript.
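
To give a feel for how these pieces fit together, here is a minimal, standalone Python sketch. It is not the Data Workflow SDK: every name in it (admission_filter, exact_dedup, run_pipeline) is hypothetical, and it uses only the Python standard library. It assumes a stage is a plain function from a list of records to a list of records, orders stages with graphlib.TopologicalSorter, and substitutes exact matching on normalized text for real semantic deduplication.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

Record = Dict[str, str]
Stage = Callable[[List[Record]], List[Record]]


def admission_filter(records: List[Record]) -> List[Record]:
    """Hypothetical preliminary-cleaning stage: trim text, drop empty records."""
    cleaned = []
    for record in records:
        text = record.get("text", "").strip()
        if text:
            cleaned.append({**record, "text": text})
    return cleaned


def exact_dedup(records: List[Record]) -> List[Record]:
    """Simplified stand-in for semantic deduplication.

    Keeps the first record for each case- and whitespace-normalized text.
    """
    seen: Set[str] = set()
    unique = []
    for record in records:
        key = " ".join(record["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def run_pipeline(
    stages: Dict[str, Stage],
    deps: Dict[str, Set[str]],
    records: List[Record],
) -> List[Record]:
    """Run stages in dependency order.

    Assumption: each stage receives the output of the stage run before it,
    which is only meaningful for a linear chain of stages.
    """
    for name in TopologicalSorter(deps).static_order():
        records = stages[name](records)
    return records


stages = {"filter": admission_filter, "dedup": exact_dedup}
# Each key maps a stage to the set of stages that must run before it.
deps = {"filter": set(), "dedup": {"filter"}}

raw = [
    {"text": " Hello World "},
    {"text": ""},
    {"text": "hello  world"},
    {"text": "Other"},
]
result = run_pipeline(stages, deps, raw)
```

In this sketch, the empty record is dropped by the filter stage and the second "hello world" variant is dropped by the deduplication stage. A real DAG engine would also need to route outputs between stages that have several upstream dependencies, which this linear example deliberately leaves out.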

Getting Started

The full source code, issue tracker, and contribution guidelines are in our official GitHub repository.

If you want to start building right away, head over to our Quickstart Guide to get your first pipeline running locally in under five minutes.

Thank you to our amazing beta testers and early access partners who helped us refine this architecture. We can't wait to see what you build.