We are thrilled to announce that our core data-processing framework is now fully open source and available to the community.
For the past two years, we've been building and refining a highly distributed data pipeline that handles semantic cleaning, deduplication, and data ingestion at massive scale. Today, we are opening the doors to developers everywhere.
The best developer tools are built in the open. By open-sourcing the Data Workflow framework under the Apache 2.0 license, we want to empower developers to process datasets locally, build their own custom pipeline stages, and contribute back to the ecosystem.
You can find the full source code, issue tracker, and contribution guidelines in our official GitHub repository.
If you want to start building right away, head over to our Quickstart Guide to get your first pipeline running locally in under five minutes.
Thank you to our amazing beta testers and early access partners who helped us refine this architecture. We can't wait to see what you build.