Reproducible Data Science that Scales!

Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.

Version control for data

Pachyderm version-controls data much as Git version-controls code. You can track the state of your data over time, backtest models on historical data, share data with teammates, and revert to previous states of data.

Language-agnostic data pipelines

Pachyderm lets you use the tools and frameworks you need, from bash scripts to TensorFlow. You declare what you want to run, and Pachyderm takes care of triggering, data sharding, parallelism, and resource management on the backend.
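As a rough illustration of what "declaring what you want to run" can look like, the sketch below builds a pipeline description as plain data and serializes it to JSON. The field names (`pipeline`, `transform`, `input`, `parallelism_spec`) follow Pachyderm's publicly documented pipeline specification as best understood here and may differ between versions. The helper `make_pipeline_spec`, the repository name `documents`, the image `example/word-count:latest`, and the script path are hypothetical and exist only for this example.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/*", workers=1):
    """Build a declarative pipeline description as a plain dict.

    Hypothetical helper: it only assembles data and does not talk
    to a Pachyderm cluster.
    """
    return {
        # Name under which the pipeline (and its output) is known.
        "pipeline": {"name": name},
        # What to run: any container image and command, in any language.
        "transform": {"image": image, "cmd": list(cmd)},
        # How many workers to run in parallel (assumed field layout).
        "parallelism_spec": {"constant": workers},
        # Which versioned data to read; the glob pattern describes how
        # the input is split into independently processable pieces.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# All names below are placeholders, not real images or repositories.
spec = make_pipeline_spec(
    name="word-count",
    image="example/word-count:latest",
    cmd=["python3", "/app/count.py"],
    input_repo="documents",
    workers=4,
)

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The point of the sketch is that the description says only what to run and on which data; scheduling, sharding, and parallel execution are left to the platform. Consult the pipeline specification for your Pachyderm version before relying on any particular field.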

Why Pachyderm?

Because data scientists should be able to focus on data science, not infrastructure.

Reproducibility

Consistently recreate results from any previous state of your data or analysis.

Data Provenance

Understand every step of the process that produced a given result.

Collaboration

Manage shared data resources and work more effectively as a team.

Incrementality

Build on past results by processing only new data, so unchanged inputs are never recomputed.

Data Scientist Autonomy

Maintain complete control of your data science toolchain choices.

Infrastructure Agnostic

Run in the cloud or on-premises and integrate easily with your current infrastructure.