We built Pachyderm to solve the real needs of real users, and we felt that existing tools didn't reflect these values at all. The goals below are our core tenets: what we believe are the most basic requirements for anyone doing rigorous data analysis.
… Is the ability to consistently reconstruct any previous state of your data and analysis.
Reproducibility is the fundamental principle that enables data science.
The scientific method is a feedback loop that relies on testing hypotheses. To truly test any single hypothesis, you need careful controls so you can evaluate and compare results consistently. More specifically, you need to reproduce the input data, the output data, and the analysis in exactly the same way.
Your input and output data are sacred. They need to be reproducible by you (for your development workflow), by your colleagues (for collaboration), and in production (for sanity checks, debugging, and vetting results). If your data isn't reproducible across all of these contexts, you can't expect any local result to be reproduced elsewhere.
Pachyderm File System (PFS) guarantees that you can version and recall your data consistently. PFS interfaces with your environment the same way whether you're doing local development or production-grade analysis. We leverage existing object stores for highly redundant, performant, and secure data hosting.
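To make "version and recall your data consistently" concrete, here is a toy sketch. It is in no way Pachyderm's actual implementation; it only illustrates the property that matters, namely that every commit is immutable and addressable, so any previous state can be recovered exactly:

```python
import hashlib
import json

class ToyVersionedStore:
    """Toy sketch of versioned data (not Pachyderm's implementation):
    each commit is immutable and addressable by id."""

    def __init__(self):
        self._commits = {}  # commit id -> immutable snapshot

    def commit(self, data):
        # Serialize deterministically and hash to derive a stable commit id.
        blob = json.dumps(data, sort_keys=True).encode()
        commit_id = hashlib.sha256(blob).hexdigest()[:12]
        self._commits[commit_id] = dict(data)
        return commit_id

    def checkout(self, commit_id):
        # Checking out a commit always yields the same content.
        return dict(self._commits[commit_id])

store = ToyVersionedStore()
v1 = store.commit({"train.csv": "a,b\n1,2"})
v2 = store.commit({"train.csv": "a,b\n1,2\n3,4"})
# The original state is still recallable after new data lands:
assert store.checkout(v1) == {"train.csv": "a,b\n1,2"}
```

Because commit ids are derived from content, referring to "the data at v1" means the same bytes for you, your colleagues, and production.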
To test a hypothesis, you need the ability to apply your analysis consistently; otherwise, you cannot compare the expected output to the actual output. Your analysis also needs to execute consistently for you, for your colleagues, and in production.
Pachyderm Processing System (PPS) containerizes your code together with your versioned data sets. That means every input to your analysis (the data, the tools, the execution environment) is controlled, and the results are 100% reproducible.
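For illustration, a Pachyderm pipeline is typically declared as a JSON spec that names both a versioned input repo and the container image holding the analysis code. The repo and image names below are hypothetical, and exact field names can vary between Pachyderm versions:

```json
{
  "pipeline": { "name": "wordcount" },
  "transform": {
    "image": "example-registry/wordcount:1.0",
    "cmd": ["python3", "/code/count.py"]
  },
  "input": {
    "pfs": { "repo": "raw-text", "glob": "/*" }
  }
}
```

Because the image pins the tools and environment while the repo pins the data, re-running this spec reproduces the same result.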
… Is the ability to track any result all the way back to its raw input data, including all analysis, code, and intermediate results.
Where Reproducibility allows us to uniquely reference a particular state of data, Provenance allows us to understand the entire process that produced a specific result.
Results without context are meaningless. At every step of your analysis, you need to understand where the data came from and how it reached its current state.
Understanding your data's Provenance is what makes debugging and designing new analyses possible. When a result is surprising, you can trace back through each analysis step to see how it arose. Without Data Provenance, you can't tell the difference between a meaningful new result and a flaw in the analysis.
In Pachyderm, every output result is fundamentally linked to its input data and analysis code so you can fully trace its origins.
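A toy sketch can make this concrete. Nothing here is Pachyderm's actual API; it only illustrates the idea that each output carries a record of the code version and inputs that produced it, so origins can always be traced:

```python
# Toy provenance tracker (illustrative only, not Pachyderm's API).
results = {}

def run_step(name, code_version, inputs, fn, *args):
    # Record the output together with exactly what produced it.
    results[name] = {
        "output": fn(*args),
        "provenance": {"code": code_version, "inputs": list(inputs)},
    }
    return results[name]["output"]

cleaned = run_step("clean", "clean.py@v1", ["raw-data@commit1"], str.strip, "  7,8,9  ")
total = run_step("sum", "sum.py@v2", ["clean"],
                 lambda s: sum(map(int, s.split(","))), cleaned)

assert total == 24
# Walking provenance backwards recovers the full chain of origins:
assert results["sum"]["provenance"]["inputs"] == ["clean"]
assert results["clean"]["provenance"]["inputs"] == ["raw-data@commit1"]
```

If the final sum ever looks wrong, the provenance records point straight back through "clean" to the raw input commit, which is exactly the backtrace described above.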
… Is the ability for a team of data scientists to effectively manage shared data resources and build on each other’s work.
Reproducibility and Provenance are fundamental building blocks for Collaboration. If you can’t get the same result when you run a colleague’s code on the same data, you cannot possibly collaborate. Similarly, if you can’t retrace the steps of analysis in order to fully understand it, you can’t effectively build on top of it.
But true Collaboration is more than just Reproducibility and Provenance; it also requires software tooling that makes managing shared data resources easy. With the right tools, collaboration should reduce development friction.
Developing analysis is a feedback loop that involves constant iteration between munging and analyzing data. When a team is developing analyses in parallel, resolving changes can be quite difficult: changes to the data content, data format, analysis code, and execution environment all need to be managed.
By offering version control semantics similar to Git's (commits, branches, etc.), the Pachyderm File System (PFS) enables you to fork and merge your data. PFS is designed to let you iterate on both your data's content and its format.
The Pachyderm Processing System (PPS) enables you to truly collaborate on your analysis. By combining your existing VCS for your code with PPS's containerization, you can not only vet your results before merging, but also validate that changes to your analysis are compatible with your colleagues' analysis code.
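As a sketch of what that containerization looks like in practice, an analysis image might be built from a Dockerfile like the one below. The paths and base image are hypothetical; the point is that code checked into your VCS gets packaged so colleagues and production run the exact same environment:

```dockerfile
# Hypothetical layout: analysis code lives in ./analysis in your VCS.
FROM python:3.11-slim
COPY analysis/ /code/
RUN pip install --no-cache-dir -r /code/requirements.txt
CMD ["python3", "/code/run_analysis.py"]
```

A colleague reviewing your change builds the same image from the same commit, so "works on my machine" disagreements disappear.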
Pachyderm offers a unified collaboration mechanism for managing both data and code.
… Is the requirement that your results are synchronized with your input data AND that you perform no redundant processing.
In every large-scale data infrastructure, jobs form a DAG (directed acyclic graph) of data transformation steps, each step depending on the outputs of the steps before it. Typical dependency schedulers, such as Airflow, Luigi, or often just cron, all share one major flaw: they lack incrementality.
Incrementality keeps your analysis in sync with your input data. By tracking exactly which data has been updated, your dependency scheduler can process only the new data. This eliminates redundant processing and lets analyses complete much faster.
Pachyderm File System (PFS) versions each step of your analysis, and Pachyderm Processing System (PPS) only triggers individual steps of your DAG when new data is present. This combination means your DAG of pipelines stays in sync with your data automatically.
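A toy sketch of incrementality (illustrative only, not Pachyderm's internals): a step remembers which inputs it has already seen and recomputes only the diff, so unchanged data is never reprocessed:

```python
# Results already computed, keyed by input filename.
processed = {}

def sync(step, files):
    """Run `step` only on files that haven't been processed yet."""
    new = {name: data for name, data in files.items() if name not in processed}
    for name, data in new.items():
        processed[name] = step(data)  # only new data triggers work
    return set(new)

word_count = lambda text: len(text.split())

files = {"a.txt": "hello world", "b.txt": "one two three"}
assert sync(word_count, files) == {"a.txt", "b.txt"}  # first run: everything

files["c.txt"] = "four"
assert sync(word_count, files) == {"c.txt"}  # second run: only the new file
assert processed == {"a.txt": 2, "b.txt": 3, "c.txt": 1}
```

In a real DAG the same idea applies per pipeline step: a new commit to an input repo triggers only the downstream steps that depend on it, and only over the changed data.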
… Is the ability for data scientists to have complete control over their toolchain and deployment.
There are many robust tools for data analysis, and your choice of tool matters. Whether you choose your tools for productivity and convenience, or for advanced features and optimizations, the choice should be yours.
Pachyderm Processing System (PPS) is built on containers and automated deployment tools. This guarantees your freedom of tool choice: you can deploy a Docker image containing any tools and analysis code. It also guarantees control over deployment: deploying is just a matter of re-using the images you built locally for development. Teams experience a lot of friction when that responsibility and autonomy are absent (see Jeff Magnusson's blog post).
… Is the ability to deploy on your existing infrastructure.
We believe that a data analysis tool needs to be practically useful for everyone; we believe in application, not theory. A tool can be powerful, but if it isn't practical to use, it's effectively useless. And by making the tool independent of the hosting method, you can leverage the work of others, both inside and outside your organization.
We wanted to make it as easy as possible to actually deploy Pachyderm in production, regardless of your current data storage or hosted/on-prem infrastructure. That’s why we make sure that the only requirement for deploying Pachyderm in production is Docker. If you can deploy and run Docker images, then you can use Pachyderm.