Pachyderm provides a platform for scalable, reproducible data processing, which can mean many different things and fit into many different workflows. Below are a few of the use cases Pachyderm currently handles in production.
Developing machine learning pipelines is always an iterative cycle of experimenting, training/testing, and productionizing. Pachyderm is ideally suited for exactly this type of process.
Data scientists can create jobs to explore and process data. Pachyderm lets you down-sample data by mounting it locally, so you can develop your analysis on your own machine without copying any data around.
Building training/testing data sets is incredibly easy with version-controlled data. Since you have all your historical data at your fingertips, you can simply train a model on data from last week and then test it on this week’s data. Getting training/testing pairs involves zero data copying or moving.
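As a rough illustration of the last-week/this-week split described above, the sketch below selects training and testing snapshots by commit date. The `(date, snapshot_id)` tuples and the `split_by_week` helper are hypothetical names invented for this example; they are not part of Pachyderm's API.

```python
from datetime import date, timedelta


def split_by_week(snapshots, today):
    """Return (train, test) lists of snapshot IDs.

    snapshots: list of (commit_date, snapshot_id) tuples (hypothetical shape).
    train covers the 7 days before the most recent week; test covers the
    most recent 7 days up to and including `today`.
    """
    week = timedelta(days=7)
    test_start = today - week
    train_start = test_start - week
    train = [sid for d, sid in snapshots if train_start <= d < test_start]
    test = [sid for d, sid in snapshots if test_start <= d <= today]
    return train, test


# Made-up snapshot history for illustration only.
history = [
    (date(2016, 3, 1), "commit-a"),
    (date(2016, 3, 5), "commit-b"),
    (date(2016, 3, 9), "commit-c"),
    (date(2016, 3, 14), "commit-d"),
]
train_ids, test_ids = split_by_week(history, today=date(2016, 3, 15))
print("train:", train_ids)
print("test:", test_ids)
```

Only snapshot identifiers are selected here; no data is copied, which mirrors the point that a versioned store lets you refer to historical data in place.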
Finally, once your analysis is ready to go, you simply upgrade your job to a Pachyderm Pipeline. Now it’ll automatically run and continue updating as new data comes into the system, letting you seamlessly transition from experimentation to a productionized deployment of your new model that's constantly evolving.
A data lake is a place to dump and process gigantic data sets. This is where you store your nightly production database dumps, all your raw log files, and everything else you could possibly want. The data lake isn't just storage; it also offers data processing capabilities. Martin Fowler has a great blog post describing data lakes.
Generally, a data lake is your source of truth. If you track the provenance of any data analysis all the way to the raw input, you'll end up in the data lake. Often, basic ETL processes run in the data lake and then push the cleaned and organized data into data warehouses (aka marts) for downstream querying and ad-hoc data consumption. Pachyderm is a data lake and can easily push data into many common data warehouses, including Amazon Redshift, Apache Hive, and Elasticsearch.
Databases are generally single-state entities: entries are routinely overwritten, and there is only a limited notion of previous values. With Pachyderm, you can take snapshots of your production databases so that you can revert to any previous state if something goes wrong. Since Pachyderm understands data diffs, you're not copying the entire database dump every time, but instead storing only the diffs of how your data has evolved.
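To make the idea of storing diffs instead of full dumps concrete, here is a minimal sketch that computes a row-level diff between two snapshots and rebuilds the newer one from the older one plus that diff. It assumes each snapshot is a plain `{primary_key: row}` dictionary; `diff_snapshots` and `apply_diff` are hypothetical helpers, and this is not a description of how Pachyderm stores diffs internally.

```python
def diff_snapshots(old, new):
    """Compute a row-level diff between two {primary_key: row} snapshots."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return {"added": added, "removed": removed, "changed": changed}


def apply_diff(old, diff):
    """Rebuild the newer snapshot from the older snapshot plus the diff."""
    rebuilt = {k: v for k, v in old.items() if k not in diff["removed"]}
    rebuilt.update(diff["added"])
    rebuilt.update(diff["changed"])
    return rebuilt


# Made-up table contents for illustration only.
monday = {1: {"name": "ann", "plan": "free"}, 2: {"name": "bob", "plan": "pro"}}
tuesday = {1: {"name": "ann", "plan": "pro"}, 3: {"name": "cy", "plan": "free"}}

delta = diff_snapshots(monday, tuesday)
print(delta)
```

Keeping the first snapshot plus each subsequent `delta` is enough to reconstruct any later state, so unchanged rows are never stored twice.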
You never want to run expensive analytics queries on your production databases; with snapshots stored in Pachyderm, you can run them against the snapshots instead. And since Pachyderm gives you a complete history of all of your data, you can analyze how your data is changing over time to find patterns and insights.
For example, if you have a large table of vendor prices that are constantly being updated, Pachyderm would allow you to analyze vendor pricing trends over time.
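A small sketch of that kind of trend analysis follows. It assumes the historical snapshots have already been loaded as an ordered list of `(label, {vendor: price})` pairs; the data, the labels, and both helper functions are invented for illustration.

```python
def price_history(weekly_tables, vendor):
    """Return [(label, price)] for one vendor across ordered snapshots."""
    return [(label, table[vendor]) for label, table in weekly_tables if vendor in table]


def percent_change(history):
    """Percent change from the first to the last recorded price."""
    first, last = history[0][1], history[-1][1]
    return (last - first) / first * 100.0


# Made-up vendor price snapshots, oldest first.
weekly_tables = [
    ("week-1", {"acme": 10.0, "globex": 20.0}),
    ("week-2", {"acme": 11.0, "globex": 19.0}),
    ("week-3", {"acme": 12.5}),
]

acme = price_history(weekly_tables, "acme")
print(acme)
print(percent_change(acme))
```

A production table only ever holds the latest price, so this comparison is only possible because every earlier snapshot is still available.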
ETL (extract, transform, load) is the process of taking raw data and turning it into a usable form for other services to ingest. ETL processes usually involve many steps forming a DAG (Directed Acyclic Graph): pulling raw data from different sources, teasing out and structuring the useful details, and then pushing those structures into a data warehouse or BI (business intelligence) tool for querying and analysis.
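The DAG structure can be sketched with Python's standard-library `graphlib` (Python 3.9+). The stage names below are hypothetical; each stage maps to the set of stages whose output it consumes, and a topological sort yields an order in which every stage runs after its inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL stages: each key depends on the stages in its value set.
etl_dag = {
    "extract_logs": set(),
    "extract_db_dump": set(),
    "parse_logs": {"extract_logs"},
    "join_users": {"parse_logs", "extract_db_dump"},
    "load_warehouse": {"join_users"},
}

# static_order() raises graphlib.CycleError if the graph is not acyclic.
run_order = list(TopologicalSorter(etl_dag).static_order())
print(run_order)
```

Several valid orderings exist for this graph (the two extract stages are independent), so the exact printed order may vary; what is guaranteed is that no stage appears before any of its inputs.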
Pachyderm completely manages your ETL DAG by giving you explicit control over the inputs for every stage of your pipeline. We also give you a simple API — just read and write to the local file system inside a container — so it’s easy to push and pull data from a variety of sources.
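As a sketch of what "just read and write to the local file system" can look like, the stage below reads every file in an input directory and writes one result file per input to an output directory. The actual mount paths inside the container are not given here, so the function takes them as parameters; the `count_lines` name and the `.count` suffix are assumptions made for this example.

```python
from pathlib import Path
import tempfile


def count_lines(input_dir, output_dir):
    """Write one '<name>.count' file per input file holding its line count."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(input_dir).iterdir()):
        if src.is_file():
            n = len(src.read_text().splitlines())
            (out / (src.name + ".count")).write_text(f"{n}\n")


# Local demo using temporary directories in place of the container's mounts.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "in"
    out_dir = Path(tmp) / "out"
    in_dir.mkdir()
    (in_dir / "access.log").write_text("GET /\nGET /about\nPOST /login\n")
    count_lines(in_dir, out_dir)
    result = (out_dir / "access.log.count").read_text()

print(result.strip())
```

Because the stage only touches ordinary files, the same code can be developed and tested on a laptop and then run unchanged wherever the input and output directories are mounted.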