Pachyderm File System (pfs)

PFS is Git for data. It offers the same version control primitives you're used to using for code, but for massive data sets. You can also think of it as Time Machine for your data: clear historical snapshots of how your data looked at different points in time.


A Pachyderm Data Repository (PDR) is the core organizational primitive for your data, just as a Git repo is for code.

What Git does for collaboration and reproducibility in code, Pachyderm does for your data. Working from a consistent state of the data is essential to understanding it. Data streams from different sources should generally go into different repositories.



What does it mean to make a commit?

As a data scientist, you are constantly modifying your data set. Any manipulation of your data (a CRUD operation) can be encapsulated in a commit. That means the operation is reversible and the new state is reproducible for you and your colleagues. Just like in Git, commits are immutable, so you can always refer back to a previous state of your data.

Commits are made on a per-repository (data set) basis, so different repositories have independent commit histories. You can also create branches off a commit, which works the same way as in Git.
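To make the commit model concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Pachyderm's API: the `Repo` and `Commit` classes, the `master` branch name, and the file paths are all hypothetical and exist only to illustrate immutable commits, per-repository history, and branching off a commit.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: a parent pointer plus the files at that point."""
    commit_id: int
    parent: Optional["Commit"]
    files: Mapping[str, str]  # path -> contents, exposed as a read-only view


class Repo:
    """Hypothetical in-memory repository; each repo keeps its own history."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Branch name -> head commit (None until the first commit is made).
        self.branches: Dict[str, Optional[Commit]] = {"master": None}
        self._next_id = 0

    def commit(self, branch: str, changes: Dict[str, Optional[str]]) -> Commit:
        """Record changes on a branch; a value of None deletes that path."""
        parent = self.branches[branch]
        files = dict(parent.files) if parent else {}
        for path, contents in changes.items():
            if contents is None:
                files.pop(path, None)
            else:
                files[path] = contents
        new = Commit(self._next_id, parent, MappingProxyType(files))
        self._next_id += 1
        self.branches[branch] = new  # only the branch head moves
        return new

    def branch(self, name: str, from_commit: Commit) -> None:
        """Start a new branch whose head is an existing commit."""
        self.branches[name] = from_commit


repo = Repo("sales")
first = repo.commit("master", {"2015/q1.csv": "id,amount\n1,10\n"})
second = repo.commit("master", {"2015/q1.csv": "id,amount\n1,10\n2,25\n"})

# Branch off the first commit and keep working without touching master.
repo.branch("experiment", first)
third = repo.commit("experiment", {"notes.txt": "try a new cleaning step\n"})
```

In this sketch, `first` still holds the original contents of `2015/q1.csv` after `second` is made, which is what makes the earlier state reproducible. A second `Repo` instance would have a completely separate set of commits and branches.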



Files are the underlying mechanism for data storage, just like in Git. This interface is simple and flexible. It’s totally up to you to decide how you want to separate your data into files and organize them in directories.

You never have to worry about how large each file might get. The Pachyderm Processing System (PPS) serves the files in a way that’s easily digestible and scalable.



Just as in Git, the state of each file (and therefore the entire data set) is constructed from the series of diffs that were made to modify that file.

Adding data to your data set is as simple as adding a few lines to a file and committing the change! Pachyderm doesn't save full copies of files on every commit. It stores only the diff, which makes it very space-efficient.
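The idea of rebuilding a file from its diffs can be sketched in a few lines of Python. This is a simplified illustration that assumes append-only changes. The `DiffStore` class and its methods are hypothetical and are not how Pachyderm is implemented or exposed.

```python
from typing import Dict, List


class DiffStore:
    """Hypothetical store that keeps only the data appended in each commit."""

    def __init__(self) -> None:
        # One entry per commit: path -> bytes appended in that commit.
        self.commits: List[Dict[str, bytes]] = []

    def commit(self, appended: Dict[str, bytes]) -> int:
        """Store the appended data and return the new commit's index."""
        self.commits.append(dict(appended))
        return len(self.commits) - 1

    def read(self, path: str, at_commit: int) -> bytes:
        """Rebuild a file by replaying every diff up to and including at_commit."""
        diffs = self.commits[: at_commit + 1]
        return b"".join(diff.get(path, b"") for diff in diffs)

    def stored_bytes(self) -> int:
        """Total bytes actually kept, summed over all diffs."""
        return sum(len(data) for diff in self.commits for data in diff.values())


store = DiffStore()
c0 = store.commit({"events.log": b"login alice\n"})
c1 = store.commit({"events.log": b"login bob\n"})
c2 = store.commit({"events.log": b"logout alice\n"})

# The file as it looked at each commit, reconstructed from the diffs.
for commit_id in (c0, c1, c2):
    print(commit_id, store.read("events.log", commit_id))

# Bytes kept as diffs versus bytes needed for a full copy per commit.
full_copies = sum(len(store.read("events.log", c)) for c in (c0, c1, c2))
print(store.stored_bytes(), "bytes as diffs vs", full_copies, "bytes as full copies")
```

Each commit adds only its new lines to storage, yet any historical version of the file can still be read back by replaying the diffs in order.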

Read about data processing