A Guide to Data Lineage

Data Lineage means knowing, with certainty, the complete journey of your data, code, models, and the relationships between them.

What is Data Lineage?

Data lineage describes the entire life cycle of your data, from start to finish: where it came from, how it was processed, and where it goes. It records what happens to your data as it passes through each transformation and change.

At every step of your pipeline, you need to understand where your data came from and how it got to its current state. While reproducibility lets you point to one particular state of your data, lineage (sometimes also called provenance) lets you understand the entire journey your data took to produce a specific result. In other words, it gives you context. You can trace any surprising result back to the beginning by examining each step of your analysis in detail.

Why is Data Lineage Important?

Understanding the history of your data shows you why your model behaves the way it does. It also simplifies training and makes reproducing experiments much easier, because you can trace any anomaly or strange result to its root cause.

How does Pachyderm Accomplish Data Lineage?

Data Lineage means the ability to track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results.

1. Establish an Origin

In the first stage of the data lineage pipeline, we ingest all of our data into an object store like S3 or MinIO, and the Pachyderm File System (PFS) takes control, labeling and tagging it.

2. Track and Version

Pachyderm tracks any and all changes to your data, keeping immutable versions of each step. This allows you to see any change at will as your data moves through various pipeline stages.

3. Audit and Rollback

Pachyderm lets you quickly audit which change made a difference in your model, and it makes compliance questions easier to answer. Roll backwards and forwards to different points in time so you can always reproduce any result.
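
The three steps above can be illustrated with a small, self-contained sketch. This is not Pachyderm's API; the `VersionedRepo` class and its methods are hypothetical names chosen for illustration, and the store lives in memory rather than in an object store. It only shows the general idea: every commit is kept, so any earlier state can be audited or read back.

```python
import hashlib


class VersionedRepo:
    """Toy append-only store: every commit is kept and never overwritten."""

    def __init__(self):
        self._objects = {}  # content hash -> bytes
        self._commits = []  # ordered snapshots: {file name: content hash}

    def commit(self, files):
        """Store a snapshot of `files` (name -> bytes) and return its commit index."""
        snapshot = {}
        for name, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self._objects.setdefault(digest, data)  # identical content is stored once
            snapshot[name] = digest
        self._commits.append(snapshot)
        return len(self._commits) - 1

    def read(self, commit_id, name):
        """Read a file exactly as it was at the given commit."""
        return self._objects[self._commits[commit_id][name]]

    def diff(self, old, new):
        """Audit: names whose content differs between two commits."""
        a, b = self._commits[old], self._commits[new]
        return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


repo = VersionedRepo()
v0 = repo.commit({"train.csv": b"id,label\n1,0\n"})          # establish an origin
v1 = repo.commit({"train.csv": b"id,label\n1,0\n2,1\n"})     # track a change

print(repo.diff(v0, v1))           # which files changed between the two versions
print(repo.read(v0, "train.csv"))  # roll back: the original bytes are still there
```

Because nothing is ever overwritten, rolling back is just a read against an older commit index.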

Reproducibility is the Key to Better Data Science

When your data, your models and your code are all changing at the same time, how do you keep track of all the versions?

Changing data changes your experiments. If your data changes after you’ve run an experiment, you can’t reproduce that experiment. Reproducibility is the essential bedrock of every data science project. For continually updated models, new data can change the performance of an algorithm as it retrains. Perhaps that new influx of information has outliers, inconsistencies, or corruptions that your team couldn’t see at the outset. Suddenly, a production fraud detection model is showing too many false positives, and customers are calling in upset as their accounts get suspended.

Even a simple change to the underlying data can wreak havoc on reproducible data science.

Industries That Can Benefit From Data Lineage


Life Sciences companies always stand at the cutting edge of technology, whether that’s biotechnology, industrial gene sequencing, or computing. Today, machine learning (ML) and artificial intelligence (AI) adoption has surged inside biotech companies big and small. Just as genetic engineers and lab techs need the repeatable, testable power of the scientific method, data science teams need reproducibility across their AI/ML pipelines so they can recreate all their results. Data versioning helps ensure the safety and efficacy of any findings as their machine learning models discover new drugs, treatments, and promising new experimental pathways.


Technology powers the cars of today and tomorrow. It wasn’t long ago that cars had only analog tech under the hood, but now cars are rolling computers, with everything from anti-lock brakes to adaptive cruise control systems powered by silicon and algorithms. Everything from the design of new cars to the cars themselves employs machine learning from top to bottom. When you’re working with a machine that interacts with the real world, safety and security are paramount. You need the ability to deliver explainable, repeatable, and scalable data science to design a safe and top-selling new car. Whether you’re designing new cars for increased aerodynamics or creating next-gen autonomous vehicles or Advanced Driver Assistance Systems (ADAS), machine learning is now at the heart of the automotive industry.


Choosing a Data Lineage Tool


Most data lineage platforms of the past failed for a few simple reasons. The biggest one is a lack of immutability. If your system logs changes to a dataset but you can alter that dataset without keeping its old versions, then your logging is worthless, because you’ve got a log that points to a snapshot in time that no longer exists.


Metadata logging systems, like a database, keep an audit trail, but they don’t keep the deltas between versions of your data. The data can change out from under you, and then all you have is an entry in the database that no longer reflects the real world.
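
The failure mode described above is easy to reproduce. The following is a minimal sketch with made-up data and hypothetical names (`fingerprint`, `audit_log`, `snapshots`); it does not model any particular logging product. An audit log records a hash of the dataset at experiment time, the live data is then edited in place, and the log entry stops matching anything unless an immutable copy was kept.

```python
import hashlib


def fingerprint(rows):
    """SHA-256 over the rows, in order."""
    h = hashlib.sha256()
    for row in rows:
        h.update(row.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


original = ["1,0.5,ok", "2,0.7,ok"]
digest = fingerprint(original)

# Metadata-only logging: the log records *that* this data was used, nothing more.
audit_log = [{"event": "experiment-run", "data_sha256": digest}]

# Immutable versioning: additionally keep a frozen copy keyed by its hash.
snapshots = {digest: tuple(original)}

# Someone edits the live dataset in place.
working = list(original)
working[1] = "2,9.9,ok"

entry = audit_log[0]
log_matches_live_data = fingerprint(working) == entry["data_sha256"]
snapshot_matches_log = fingerprint(snapshots[entry["data_sha256"]]) == entry["data_sha256"]

print(log_matches_live_data)  # the log alone now points at data that no longer exists
print(snapshot_matches_log)   # the frozen copy can still be recovered and verified
```

With only the log, the mismatch can be detected but the original data cannot be recovered; with the snapshot, the exact input of the experiment is still available.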


As your data science teams grow, it’s more crucial than ever for them to know who made what change and when. If one data scientist can change your data or a model and the rest of the team doesn’t know why it changed, that can set a project back by days or months. Pachyderm allows you to scale teams simply, with robust role-based access control and smooth collaboration across the board.

Data Lineage & Pachyderm

In Pachyderm, lineage & metadata are ironclad. They go hand in hand every step of the way. Every change to your data gets tracked automatically behind the scenes. You can’t go around the system and make a change that renders all your logs and commits worthless, and that makes all the difference when choosing a data lineage platform.

Pachyderm keeps track of every single piece of data -- whether it’s an input, output, parameter, or model binary. All of it gets fully versioned and tracked. Lineage is an inherent property of the data, not a metadata add-on. Even if the change derives from another commit, Pachyderm captures that information to create a provenance/lineage chain that adds up to a powerful “stacktrace” for your data.
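
To make the “stacktrace” idea concrete, here is a minimal sketch of walking a provenance chain. It assumes each commit simply records the ids of the commits it was derived from; the `Commit` class, the `lineage` function, and the repo names and ids are all hypothetical and are not Pachyderm's data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    id: str
    repo: str
    provenance: tuple = ()  # ids of the commits this one was derived from


def lineage(commit_id, commits, depth=0, seen=None):
    """Return an indented, stack-trace-like listing of a commit and its upstream commits."""
    seen = set() if seen is None else seen
    if commit_id in seen:  # shared ancestors are listed only once
        return []
    seen.add(commit_id)
    commit = commits[commit_id]
    lines = ["  " * depth + f"{commit.repo}@{commit.id}"]
    for parent in commit.provenance:
        lines.extend(lineage(parent, commits, depth + 1, seen))
    return lines


commits = {
    c.id: c
    for c in [
        Commit("a1", "images"),                    # raw input
        Commit("b2", "train_code"),                # code is versioned too
        Commit("c3", "features", ("a1",)),         # derived from the raw images
        Commit("d4", "model", ("c3", "b2")),       # derived from features and code
    ]
}

for line in lineage("d4", commits):
    print(line)
```

Starting from the model commit, the listing walks back through the features to the raw images and the training code, which is the kind of trace that lets a surprising result be followed to its inputs.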
