AgBiome uses Pachyderm to streamline genomic data science

AgBiome

AgBiome is a biotechnology company that uses its knowledge of the plant-associated microbiome to create innovative crop-protection products. Recently named one of the 50 leading AgTech companies in the world, the company isolates, catalogs, and sequences the genomes of tens of thousands of microbes each year, from samples gathered around the world, to create more effective and sustainable agricultural products. Its mission is to be the best innovator in crop protection.

“Pachyderm helps us convert our existing data science pipelines from manually managed scripts to scalable, repeatable end-to-end workflows; enabling us to focus more on developing transformative technology to drive agriculture forward instead of wrangling infrastructure.”

- Mauricio Borgen
Director of IT & Scientific Compute

The Challenge

Mauricio Borgen and Charles Pepe-Ranney, members of the computational biology team at AgBiome, needed a way to improve the company’s approach to genomic data science. With a rapidly growing number of new microbes coming in from numerous sources and limited infrastructure to support their research processes, Mauricio and Charles realized that a new approach was necessary. To meet the needs of the business, the team needed to start transitioning their manual processes into automated and repeatable pipelines that could scale.

Instead of innovating and using technology as a force multiplier for the business, the team spent much of their time juggling scripts and analyses. “We had a lot of artisanally crafted, one-off ad-hoc analysis. That doesn't scale,” said Mauricio. He and Charles were therefore determined to transform their custom, heavily manual workflows into iterative, easy-to-assemble pipelines that could scale with the business.

Charles learned about Pachyderm after hearing Daniel Whitenack, a Pachyderm employee, on the Data Skeptic podcast and watching Daniel’s many conference talks on YouTube. He saw its potential to refashion their workflows and get data to bench scientists sooner. After exploring various options, they selected Pachyderm because they found it the most effective platform for delivering standardized, end-to-end data-science pipelines. It could scale on any infrastructure while giving the team the flexibility to adopt new frameworks and languages easily via Docker® containers.

Why AgBiome Chose Pachyderm

Efficiency

AgBiome's collection of microbial genomes already numbers in the tens of thousands and keeps growing. Managing the analysis of those genomes by hand meant spending a lot of time and resources wrangling infrastructure instead of on the team's core expertise, data science. For each individual analysis, the team had to track down the necessary data, and often produce it ad hoc. Pachyderm enables AgBiome to configure repeatable yet modular pipelines that leverage Docker® containers. This means that they can standardize aspects of their pipelines and build prefabricated workflows that run automatically as new data is added to the system.
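
As a rough illustration of what "configuring a pipeline" can look like, the sketch below builds a Pachyderm-style pipeline specification in Python and prints it as JSON. The `pipeline`, `input.pfs` (`repo`, `glob`), and `transform` (`image`, `cmd`) fields follow Pachyderm's documented pipeline spec format, but every name here (the `raw-genomes` repo, the `example/annotator:1.0` image, the `annotate.py` script) is hypothetical and not taken from AgBiome's actual setup.

```python
import json


def make_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Build a Pachyderm-style pipeline spec as a plain dict.

    All argument values used below are illustrative assumptions.
    """
    return {
        "pipeline": {"name": name},
        # The glob pattern controls how input data is split into units of
        # work; "/*" treats each top-level file or directory separately.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The Docker image and command run against each unit of input.
        "transform": {"image": image, "cmd": cmd},
    }


spec = make_pipeline_spec(
    name="annotate-genomes",              # hypothetical pipeline name
    input_repo="raw-genomes",             # hypothetical input data repo
    image="example/annotator:1.0",        # hypothetical Docker image
    cmd=["python3", "/app/annotate.py", "/pfs/raw-genomes", "/pfs/out"],
)

print(json.dumps(spec, indent=2))
```

A spec like this would typically be saved to a file and submitted with `pachctl create pipeline -f <file>`. Once the pipeline exists, new commits to the input repo trigger it, which is the behavior the paragraph above describes as running automatically when new data is added.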

Flexibility

The computational biology team at AgBiome supports around 70 scientists, each with unique requirements and preferences. Mauricio and Charles wanted to streamline these environments without forcing everyone to conform to a single language. Because Pachyderm leverages Docker® containers and is therefore language- and framework-agnostic, data scientists have the flexibility to choose the right tool for the job without adding complexity.

Provenance

Beyond processing large amounts of data, AgBiome needs to keep its results reproducible. Pachyderm was a natural fit given its ability to version-control data, much as Git does for code. This gives AgBiome the ability to track the state of their data over time, backtest models on historical data, share data with teammates, and revert to previous states of data.
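
To make the Git analogy concrete, here is a toy sketch of data versioning in Python. It is not Pachyderm's implementation or API; the `VersionedRepo` class and the file names are hypothetical, and the only point is to show how immutable, content-addressed commits with parent links allow earlier states of the data to be retrieved.

```python
import hashlib
import json


class VersionedRepo:
    """Toy in-memory data repo with Git-like commits (illustrative only)."""

    def __init__(self):
        self.commits = {}   # commit id -> {"parent": id or None, "files": dict}
        self.head = None    # id of the most recent commit

    def commit(self, files):
        """Record a snapshot of `files` (name -> content) and return its id."""
        payload = json.dumps({"parent": self.head, "files": files}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        self.commits[commit_id] = {"parent": self.head, "files": dict(files)}
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id):
        """Return a copy of the data exactly as it was at `commit_id`."""
        return dict(self.commits[commit_id]["files"])

    def history(self):
        """List commit ids from newest to oldest by following parent links."""
        ids, current = [], self.head
        while current is not None:
            ids.append(current)
            current = self.commits[current]["parent"]
        return ids


repo = VersionedRepo()
first = repo.commit({"isolate_a.fasta": "ACGT"})
second = repo.commit({"isolate_a.fasta": "ACGT", "isolate_b.fasta": "TTGA"})

# The earlier state is still retrievable after new data has been added.
old_state = repo.checkout(first)
```

Because each snapshot is kept and linked to its parent, an analysis can be rerun against the exact data it originally saw, which is the property the provenance requirement depends on.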

Scalability

AgBiome wanted the ability to run experiments locally, scale up to larger workloads easily, and leverage the cloud when needed. For AgBiome, Pachyderm was the only platform that provided the data provenance their business requires while also supporting distributed processing and the ability to run both on-premises and in the cloud. Choosing Pachyderm not only lets AgBiome leverage the cloud when needed, but also sets them up for further innovation down the road, such as moving to a serverless environment.

Usability

The Pachyderm dashboard provides a visual interface that makes the concept of pipelines intuitive to everyone involved. Computational biologists can easily see the work they have completed and immediately understand what is happening with their data. They don't have to waste cycles trying to understand complex IT concepts and can instead focus on building pipelines while Pachyderm takes care of the rest.

Conclusion

AgBiome is leading the way in the next agricultural revolution with its innovative use of the plant microbiome at scale. To keep up that pace, they look to Pachyderm to automate their genomics pipelines, converting them from manually managed scripts into scalable, repeatable end-to-end workflows. Their teams can now focus more on developing transformative technology to drive agriculture forward instead of wrangling infrastructure.