General Fusion is developing fusion energy: a clean, safe, abundant and cost-competitive form of power. The company aims to design the world’s first full-scale demonstration fusion power plant based on commercially viable technology.
From their plasma experimentation program, in which large devices called plasma injectors heat hydrogen gas to millions of degrees, General Fusion has collected a large set of complex data from thousands of sensors. While the conditions in which this information was captured may sound extreme, the challenges associated with managing this data, scaling data processing capabilities, and sharing scientific results are similar to those of other modern technology companies.
General Fusion had outgrown its existing data infrastructure. “We needed to evolve our data systems to match our increased analysis and collaboration efforts, and we wanted to leverage well-supported, off-the-shelf technologies that could move, scale and adapt as we grow our data and our company,” said Brendan Cassidy, General Fusion’s Open Innovation Manager. They began the search for an infrastructure provider who could meet their data storage, processing and sharing needs while adhering to two important criteria:
1. Augment (not “rip and replace”) General Fusion’s existing experimental and analysis workflows
2. Facilitate collaboration with external scientific partners through seamless, ad hoc sharing of large sets of experimental data
“We found a limited set of options based on our requirements and pretty quickly narrowed it down to Hadoop and Pachyderm,” said Jonathan Fraser, the General Fusion engineer leading the project. Both solutions can store arbitrary unstructured data and scale to meet computational demand. However, their implementations are very different.
Hadoop is based on the JVM, which would have required General Fusion to rewrite their entire tech stack. “We have a great deal of existing scientific code that we still need. Moving to Hadoop would require a great deal of custom code to connect to our existing toolchain,” said Jonathan.
By contrast, Pachyderm leverages Docker containers so that data scientists can write their analyses in any language and with any libraries they choose. General Fusion was able to migrate to Pachyderm with a fraction of the time and effort that writing custom code and tooling for Hadoop would have required.
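To make the container-based approach concrete, here is a minimal sketch of the kind of analysis step that could be packaged in a Docker image. It is an illustration, not General Fusion's code: the function name `calibrate_readings`, the `gain` parameter, the text file layout, and the environment variable names are all hypothetical. The default paths assume the convention of reading inputs from a mounted directory and writing results to an output directory; the source does not describe the mount points, so treat them as an assumption.

```python
import os
from pathlib import Path


def calibrate_readings(input_dir, output_dir, gain=2.0):
    """Scale every numeric reading in input_dir/*.txt and write it to output_dir.

    Returns the sorted list of file names that were written.
    """
    in_path, out_path = Path(input_dir), Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_path.glob("*.txt")):
        # Each file is assumed to hold whitespace-separated numbers.
        values = [float(token) for token in src.read_text().split()]
        scaled = [value * gain for value in values]
        text = "\n".join(f"{value:.3f}" for value in scaled) + "\n"
        (out_path / src.name).write_text(text)
        written.append(src.name)
    return written


if __name__ == "__main__":
    # Hypothetical mount points (assumption): the platform is expected to
    # mount input data and collect whatever is written to the output path.
    in_dir = os.environ.get("INPUT_DIR", "/pfs/sensor_data")
    out_dir = os.environ.get("OUTPUT_DIR", "/pfs/out")
    if Path(in_dir).is_dir():
        print(calibrate_readings(in_dir, out_dir))
```

Because the step only reads files from one directory and writes files to another, the same pattern would apply to existing scientific code in any language, which is the property the team was looking for.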
While programmers use version control systems such as Git to manage and collaborate on a shared codebase, an additional level of complexity exists for data scientists who work with both code and data. Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other's analysis.
“The true tipping point in our decision to use Pachyderm was its version control features for managing our data,” said Jonathan. “Our researchers no longer have to copy data locally or worry about a calibration update changing the underlying data while they’re analyzing it.”
Data in Pachyderm is versioned much as code is managed in Git. It is organized into repositories where users can create commits (immutable snapshots), view diffs, and add or manipulate files as in any standard file system.
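The idea of commits as immutable snapshots can be sketched with a toy in-memory model. This is not Pachyderm's implementation or API: the `DataRepo` class, its methods, and the file paths are hypothetical, chosen only to show why a later calibration update cannot change the data a researcher is already analyzing.

```python
import hashlib
from types import MappingProxyType


class DataRepo:
    """Toy repository: every commit is a read-only snapshot of all files."""

    def __init__(self):
        self._commits = {}  # commit id -> {"parent": id or None, "files": mapping}
        self.head = None

    def commit(self, changes, deletions=()):
        """Apply changes on top of the current head and return the new commit id."""
        files = dict(self._commits[self.head]["files"]) if self.head else {}
        files.update(changes)
        for path in deletions:
            files.pop(path, None)
        # The id is derived from the parent id and the file contents.
        digest = hashlib.sha256(str(self.head).encode())
        for path in sorted(files):
            digest.update(path.encode())
            digest.update(hashlib.sha256(files[path]).digest())
        commit_id = digest.hexdigest()[:12]
        self._commits[commit_id] = {
            "parent": self.head,
            "files": MappingProxyType(files),  # read-only view of the snapshot
        }
        self.head = commit_id
        return commit_id

    def read(self, commit_id, path):
        """Read a file exactly as it was in the given commit."""
        return self._commits[commit_id]["files"][path]

    def diff(self, old_id, new_id):
        """List the paths added, removed, and modified between two commits."""
        old = self._commits[old_id]["files"]
        new = self._commits[new_id]["files"]
        return {
            "added": sorted(set(new) - set(old)),
            "removed": sorted(set(old) - set(new)),
            "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
        }


repo = DataRepo()
first = repo.commit({"shot_001/calibration.txt": b"gain=1.0",
                     "shot_001/probe.csv": b"1,2,3"})
second = repo.commit({"shot_001/calibration.txt": b"gain=1.1",
                      "shot_002/probe.csv": b"4,5,6"})
# An analysis pinned to `first` still sees the original calibration.
print(repo.read(first, "shot_001/calibration.txt"))
print(repo.diff(first, second))
```

An analysis that records the commit id it read from can be rerun later against exactly the same bytes, regardless of what has been committed since.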
Pachyderm also provides complete data lineage (also known as provenance) for every piece of data throughout the cluster. Every data transformation is tracked, so that any result is fully reproducible and verifiable, an important consideration for any organization that relies on accurate analysis.
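Lineage tracking can be pictured as a graph in which every derived result records the code version and the inputs that produced it. The sketch below is a simplified illustration under that assumption, not Pachyderm's data model; `Record`, `LineageGraph`, and the record names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    code_version: str = ""  # empty for raw source data
    inputs: tuple = ()      # names of the records this one was derived from


class LineageGraph:
    """Toy provenance store: maps each record to what it was derived from."""

    def __init__(self):
        self._records = {}

    def add_source(self, name):
        self._records[name] = Record(name)

    def add_derived(self, name, code_version, inputs):
        missing = [i for i in inputs if i not in self._records]
        if missing:
            raise KeyError(f"unknown inputs: {missing}")
        self._records[name] = Record(name, code_version, tuple(inputs))

    def provenance(self, name):
        """Return the sorted names of every upstream record of `name`."""
        seen = set()
        stack = list(self._records[name].inputs)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._records[current].inputs)
        return sorted(seen)


graph = LineageGraph()
graph.add_source("raw@c1")
graph.add_source("calibration@c7")
graph.add_derived("calibrated@a3", "calibrate:v2", ["raw@c1", "calibration@c7"])
graph.add_derived("temperature_fit@b9", "fit:v5", ["calibrated@a3"])
print(graph.provenance("temperature_fit@b9"))
```

Walking the graph upstream from a result yields every input that contributed to it, which is the information needed to rerun and verify that result.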
With Pachyderm, the General Fusion team can stay focused on plasma physics instead of designing and maintaining big data systems. The combination of language-agnostic infrastructure and version-controlled data allows them to efficiently develop and iterate on their data analysis.
Pachyderm is committed to bringing a new paradigm of data infrastructure to the big data community through its open source platform and professional services. “What surprised us the most about our new infrastructure was the value the Pachyderm team brought to our deployment,” said Brendan. “Pachyderm developers have been committed to helping us transition to our new system and to adding functionality to meet our needs.” Pachyderm brings the perfect combination of advanced technology to solve modern data science challenges and flexible support services for efficient deployment.