Digital Reasoning is a communication analytics company that uses machine learning and AI to help its customers address some of the world’s toughest challenges. With customers ranging from law enforcement and intelligence agencies to leading healthcare institutes and top financial firms, the company uses data and powerful machine learning models to uncover fraud, detect cancer, and combat human trafficking. As it set out to design a next-generation architecture for its clients’ complex, high-stakes needs, Digital Reasoning evaluated Pachyderm to see how version-controlled data and containerized data pipelines could help it achieve explainable, repeatable, scalable data science.
“In one day, we ended up automating the entire process all the way to the reporting. The data preprocessing, the hyperparameter search, model tuning, model selection, and even some of the inference testing were all automated using Pachyderm.”
- Jimmy Whitaker
Manager of Applied Research
To deliver highly effective machine learning solutions to its customers, Digital Reasoning must process large volumes of disparate, disorganized, and seemingly unrelated information that constantly changes. Jimmy Whitaker, Manager of Applied Research, and his team use this data to develop complex models that detect key patterns and information in the sprawling array of intangible connections between people, places, and events. The team must constantly balance agility at scale against the communication overhead of collaborating with clients. They need an architecture that delivers machine learning models that are both explainable and easily reproducible.
During an internal hackathon, Whitaker, a group of Digital Reasoning developers, and an intern set out to build the next-generation architecture for the company’s deep learning workloads. The team had two goals: (1) make their constantly changing data behave in the same version-controlled manner as code, and (2) run on modern, scalable infrastructure. For the infrastructure, the team selected Kubernetes; for its data science platform, it chose Pachyderm.
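In Pachyderm, versioned data lives in repositories and each processing step is declared as a pipeline spec that reads from those repos. As a minimal sketch of what such a spec looks like (the repo name, container image, and script below are hypothetical illustrations, not Digital Reasoning’s actual setup):

```json
{
  "pipeline": { "name": "transcribe" },
  "description": "Hypothetical example: transcribe every audio file committed to the 'audio' repo",
  "input": {
    "pfs": { "repo": "audio", "glob": "/*" }
  },
  "transform": {
    "image": "example.com/transcriber:latest",
    "cmd": ["python3", "/transcribe.py", "/pfs/audio", "/pfs/out"]
  }
}
```

Submitted with `pachctl create pipeline -f transcribe.json`, a pipeline like this runs its container on every new commit to the input repo and versions its output, which is what ties each result back to the exact data that produced it.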
Using Pachyderm and Kubernetes, the team built scalable, repeatable, and explainable data science pipelines in just one day. Pachyderm enables the team to continuously ingest its constantly changing data end-to-end — with complete provenance and without sacrificing agility.
While the team initially set out to build just one pipeline, by the end of the hackathon they had multiple pipelines set up for different use cases. For their audio research use case, they built an end-to-end pipeline that analyzed audio files all the way through the transcription process, output the transcripts, and then applied the natural language processing components they were developing to those transcripts, continuing all the way through to inference testing. Things went so smoothly that they even expanded into building pipelines for image analysis. And it didn’t end there.
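Chaining stages like this follows naturally from Pachyderm’s model: each pipeline’s output becomes a versioned repo that downstream pipelines can take as input. As an illustrative sketch (the names and image are again hypothetical), an NLP stage consuming a transcription stage’s output could be declared as:

```json
{
  "pipeline": { "name": "nlp" },
  "description": "Hypothetical example: run NLP components over the transcripts produced upstream",
  "input": {
    "pfs": { "repo": "transcribe", "glob": "/*" }
  },
  "transform": {
    "image": "example.com/nlp:latest",
    "cmd": ["python3", "/analyze.py", "/pfs/transcribe", "/pfs/out"]
  }
}
```

Because the upstream repo is versioned, every downstream result carries provenance back through the transcripts to the original audio commit, and revising a stage with `pachctl update pipeline` lets Pachyderm rerun the affected stages against the versioned data.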
Whitaker’s team took the new architecture a step further, integrating Jupyter notebooks into the process so its research engineers and data scientists could easily apply changes at any point in their pipeline and watch the impact in real time as Pachyderm automatically applied those changes. “Pachyderm allows us to look at and component-ize the entire pipeline of analytics and transformations we run. For complex systems, this is incredibly useful to understand the big picture before jumping into the code. We can then easily dive into a specific component to address the needs of a project we are working on,” says Whitaker.
Pachyderm helped Digital Reasoning’s data scientists unearth new insights and carry out rapid experimentation without sacrificing speed or functionality. “There’s always a tension between agility, interpretability, and reproducibility, but Pachyderm makes that tension manageable,” adds Whitaker. Pachyderm enables the company’s data scientists to efficiently overcome obstacles, handle data divergence, and generate reproducible outputs. With billions of dollars — and even lives — at stake, Digital Reasoning is always on the lookout for new ways to build accurate models for its clients that inform smart decisions using the best architecture and platforms available.