Data Scientists Share What Technologies They’re Using to Build Their Data Pipelines — and Why

Written by Alton Zenon III
Published on Jan. 31, 2020

Good data scientists are not jacks-of-all-trades; they’re specialists. 

Two data scientists at the top of their field can have entirely different engineering skill sets and work with totally different tools, platforms and programming languages. Scale that idea across the industry, and it becomes clear that data scientists can take vastly different approaches to building their data pipelines.

For MassMutual’s Head of Data Engineering, Mukesh Sharma, growth this year means investing in the most cutting-edge data science tools available without having to refactor old code. His team uses Amazon Web Services (AWS) Step Functions to combine multiple AWS services into serverless workflows, building new applications quickly and updating them without writing custom integration code. Sharma’s team has also found that favoring platforms over bespoke solutions helps operations scale: platforms are easier to customize and more sustainable than one-off technologies.

At machine learning solution provider Neural Magic, automation is the hot trend for 2020. ML Lead Mark Kurtz said his team uses tools like Jenkins and Data Version Control (DVC), which automate the team’s model-training pipelines and make its model and dataset versioning reproducible. Harnessing automation frees his team to spend more time on research and on integrating with customers’ deep learning frameworks.

Sharma and Kurtz gave Built In a behind-the-scenes look at how they’re building their data pipelines for scale. In addition to using cutting-edge data science tools, the two tech professionals highlighted the importance of hiring top-notch engineering talent and giving teammates the agency to lower operational burdens as they see fit.

Building Data Pipelines

Building and scaling efficient data pipelines is essential to the health and success of any startup. When it comes to building a pipeline, companies must focus on using the tools, and hiring the engineering specialists, that best fit the data they’re trying to collect.

 

Neural Magic

Following a $15 million seed funding raise in November 2019, Neural Magic announced it was expanding its engineering department. But building out a team of talented developers isn’t the only way ML Lead Mark Kurtz is scaling operations. Kurtz said that in 2020 he will be researching, and possibly deploying, solutions like Uber’s Michelangelo, a platform for building and deploying machine learning systems at scale.

 

What tools are you using to build your data pipeline, and why did you choose those technologies?

We help customers improve the cost and performance of their own pipelines. The majority of our customers’ pipelines are homegrown, while some are built on well-established, market-leading platforms. We recently worked with a customer to integrate with their SageMaker pipeline. Our goal is to use whatever our customer is using to execute deep learning frameworks like PyTorch, TensorFlow and Keras.
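For illustration only: an integration like the SageMaker one Kurtz mentions typically goes through the SageMaker Python SDK’s framework estimators. A minimal sketch, in which the training script, role ARN, instance type, version pins and S3 paths are all hypothetical placeholders rather than details from the interview:

```python
# Sketch: launch a PyTorch training job on a customer's SageMaker pipeline.
# Every name, version and ARN below is a hypothetical placeholder.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # customer-owned training script (hypothetical)
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # single-GPU training instance
    framework_version="1.3.1",  # illustrative version pins
    py_version="py3",
    hyperparameters={"epochs": 10, "batch-size": 64},
)

# Train against data already staged in S3; SageMaker provisions the
# instance, runs train.py, and tears the instance down afterward.
estimator.fit({"training": "s3://example-bucket/datasets/train/"})
```

The same pattern applies to the SDK’s TensorFlow estimator, which is what makes it practical to meet customers on whichever framework they already run.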

Internally, we research and implement models in all the aforementioned deep learning frameworks, and others, while using Data Version Control (DVC) on top of Bitbucket for model and dataset versioning. Our own servers are connected to Jenkins, which automates the pipelines that train deep learning models to the performance our customers desire. If more scale is needed, we turn from our servers to cloud providers like AWS and Google Cloud Platform.
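As a rough sketch of what DVC-backed dataset versioning looks like in practice, the snippet below uses DVC’s Python API to read a pinned revision of a dataset from a Bitbucket-hosted repo; the repo URL, file path and tag are invented for illustration:

```python
# Sketch: read one pinned revision of a DVC-tracked dataset.
# The repo URL, path and tag are illustrative placeholders.
import dvc.api

with dvc.api.open(
    path="data/train.csv",  # file tracked by DVC (hypothetical)
    repo="https://bitbucket.org/example-org/ml-models",  # Git repo holding the .dvc metadata
    rev="v1.2.0",  # Git tag pinning the exact dataset version
) as f:
    print(f.readline())  # e.g., inspect the CSV header
```

A Jenkins job can then retrain a model reproducibly by checking out a tag and pulling exactly the data that tag references.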

Our goal is to use whatever our customer is using to execute deep learning frameworks.”

 

As you scale, what steps are you taking to ensure your data pipeline continues to scale with the business?

Hiring the right engineering talent is crucial. It ensures we have enough resources and capacity to scale horizontally across customer needs.

When necessary, we connect to cloud data centers to ensure scale. We make our processes automated and repeatable with Jenkins and DVC. In the near future, we are looking at more involved pipeline solutions, like Uber’s Michelangelo, that can scale across multiple data centers and data science teams.

 

MassMutual

As a life insurance, retirement and investment solution provider, MassMutual serves over 5 million clients, a feat that requires working with a massive amount of data. Head of Data Engineering Mukesh Sharma said his team uses AWS Simple Storage Service (S3), with its virtually limitless storage capacity, to house that data efficiently.

 

What tools are you using to build your data pipeline, and why did you choose those technologies?

Our technology and tool choices are informed by deliberate evaluation against well-defined criteria. Our philosophy is to choose the best-of-breed for each component of the stack that meets our extensibility, portability and scalability requirements. This lets the engineering team iterate and evolve as new or better tooling becomes available, without major refactoring of the architecture.

Our current data pipeline follows the extract, load and transform (ELT) paradigm: serverless workflows built on AWS Step Functions and Lambda move raw source data into a data lake (the AWS S3 object store) and a massively parallel processing analytics engine (Vertica in Eon Mode). Curation and transformation steps are executed in SQL within the Vertica warehouse, taking advantage of data proximity and elastic scaling.
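Sharma doesn’t share implementation details, but a two-step ELT workflow of this shape can be written in the Amazon States Language and registered with boto3. In the sketch below, the Lambda function ARNs, the role ARN and the state machine name are hypothetical placeholders:

```python
# Sketch: register a minimal two-step ELT workflow with AWS Step Functions.
# All ARNs and names below are hypothetical placeholders.
import json

import boto3

definition = {
    "Comment": "ELT: land raw data in the S3 data lake, then transform in the warehouse",
    "StartAt": "ExtractAndLoad",
    "States": {
        "ExtractAndLoad": {
            "Type": "Task",  # Lambda copies raw source data into S3
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract_and_load",
            "Next": "TransformInWarehouse",
        },
        "TransformInWarehouse": {
            "Type": "Task",  # Lambda submits curation SQL to the analytics engine
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run_transform_sql",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="elt-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)
```

Because each state is its own Lambda function, individual steps can be swapped or updated without reworking the workflow itself, in line with the no-major-refactoring goal Sharma describes.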

Our philosophy is to choose the best-of-breed for each component of the stack.”

 

As you scale, what steps are you taking to ensure your data pipeline continues to scale with the business?

Our strategy for scaling is to build platforms instead of bespoke solutions. We also empower the teams using the platform to self-serve on core capabilities, supported by standards, patterns and reusable modules. At an infrastructure level, we leverage the cloud’s elastic scaling as much as possible to lower the operational burden.

 

Responses have been edited for length and clarity. Images via listed companies.
