Data Pipelines: Behind The Scenes - The Big Role Of Data Preparation

A discussion of data preparation pipelines may not sound exciting, but their ability to quietly help solve some of the most pressing issues of the day warrants attention. In health care, access to massive amounts of real-world data offers the potential for important discoveries, such as understanding the impact of COVID-19 on subsets of the population, developing effective interventions, or tracking the side effects of new therapeutics and drugs like the COVID-19 vaccines. Because the raw data arrives in many formats and with many quality issues, companies and researchers struggle to access this information and make it useful for their efforts.

Cloud-based platforms with “smart data pipelines” are being constructed to help transform data so that it can be efficiently used to extract insights. There are many functions that take place within these pipelines, so let’s shed some light on how they work.

How Data Pipelines Work

Data pipelines, like the ones used on the Wayfinder platform, are built to automate functions that improve the timeliness and quality of the massive amounts of data that flow through them. Think of the pipeline as a ready-made process that automates the complex logic needed to improve data quality and to standardize and integrate the data. That logic includes aligning to common column names, conforming data values to appropriate data types, eliminating or repairing bad values, and standardizing addresses. Data engineers build these pipelines with varying degrees of utility, depending on the use case. The pipelines must be able to accommodate a large volume of data and process it quickly enough to keep transformation timely. For example, Kythera has built its pipelines to accommodate billions of rows of historical claims information and to append new claims on a daily basis. Additionally, our pipelines have embedded machine learning algorithms that correct and fill in missing data, making the data more accurate and robust.
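To make those functions a little more concrete, here is a minimal Python sketch of three of them: aligning to common column names, conforming values to data types, and nulling out bad values. It is an illustration only, not Kythera's implementation. The alias table, the column names, and the two accepted date formats are all assumptions made up for this example.

```python
from datetime import datetime

# Hypothetical vendor column names mapped to one common name.
COLUMN_ALIASES = {
    "claim_id": "claim_id",
    "clm_id": "claim_id",
    "service_date": "service_date",
    "svc_dt": "service_date",
    "billed_amount": "billed_amount",
    "billed_amt": "billed_amount",
}

def parse_date(value):
    """Return a date for the formats we assume vendors send, else None."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except (ValueError, AttributeError):
            continue  # wrong format, or value was not a string
    return None

def parse_amount(value):
    """Return a non-negative float, or None when the value is unusable."""
    try:
        amount = float(str(value).replace("$", "").replace(",", ""))
    except ValueError:
        return None
    return amount if amount >= 0 else None

def standardize_row(raw):
    """Align column names, conform types, and null out bad values."""
    row = {}
    for name, value in raw.items():
        common = COLUMN_ALIASES.get(name.strip().lower())
        if common is not None:  # unrecognized columns are dropped
            row[common] = value
    row["service_date"] = parse_date(row.get("service_date"))
    row["billed_amount"] = parse_amount(row.get("billed_amount"))
    return row
```

Calling standardize_row on a record with the keys "CLM_ID", "svc_dt", and "Billed_Amt" yields a record keyed by claim_id, service_date, and billed_amount, with the date and amount converted to typed values. A production pipeline would apply the same idea across whole tables rather than one row at a time.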

Data Transformation

Ultimately, the point of the pipeline is to transform data. Most organizations working with big data sets rely on a process like ETL (Extract-Transform-Load), in which data is extracted from its source and imported, then transformed by cleaning, standardizing, deduplicating, and other tasks, and finally loaded into its destination. The key difference with a platform like Wayfinder is that there is no need for an ETL tool and all the configurations that go with it, since Wayfinder’s pipelines are built natively on the Databricks unified platform, creating a data lakehouse environment for further analysis and for combining additional data. A data lakehouse makes it possible to collect data in all its formats, structured and unstructured, and to run analytics on it in one place at a lower operating cost than traditional data warehouses.
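For readers who have not seen the generic ETL pattern described above, the following minimal Python sketch walks through its three stages using only the standard library. It illustrates the conventional pattern, not how Wayfinder works. The sample rows, the column names, and the use of an in-memory SQLite database as the destination are all assumptions chosen to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Stand-in for a source file; the rows are invented for illustration.
SOURCE_CSV = """claim_id,payer,amount
C1, acme health ,100.00
C2,Beta Care,250.50
C1, acme health ,100.00
C3,Beta Care,not-a-number
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean, standardize, and deduplicate."""
    seen, cleaned = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        claim_id = r["claim_id"].strip()
        if claim_id in seen:
            continue  # keep only the first occurrence of each claim
        seen.add(claim_id)
        payer = " ".join(r["payer"].split()).title()  # tidy spacing and case
        cleaned.append((claim_id, payer, amount))
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims "
        "(claim_id TEXT PRIMARY KEY, payer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # hypothetical destination
load(transform(extract(SOURCE_CSV)), conn)
```

Of the four source rows, the duplicate C1 and the unparseable C3 are dropped during the transform stage, so only two clean rows reach the destination.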

Having high-quality master data is critical when users want to build models or combine data. Let’s use healthcare claims as an example. Claims data can come from a variety of vendors, such as clearinghouses, aggregators, and payers. Each of these vendors has its own relational data model. A critical step is consolidating each data vendor’s relational model into a standardized data lake model. Other processes that take place include address standardization, NPI (National Provider Identifier) validation, tokenization, and other data cleaning and enhancing functions. Once the real-world data are normalized, cleaned, and standardized, Kythera applies machine learning algorithms to build clean and highly accurate mastered assets, like provider, practitioner, and payer directories, which are fed back into our data pipelines and accessed through our Wayfinder platform.
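As one example of the validation steps mentioned above, an NPI carries a built-in check digit: the ten digits must pass the Luhn formula once the prefix 80840 is placed in front of them. The short Python sketch below checks only that format and check digit. The function name is our own, and this is not necessarily how Kythera validates NPIs. Confirming that an NPI has actually been issued to a provider would require a lookup against the national registry, which this sketch does not attempt.

```python
def is_valid_npi(npi):
    """Check a 10-digit NPI string using the Luhn formula with the 80840 prefix."""
    if not (isinstance(npi, str) and len(npi) == 10
            and npi.isascii() and npi.isdigit()):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left, doubling every second digit,
    # starting with the digit just left of the check digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

A check like this is cheap enough to run on every row, so malformed identifiers can be flagged or repaired before the data is used to build provider directories.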

Because discovery and innovation decisions are made based on the output of these pipelines, it is critical that what is happening in these pipelines is visible and can be monitored. Kythera pipelines monitor skew in key columns to detect vendor delivery issues before they impact downstream analysis.
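One simple way to picture this kind of monitoring is to compare how values in a key column are distributed in a new delivery against a trusted baseline, and to raise an alert when the two differ too much. The Python sketch below does that with total variation distance. It is an assumption-laden illustration: the article does not say how Kythera measures skew, and the function names, the choice of metric, and the 0.2 threshold are all hypothetical.

```python
from collections import Counter

def value_shares(values):
    """Return each distinct value's share of the column."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column is empty")
    return {value: n / total for value, n in counts.items()}

def distribution_shift(baseline, delivery):
    """Total variation distance: 0 means identical shares, 1 means no overlap."""
    base, new = value_shares(baseline), value_shares(delivery)
    keys = set(base) | set(new)
    return 0.5 * sum(abs(base.get(k, 0.0) - new.get(k, 0.0)) for k in keys)

def check_delivery(baseline, delivery, threshold=0.2):
    """Flag a delivery whose column distribution moved past the threshold."""
    shift = distribution_shift(baseline, delivery)
    return {"shift": shift, "alert": shift > threshold}
```

For instance, if a state column is normally split evenly between two states and a new delivery arrives with 90 percent of rows from one of them, the shift is about 0.4 and the delivery is flagged for review before it reaches downstream analysis.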

As data continues to grow in volume and the number of decisions that depend on it explodes, smart data pipelines help organizations spend less time on deploying, cleaning, and wrangling data and more time on the activities that are essential to finding solutions and innovating. If you have questions about data pipelines, please reach out here or connect on LinkedIn.

Matt Ryan

Engineering/Co-Founder
Matt leads the data engineering team and is responsible for architecture, engineering, and technical operations at Kythera Labs. He has over 30 years of experience in software development and enterprise architecture, including big data environments for healthcare, finance, and telecommunications. Matt has been recognized by Databricks as one of their top 10 innovators.