Data Quality In Machine Learning

There are many ways we at Kythera work to provide data of the highest quality to our customers. We restructure and reformat data from a variety of sources to produce data assets with consistent and unified structures. We match and combine data from different sources, addressing contradictions as we do so, to ensure that as much relevant information as possible is available for each record. We resolve healthcare events of different types, such as surgeries and imaging events, from the raw claims data to provide a greater level of insight into those sub-domains of healthcare. Finally, we remove incorrect information, impute correct information, and infer the existence of missing healthcare events to increase confidence that our data is correct and complete. Our goal is to provide as accurate and complete data as possible so our clients can have a high level of confidence when using our data. 

In this post, I’ll focus particularly on the value these last three transformations provide in the context of machine learning. I’ll begin by defining three attributes by which data quality can be assessed before reviewing the methods Kythera uses to improve the quality and quantity of its data and how this improves the quality of the machine learning models trained on that data. Throughout this post I will refer to Kythera’s Surgical Cases Asset to illustrate how these improvements are put to use. Our Surgical Cases Asset shows surgical activity nationwide, resolved to an accurate site of service, and includes details about referrals, providers, and discharges, among other information.

Data Requirements of A Machine Learning Model

Before looking more closely at how Kythera’s data can improve the quality of a machine learning model, we first need to understand how erroneous data impacts the quality of a machine learning model.

In simple terms, a machine learning model will search for patterns in the data it’s given that may help it predict its target variable. The model is therefore only as good as the data used to train it, applying whatever patterns it observed in the training data to make predictions about new records it’s given. If the training data is of poor quality, the model will fail to generalize well and will therefore make poor predictions when it’s deployed in the real world. 

The quality of data may be assessed in numerous ways, with correctness, missingness, and representativeness being the most significant. Each of these attributes can affect the model in a different way. I’ll explain each with reference to a simple example of a model that predicts the sales volume for different brands of cars based on a car’s attributes.


Correctness, naturally, indicates whether the data is correct, which may be true to a greater or lesser degree depending on the type of data being considered and how it is to be used. For example, if we list a car as electric when it is in fact diesel powered, the data is false. If we list a car’s miles per gallon as 25mpg when it is actually 20mpg, the data is inaccurate. Though the distinction is subtle, the two affect models differently. Incorrect data will, to some degree, obscure the real world relationships our model is attempting to learn from the data. It may directly contradict those relationships, or it may make them harder for the model to identify. In either case, incorrect data is likely to have a negative impact on model performance.
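To make the effect of inaccurate values concrete, here is a minimal sketch using the car example. The numbers are invented for illustration: it assumes sales rise by exactly 10 units per extra mpg, then fits a straight line once on the true mpg values and once on a copy in which two mpg values were recorded incorrectly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical ground truth: sales rise by 10 units per extra mpg.
true_mpg = [20, 25, 30, 35, 40]
sales = [100 + 10 * m for m in true_mpg]

# The same five cars, but the second and fourth mpg values were recorded wrongly.
recorded_mpg = [20, 35, 30, 25, 40]

clean_slope, _ = fit_line(true_mpg, sales)
noisy_slope, _ = fit_line(recorded_mpg, sales)

print(f"slope learned from correct data:   {clean_slope}")
print(f"slope learned from incorrect data: {noisy_slope}")
```

With only two of five values wrong, the fitted line already understates the relationship between mpg and sales: the real pattern is still present, but the model sees a weaker version of it.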


Missingness indicates how complete the data is in all its attributes. If we are predicting sales volume based on a car’s color, manufacturer, mpg, and power type, but some of the records we are using to train our model lack the mpg for a particular car, then our data is incomplete. In such a case, we must choose whether to discard the records which are missing data for that attribute, impute the missing data, or discard the mpg variable from our data entirely. In the first case, our model receives less data from which to learn the real world relationship we’re trying to predict. In the second case, it receives data that may be incorrect. In the third case, we lose an entire variable that may contain information relevant to what we are trying to predict. In all cases, model performance may be negatively impacted.
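The three choices can be sketched in a few lines of plain Python. The records below are hypothetical, and mean imputation is used only as the simplest possible stand-in for an imputation method.

```python
from statistics import mean

# Hypothetical training records; None marks a missing mpg value.
cars = [
    {"color": "red", "manufacturer": "A", "mpg": 25, "power": "petrol"},
    {"color": "blue", "manufacturer": "B", "mpg": None, "power": "diesel"},
    {"color": "black", "manufacturer": "A", "mpg": 35, "power": "petrol"},
    {"color": "white", "manufacturer": "C", "mpg": None, "power": "electric"},
]

# Option 1: discard records that are missing mpg (fewer rows to learn from).
dropped_rows = [car for car in cars if car["mpg"] is not None]

# Option 2: impute the missing values, here with the mean of the observed ones
# (the filled-in value may be wrong for any individual car).
observed_mean = mean(car["mpg"] for car in cars if car["mpg"] is not None)
imputed = [
    {**car, "mpg": car["mpg"] if car["mpg"] is not None else observed_mean}
    for car in cars
]

# Option 3: discard the mpg variable entirely (all rows kept, one feature lost).
dropped_column = [{k: v for k, v in car.items() if k != "mpg"} for car in cars]

print(len(dropped_rows), "rows remain after dropping incomplete records")
print("imputed mpg for the electric car:", imputed[3]["mpg"])
print("columns after dropping mpg:", sorted(dropped_column[0]))
```

Note that the second option assigns the electric car the average mpg of two petrol cars, which shows how a generic imputation can introduce exactly the kind of incorrect data described above.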


Representativeness indicates whether the data is germane to the problem at hand; that is, whether it is both related to what we are trying to predict and representative of the diversity present in the real world. If the data is not related to what we are trying to predict, we will likely find ourselves with a very poor model. For example, if we use a variable such as whether or not a car had foam dice hanging from the rear-view mirror, it is unlikely our model will make accurate predictions on the basis of this feature alone. We are instead much more interested in the car’s mechanical specifications and its major cosmetic components, as these are more likely to have a tangible impact on a car’s sales volume. Alternatively, if the data is not representative of the real world situation in which we plan to use it, we may find ourselves with an exceptional model, but not one that is fit for our purpose. For example, say our training data contains only cars manufactured by Japanese companies, but we want to use the model in a situation where we only sell domestic cars. The resulting model may be the best in the world at predicting sales volume for Japanese cars, but it won’t be of much use to us given our context.

Fortunately, Kythera seeks to address all three of these metrics with the transformations we make to our data. Let’s now look at each improvement in turn and see how they help with these three data quality requirements.

Machine Learning Models Require Sufficient Quantity of Data

In our Surgical Cases Asset (surgical activity nationwide), Kythera addresses the issue of data quantity (specifically, the quantity of quality data) by inferring the existence of surgical events in situations where surgical claims are missing from our data set. We do this by using information found on claims not specific to a surgical procedure. With our methodology, we’ve been able to increase the number of surgical encounters present in our asset when compared with those that can be identified from the raw data alone. Significantly increasing the amount of data has a positive effect on the second and third data quality metrics outlined above: missingness and representativeness.

Missingness, as defined above, is a characteristic of individual records, so it is not directly affected by adding records to our data. Additional records do, however, mitigate one of the downsides of missingness: the information cost of removing records containing missing data. Typically, machine learning models do not improve linearly as the amount of data available increases. They improve quickly at first and then that improvement plateaus as the marginal value of each new record, in information terms, decreases. With additional data, you can be more selective about which data you use whilst still maximizing model performance. Records with missing data or with lower data confidence can be discarded and complete, high-confidence data can be used in their place.
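One simple way to see why the marginal value of a record falls is to look at the standard error of a sample mean, which shrinks with the square root of the sample size. This is only an analogy, on the assumption that a model's error behaves roughly like an estimation error; real learning curves vary by model and problem, and the spread value below is hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 10.0  # hypothetical spread of the quantity being estimated
sizes = [100, 1_000, 10_000, 100_000]
errors = [standard_error(sigma, n) for n in sizes]

# Improvement gained by each tenfold increase in the amount of data.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]

for n, err in zip(sizes, errors):
    print(f"n = {n:>7}: error ~ {err:.3f}")
print("gain per tenfold increase:", [round(g, 3) for g in gains])
```

Each tenfold increase in data buys a smaller improvement than the one before it, which is why a large dataset leaves room to discard lower-quality records at little cost.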

More data also improves the representativeness of the data used to train a model. Having a larger sample of data increases the chances that the data accurately represents what the model is being used to predict. It adds diversity to the data, making the model applicable in a wider range of situations (should this be desired); it dilutes the effect of noisy data, which obscures the true relationship between the predictors and the predicted; and it reduces the chances of overfitting - an issue where a model has been optimized for a training dataset that is not representative of the real world - meaning your model will perform better once it’s deployed. This is especially true when a neural network is used - a powerful machine learning technique that can outperform simpler techniques but which typically requires a large amount of data to do so and can overfit on smaller datasets due to its complexity.

Machine Learning Models Require Quality Data

More data alone does not help machine learning models; the data must also be complete and correct. Kythera also works to improve the completeness and correctness of our data assets, which I’ll illustrate using the example of the rendering provider’s NPI field in our Surgical Cases Asset. We first check whether the data in this field is correct - is it possible that the given provider could perform the procedure that they’ve been listed as performing on the claim? If not, such as in the case where a cardiologist is listed as performing an orthopedic surgery, we remove this incorrect information. Next, for all instances where the data is missing, including those instances where we’ve removed the data because it was very likely to be incorrect, we attempt to identify a more appropriate surgeon. If we succeed in doing so with sufficient statistical confidence, we impute this new value.
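The check, remove, and impute sequence can be sketched as follows. This is not Kythera's actual method, which the post does not describe in detail: the specialty lookups, NPI values, candidate list, and confidence threshold are all hypothetical, and a simple share-of-votes rule stands in for whatever statistical test is really used.

```python
from collections import Counter

# All names, NPIs, mappings, and the threshold below are hypothetical.
PLAUSIBLE_SPECIALTIES = {"knee_replacement": {"orthopedic_surgery"}}
PROVIDER_SPECIALTY = {
    "NPI_001": "cardiology",
    "NPI_002": "orthopedic_surgery",
    "NPI_003": "orthopedic_surgery",
}
MIN_CONFIDENCE = 0.8


def is_plausible(npi, procedure):
    """Could this provider plausibly have performed this procedure?"""
    allowed = PLAUSIBLE_SPECIALTIES.get(procedure, set())
    return PROVIDER_SPECIALTY.get(npi) in allowed


def clean_rendering_npi(case, candidate_npis):
    """Remove an implausible NPI, then impute one if the evidence is strong.

    candidate_npis: NPIs seen on related claims for the same case (assumed input).
    """
    npi = case["rendering_npi"]
    if npi is not None and not is_plausible(npi, case["procedure"]):
        npi = None  # Step 1: remove information that is very likely incorrect.

    if npi is None:
        # Step 2: look for a more appropriate surgeon among plausible candidates.
        plausible = [c for c in candidate_npis if is_plausible(c, case["procedure"])]
        if plausible:
            best, count = Counter(plausible).most_common(1)[0]
            if count / len(plausible) >= MIN_CONFIDENCE:
                npi = best  # Step 3: impute only with sufficient confidence.
    return {**case, "rendering_npi": npi}


# A cardiologist listed on an orthopedic surgery, with strong evidence for NPI_002.
case = {"procedure": "knee_replacement", "rendering_npi": "NPI_001"}
strong = clean_rendering_npi(case, ["NPI_002"] * 9 + ["NPI_003"])
weak = clean_rendering_npi(case, ["NPI_002", "NPI_003"])
print(strong["rendering_npi"], weak["rendering_npi"])
```

When the evidence is split evenly between two surgeons, the field is left empty rather than filled with a guess, so the implausible value is removed without a low-confidence replacement being introduced.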

Naturally, this addresses the data quality metrics of correctness and missingness. We now have a dataset which has had some incorrect data removed, some incorrect data corrected, and some missing data imputed. But is it not possible to achieve this result using traditional methods commonly employed in machine learning? Not necessarily.

The methods Kythera uses for identifying the incorrect information require subject matter expertise, something that data scientists or machine learning engineers are sometimes lacking. It is quite possible that a data scientist would not recognize the issues with the data they’re handling in a domain as complex as healthcare or how to correct them. If this is the case, the data scientist must rely on generic methods, such as outlier detection, for identifying incorrectness. While certainly usable, a generic technique like this is likely to yield poorer results than a solution that incorporates the subject matter expertise. Subject matter experts not only provide input into what raw data to use, they also provide guidance on how to best impute missing values and evaluate the accuracy of this information, whether through business rules or purposefully selected statistical measures.

The situation is similar for missingness. Unlike incorrectness, missingness is easily identified and in most cases will not require any knowledge of the working domain (with the only exception being pseudo-missing data). However, handling missingness appropriately is not necessarily so easy. A wide variety of generic statistical techniques exist for imputing missing data, but not all will be applicable to fields such as rendering NPI, and few, if any, will be as accurate as a bespoke solution designed by subject matter experts. This, then, is where the work that Kythera does to address incorrectness and missingness is so valuable. We combine our deep understanding of statistical and machine learning techniques with tailored solutions informed by decades of experience in the healthcare domain working with claims-based data. The ability to analyze a model, inspect the relationships between variables, and hypothesize why a particular variable is important can provide great value to the end user.


Machine learning in healthcare applications benefits from having both quality data in sufficient quantity and the input of subject matter expertise. We’ve seen the importance of this effort in the context of machine learning, but even if you simply want to answer questions using healthcare data without machine learning, having more quality data means you can expect better results.

If you have any questions about this post or would like to learn more about our data and data science-enabled platform, Wayfinder, get in touch.


Ben Scoones

Senior Data Scientist
Ben manages Kythera's data and process monitoring alongside his work on various research and development projects. Ben graduated from UCL with a Master's in History and Philosophy of Science with a focus on causal epistemology. He is an accomplished data scientist and enjoys applying his philosophical background and learning to the many problems data science presents.