Truth in Data, Part Two

Whether used for research, commercialization, marketing or business development, life sciences companies and healthcare providers recognize that longitudinal patient journey information is essential. But the usefulness and insights gleaned from patient journey data depend largely on its essential DNA and the quality of the underlying data used to construct that journey.

Let’s take a look at an example from outside healthcare that illustrates how we think about the quality of patient journey data. Two individuals walk into a lumber yard and purchase the raw materials needed for a project; in this case, the raw material is wood. Builder One constructs a simple bookcase for his home. It may hold books, but the base is wobbly, and the shelves are uneven, so the books fall over. Builder Two also constructs a bookcase, but this one is well built and even has art-worthy design touches. Both have used the same raw material (wood), but the outcome differs.

Taking this example back to the patient journey, most patient journey information starts with the same raw materials, including raw claims data, EHR data, lab results, prescription data and other healthcare data. Depending on how skillfully those raw materials are worked, some patient journey insights will be of better quality, resulting in much higher confidence in the story the journey tells.

What Could Go Wrong

We will use raw claims data as an example to illustrate the myriad ways healthcare data can lead you astray. (Many of the same issues exist whether the source is raw claims data, EHR data, lab data, etc. We are using raw claims data as our example for brevity.)

A claim is a mechanism to be paid for services rendered. While a single claim consists of hundreds of discrete values, called elements, only those fields that are required for reimbursement are essential for inclusion, and even then, these values are riddled with problems. If the patient journey data you are considering has incorrect or missing elements or is biased, you end up with a journey that may take you down the wrong path.

Common Problems to Watch for

Claims data can be purchased from a variety of sources. If those sources are not upfront about the quality of the data (not just quantity because a bunch of poor data won’t get you closer to your answer), how can you be confident in what that data is telling you? What could go wrong?

Often, we hear users of healthcare data say they need to move fast to answer their business-related questions. Unfortunately, you may be moving fast but in the wrong direction. Let’s briefly look at some of the problems we’ve encountered in our years of working with and correcting claims data.

Data Suppression

When purchasing claims data, you should verify that the information needed to accomplish your objectives is actually available in the data set. One common cause of missing data is data suppression. Sources suppress data for various reasons, such as contractual obligations (specific payers not allowing their data to be sold) or the elimination of small values (between 1 and 10) for things like discharges, admissions, and other information. In some cases, statistical methods can infer the suppressed data; for other missing data, acquiring new data sources to fill in known gaps may be a prudent strategy. For example, Kythera uses advanced statistical methods and machine learning to first evaluate each data source and element. Then we determine the best way to either fill in gaps or document the missingness.
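
To make the small-value case concrete, here is a minimal Python sketch of how a data user might document suppression before relying on a total. It is an illustration only, not Kythera's method: the function name `summarize_suppression`, the use of `None` to mark a suppressed cell, and the sample numbers are all assumptions. The only detail taken from the text above is that suppressed values fall between 1 and 10.

```python
from typing import Optional, Sequence


def summarize_suppression(
    counts: Sequence[Optional[int]], low: int = 1, high: int = 10
) -> dict:
    """Document suppressed cells and bound the true total.

    `None` marks a suppressed cell (an assumed convention). Each suppressed
    cell is assumed to hide a value between `low` and `high`, inclusive.
    """
    known = [c for c in counts if c is not None]
    n_suppressed = len(counts) - len(known)
    known_total = sum(known)
    return {
        "suppressed_cells": n_suppressed,
        "suppressed_share": n_suppressed / len(counts) if counts else 0.0,
        "total_low": known_total + n_suppressed * low,
        "total_high": known_total + n_suppressed * high,
    }


# Hypothetical discharge counts for five facilities; two are suppressed.
discharges = [120, None, 45, None, 300]
summary = summarize_suppression(discharges)
print(summary)
```

A range like this does not recover the suppressed values, but it shows how much uncertainty suppression adds to a figure before any statistical inference is attempted.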

Missing Sources or Missing Claims

Another problem commonly encountered when using claims data is not weighing the impact things like direct billing and closed systems have on your data. For example, most data sets do not have patient-level coverage for closed systems like Kaiser Permanente. In several regions of the country, closed networks impact the full visibility of facilities and provider networks. It is vital to understand what percentage of the market may be missing from these types of “closed networks.” Direct billing, in which a health system or practice bypasses the clearinghouse and sends its claims for payment directly to a payer, is another cause of claims that may be missing entirely from a data set.

Bad Patient Tokens and Duplicates

When purchasing anonymized data, a patient token/hash is assigned to ensure the privacy of patient identity. The patient token is key to identifying unique patients, successfully joining data sets, and ultimately using the data for applications like constructing patient journeys. Healthcare claims data always contain suspect patient tokens, and it is crucial to identify and correct them when possible. We have seen vendors say they have data for millions of claims that actually turn out to be the result of duplicate tokens. Kythera executes a machine learning process to identify suspect tokens with a high degree of confidence and correct them at scale. This topic deserves a blog of its own so stay tuned for a better understanding of the pervasiveness of suspect patient tokens and how we address this industry-wide problem.
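
As a rough illustration of what checking tokens can look like, the Python sketch below applies two simple rules: drop exact duplicate claims, then flag any token tied to more than one set of demographics. This is a deliberately simplified, rule-based stand-in and not the machine learning process mentioned above. The field names (`claim_id`, `token`, `birth_year`, `sex`) and the sample records are hypothetical.

```python
from collections import defaultdict


def drop_duplicate_claims(claims):
    """Keep the first occurrence of each (claim_id, token) pair."""
    seen = set()
    unique = []
    for claim in claims:
        key = (claim["claim_id"], claim["token"])
        if key not in seen:
            seen.add(key)
            unique.append(claim)
    return unique


def flag_suspect_tokens(claims):
    """Return tokens linked to more than one (birth_year, sex) combination."""
    demographics = defaultdict(set)
    for claim in claims:
        demographics[claim["token"]].add((claim["birth_year"], claim["sex"]))
    return {token for token, seen in demographics.items() if len(seen) > 1}


# Hypothetical records: c1 appears twice, and tokB mixes two different people.
claims = [
    {"claim_id": "c1", "token": "tokA", "birth_year": 1980, "sex": "F"},
    {"claim_id": "c1", "token": "tokA", "birth_year": 1980, "sex": "F"},
    {"claim_id": "c2", "token": "tokB", "birth_year": 1975, "sex": "M"},
    {"claim_id": "c3", "token": "tokB", "birth_year": 2001, "sex": "F"},
]

unique_claims = drop_duplicate_claims(claims)
suspect_tokens = flag_suspect_tokens(unique_claims)
print(len(unique_claims), suspect_tokens)
```

Rules like these catch only the most obvious cases; they say nothing about one patient split across several tokens, which needs a more careful approach.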

Open Versus Closed Claims

Open claims data refers to data sets from sources such as clearinghouses, labs, EHRs, etc., while closed claims refer to data sets that come directly from payers/insurers. There are advantages and disadvantages to each type of claim source.

Closed claims have complete information from the payer for all events during a patient’s enrollment period, but they may also carry payer bias or make it difficult to understand the complete market. Furthermore, closed claims may limit longer-term longitudinal views of patients since subscribers move into and out of health plans over time.

Open claims offer a broader view of patient interactions and the healthcare market without time restrictions. One downside to using open claims is that data sources do not have 100% of claims, so gaps can exist in the data. It is essential to know what those gaps are and have a way to address them.

Data users should first understand what they are trying to accomplish and then determine whether open or closed data sets (or a combination of the two) will provide a better outcome.

Input Error

Generating a claim requires human input, and no human is perfect when typing or filling out forms. Simple errors include inputting information in the wrong field, transposing fields, typos, misspellings, and entering inaccurate information, like incorrect and inconsistent addresses. Data users often do not know when this information is incorrect. Is the vendor correcting these errors? Is the claim discarded? Or is the claim passed on to the user with all its faults? Wayfinder, Kythera’s cloud-based data science platform, corrects and enhances data so users have access to more complete and accurate data.

Inaccurate Information and Methodology

Again, since a claim is a reimbursement mechanism, we often encounter two common “problem” fields: Place of Service (POS) and Site of Care. Place of Service is defined as the setting of care in which service was provided. It is represented on a professional claim by a two-digit numerical code assigned by CMS. Examples of Place of Service include Walk-In Retail Clinic (Code 17), Inpatient Hospital (Code 21), and Skilled Nursing Facility (Code 31). It should be straightforward, right? Unfortunately, we often see POS indicating a facility that is inconsistent with other facility information on the claim.
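
A simple consistency check of this kind can be sketched in a few lines of Python. The three codes and labels come from the paragraph above; everything else is an assumption for illustration, including the `facility_type` field (standing in for whatever other facility information appears on the claim) and the exact-match comparison, which real data would rarely support without normalization.

```python
# The three POS codes named in the text; a real check would load the full CMS code set.
POS_LABELS = {
    "17": "Walk-In Retail Clinic",
    "21": "Inpatient Hospital",
    "31": "Skilled Nursing Facility",
}


def check_pos(claim):
    """Compare the POS code with a (hypothetical) facility_type field.

    Returns "consistent", "inconsistent", or "unchecked" when the code is
    not in the lookup table.
    """
    expected = POS_LABELS.get(claim["pos_code"])
    if expected is None:
        return "unchecked"
    return "consistent" if expected == claim["facility_type"] else "inconsistent"


# Hypothetical claims for illustration.
sample_claims = [
    {"pos_code": "21", "facility_type": "Inpatient Hospital"},
    {"pos_code": "31", "facility_type": "Inpatient Hospital"},
    {"pos_code": "99", "facility_type": "Inpatient Hospital"},
]

results = [check_pos(c) for c in sample_claims]
print(results)
```

Flagging an inconsistency is only the first step; deciding which field is wrong still requires other evidence from the claim.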

Site of Care is defined as the specific location where the care was delivered. For example, a Site of Care may be Riverside Urgent Care or Mount Sinai Hospital Queens. Vendors frequently omit this field, making it nearly impossible to accurately assess leakage, market share, referrals, and other critical strategic insights.

The address field is another potentially confusing data element. At its core, the address field tells a payer where to send a check, which is not always where care was delivered. And there are times when services delivered at multiple locations are on the same claim. So, we get a claim with multiple addresses or one address (maybe a centralized office, but not the practice location) in multiple fields. Either way, the address field is another source of inaccurate or confusing information that makes it challenging to accurately determine where care was delivered.

There are dozens of complicating examples of when Place of Service and Site of Care can be incorrect. It is vital to know if these common problems have been identified and corrected in the data sets you are using. Read our Site of Care blog to dive deeper.


Missingness

Missingness is a term that describes missing data as a whole and, more specifically, the completeness of the data across all attributes. We have written about missingness in several prior blog posts. We won’t cover it here, but please read the posts here and here. Understanding missingness and how it is corrected is critical for users of claims data.

Understanding Leads to Better Outcomes

These examples highlight how claims data can be incorrect and can cause problems. At worst, they can lead to an incorrect strategy or be a costly failure. To a lesser degree, they can lead to a lack of confidence in the data when making decisions or even inaction. Whether you are a healthcare researcher or analyst or an executive responsible for Strategy, Branding, Commercialization, or Marketing, we encourage you to understand your data and the potential shortcomings to get the result you seek. Please reach out if you want to continue this discussion and learn more about how Kythera improves data quality.


Jeff McDonald


Jeff is a serial entrepreneur and growth leader who successfully envisioned and developed analytical products and platform technologies to empower growth. He has more than 20 years of experience in the healthcare industry, combining his technology, innovation, and analytic product development experience with his conviction in the power of teamwork to help organizations succeed.