There’s no shortage of buzz around the promise of Generative AI to revolutionize data workflows. But realizing that potential requires applying AI where it can drive real impact, and Kythera Labs’ focus is on simplifying the complex, technical work of healthcare data integration, management, and analysis.
Siloed data and difficulty integrating different data formats are pervasive problems; silos restrict the use of data across the enterprise, and widely varied data formats complicate data integration, limiting its utility. A common data model (CDM) breaks down these silos, brings together data from different sources, simplifies data complexity, and organizes enormous datasets to make integration and utilization possible. Kythera Labs has developed a dynamic CDM that not only standardizes the data but also works like a feedback loop, self-informing and improving the data by interacting with hundreds of data points. The CDM itself is important because it allows data from various sources to be harmonized and used as a cohesive whole.

A key challenge is identifying how data from each source can be transformed to fit this data model while retaining its integrity and preserving any unique elements it may contain. Historically, this required hours of data exploration by a human with the technical skill to query the data, the industry knowledge to understand the nuances of the particular data source, and the expertise to know how it connects to the common data model. The exploration phase is then followed by data engineering to build the pipeline that integrates the source into the CDM; end to end, the process typically takes hours and potentially days. Generative AI can greatly accelerate, and potentially enhance, both the data exploration and data pipelining phases.
In the exploration phase, when provided with proper context and appropriate tools, AI agents can query and explore the data to generate what is essentially an intelligent data profile of the new source. From this profile, the agent identifies distinctive elements of the dataset as well as elements overlapping with the CDM, producing the requirements to transform the raw data into the CDM. These agents can be built to operate either fully autonomously or with a human in the loop who provides real-time feedback and challenges conclusions or assumptions. Both approaches provide significant value, and choosing between them depends on which resource or outcome is most highly valued.
We chose to design the agent to behave autonomously for two reasons. First, introducing a human in the loop would require a more complex, real-time system architecture with pre-determined or identifiable interaction points. Second, since agent execution and query execution can be time-consuming, allowing the agent to operate entirely autonomously eliminates any synchronous work, saving human time. Combined with the strong performance we've seen from a fully autonomous agent, these considerations made autonomy the approach that offers both efficiency and simplicity.
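To make this concrete, here is a minimal sketch of what such an autonomous exploration loop might look like. It assumes a hypothetical `llm_complete` helper that wraps any chat-completion API and an existing Spark session; the prompts, step budget, and JSON protocol are illustrative, not a description of Kythera's production agent.

```python
import json

MAX_STEPS = 10  # budget for exploratory queries before the agent must summarize

def explore_source(spark, table: str, cdm_summary: str) -> str:
    """Iteratively query a new source and return an intelligent data profile."""
    findings = []
    for _ in range(MAX_STEPS):
        prompt = (
            f"You are profiling the table `{table}` for integration into this "
            f"common data model:\n{cdm_summary}\n"
            f"Findings so far: {json.dumps(findings, default=str)}\n"
            'Reply with JSON: {"sql": "<next exploratory query>"} '
            'or {"done": true, "profile": "<summary of the data profile>"}.'
        )
        step = json.loads(llm_complete(prompt))  # llm_complete: hypothetical helper
        if step.get("done"):
            return step["profile"]  # the finished data profile
        rows = spark.sql(step["sql"]).limit(20).collect()
        findings.append({"sql": step["sql"], "result": [r.asDict() for r in rows]})
    # Fall back to summarizing whatever was gathered within the step budget
    return llm_complete(f"Summarize these findings into a data profile: {findings}")
```

Because the loop is fully autonomous, it can run asynchronously overnight against a new source and present a finished profile for human review the next morning, which is exactly the synchronous work the design avoids.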
It is easy to see how an agent saves time, but agents in data exploration can also improve data quality. An agent, if properly trained and validated, can delve more deeply and thoroughly into the data than a human, is more likely to remain aware of the context relevant to interpreting results, and is not prone to fatigue, problems that become more acute for a human as data grows large. Agents in data exploration therefore serve as both time-saving and quality-enhancing tools.
Kythera Labs uses a custom CDM to harmonize data from multiple sources. Historically, mapping new data to a CDM has been a slow, manual process requiring both technical skill and deep domain knowledge. Now, AI agents can dramatically accelerate this work. With the right tools and context, generative AI agents can:

- Query and explore a new data source to build an intelligent data profile
- Identify elements that are distinctive to the source as well as elements that overlap with the CDM
- Generate the requirements for transforming the raw data into the CDM
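Once the exploration agent has produced transformation requirements, the pipelining phase can be jump-started as well. The sketch below is again only illustrative: it assumes the agent's output has been reduced to a simple, human-approved column mapping of the form `{"source": ..., "cdm_target": ...}` and projects a raw table onto the CDM accordingly.

```python
from pyspark.sql import functions as F

def build_cdm_load(spark, source_table: str, mapping: list[dict]):
    """Project a raw source table onto CDM columns per the approved mapping."""
    cols = [
        F.col(m["source"]).alias(m["cdm_target"])
        for m in mapping
        if m["cdm_target"] is not None  # unmapped columns are cataloged, not loaded
    ]
    return spark.table(source_table).select(*cols)
```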
Once data has been created, it needs to be managed properly. This is another area where Generative AI can accelerate existing processes and power capabilities that were previously not available.
For structured data, one of the most basic and necessary elements of metadata is a good data dictionary. A data dictionary documents the dataset’s content, defines column formats and data types, and maps relationships between tables within a data model. Generative AI can automate data dictionary creation from the data alone by analyzing raw datasets to generate column descriptions, detect data types and formats, identify key relationships and even suggest potential use cases based on the data’s structure and content patterns.
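As a hedged illustration, a data dictionary draft might be generated along these lines, using a PySpark DataFrame's schema plus a small sample of each column; `llm_complete` is the same hypothetical LLM helper as in the earlier sketch, and a human would still review the output before it is published.

```python
def draft_data_dictionary(df) -> dict[str, str]:
    """Return {column: description} drafted by an LLM from schema plus samples."""
    # Pull a handful of rows so the model sees real value formats, not just types
    sample = df.limit(5).toPandas().to_dict(orient="list")
    dictionary = {}
    for field in df.schema.fields:
        prompt = (
            f"Column `{field.name}` has type {field.dataType.simpleString()} "
            f"and sample values {sample.get(field.name)}. "
            "Write a one-sentence description of what this column contains."
        )
        dictionary[field.name] = llm_complete(prompt).strip()  # hypothetical helper
    return dictionary
```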
When working with large, complex data, it is difficult to onboard the data and efficiently identify which data points are relevant to a common data model. Generative AI can not only help detect, from each source, what data aligns with which column heading, but also generate entity descriptions and populate tables with sample values for validation.
For data sets that are accompanied by incomplete or weakly worded header descriptions, Generative AI can generate clearer, more descriptive labels to improve data usability.
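These two capabilities, aligning source fields with CDM columns and rewriting weak header descriptions, can be combined in a single pass. The sketch below shows one possible shape for that step; `CDM_COLUMNS`, the JSON output format, and `llm_complete` are all illustrative assumptions.

```python
import json

# Illustrative slice of a CDM's documented columns
CDM_COLUMNS = {
    "patient_id": "Unique patient identifier",
    "svc_date": "Date the service was rendered",
}

def propose_mapping(source_columns: dict[str, str]) -> list[dict]:
    """Align raw source columns with CDM columns and draft clearer labels.

    `source_columns` maps raw headers to their (possibly weak) descriptions.
    Returns one record per source column: its CDM target (or null if none)
    and a clearer human-readable label, for validation by a data engineer.
    """
    prompt = (
        f"CDM columns: {json.dumps(CDM_COLUMNS)}\n"
        f"Source columns: {json.dumps(source_columns)}\n"
        "For each source column return a JSON object with keys "
        "'source', 'cdm_target' (or null if no match), and 'clear_label'. "
        "Reply with a JSON list only."
    )
    return json.loads(llm_complete(prompt))  # hypothetical helper
```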
Unstructured data also requires effective metadata management to enable efficient search and retrieval. Generative AI excels at this through document classification, sentiment analysis, summarization, and identification of sensitive information. This ultimately accelerates work requiring unstructured data by enabling the class and content of that data to be evaluated before human review.
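In practice this might look like a single enrichment call that attaches a class, summary, sentiment, and sensitivity flag to each document before anyone reads it. The output schema below is an illustrative assumption, not a fixed standard.

```python
import json

def enrich_document(text: str) -> dict:
    """Classify, summarize, and flag sensitive content in one LLM call."""
    prompt = (
        "Analyze the document below and reply with JSON containing: "
        "'doc_class' (e.g. clinical note, contract, correspondence), "
        "'summary' (two sentences), 'sentiment' (positive/neutral/negative), "
        "and 'contains_phi' (true if protected health information appears).\n\n"
        f"{text[:4000]}"  # truncate to stay within a typical context window
    )
    return json.loads(llm_complete(prompt))  # hypothetical helper
```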
Well-managed metadata also enables generative AI to be used on the metadata itself. Healthcare presents unique challenges for understanding data models, domain terminology, and documentation because of complex data generation processes and the sheer variety of data. Generative AI, in conjunction with good metadata management and a well-structured, informative data model, is a powerful tool for addressing these problems by isolating only the elements necessary for a specific use case. This not only saves time but also improves the quality of any work conducted with the data and educates users on best practices and relevant domain knowledge. Data not immediately relevant to a specific use case is cataloged for potential future use.
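One simple way to isolate the relevant elements is to search the data dictionary itself. The sketch below ranks dictionary entries against a user's question by embedding similarity; `embed` is a hypothetical wrapper around any text-embedding model, and the scoring is a plain cosine similarity.

```python
import numpy as np

def relevant_columns(question: str, dictionary: dict[str, str], top_k: int = 5):
    """Rank data dictionary entries by cosine similarity to the question."""
    q = np.asarray(embed(question))  # embed: hypothetical embedding helper
    scored = []
    for col, desc in dictionary.items():
        v = np.asarray(embed(f"{col}: {desc}"))
        score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((score, col))
    return [col for _, col in sorted(scored, reverse=True)[:top_k]]
```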
AI can now manage metadata, traditionally a manual chore, with greater depth and speed. For structured data, it can:

- Generate data dictionaries with column descriptions, data types and formats, and table relationships
- Detect which source fields align with which CDM columns and populate sample values for validation
- Rewrite incomplete or weakly worded header descriptions into clearer, more descriptive labels

For unstructured data, AI tools shine in:

- Document classification
- Sentiment analysis
- Summarization
- Identification of sensitive information
Once data has been constructed and managed, AI can also be used for data analysis. Here, AI does more than accelerate or enhance an otherwise well-defined process; it enables users to perform functions that would not be possible without it.
Before AI, querying data required expertise in a programming language, typically SQL or Python, for extracting and manipulating structured data. It also required understanding the data model in which the data was stored and any domain-specific knowledge about how the data should be queried to produce accurate answers. AI has transformed this by writing queries that are both syntactically correct and faithful to the data model in response to questions posed in natural language. Moreover, AI agents can be given guidance on how to interpret domain- and context-specific terminology so they query the data accurately; in more complex systems, they can even learn certain behaviors over time.
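A minimal sketch of this natural-language-to-SQL step might look as follows, grounding the model in both the schema and a domain glossary. The glossary entry, prompt format, and `llm_complete` helper are illustrative assumptions.

```python
# Domain terminology the model needs to interpret business questions correctly
GLOSSARY = {"visit": "one row in encounters per patient per service date"}

def question_to_sql(question: str, schema_ddl: str) -> str:
    """Generate SQL grounded in the data model and domain terminology."""
    prompt = (
        f"Schema:\n{schema_ddl}\n"
        f"Domain terminology: {GLOSSARY}\n"
        f"Question: {question}\n"
        "Return a single ANSI SQL query, no commentary."
    )
    return llm_complete(prompt)  # hypothetical helper

# e.g. question_to_sql("How many visits per provider last month?", encounters_ddl)
```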
Challenges remain, however. For non-technical users, trust and transparency are paramount, particularly when they lack the ability to assess the accuracy of generated code. At the same time, it is important to show technical users how business questions are translated into precise queries, including the nuances and potential pitfalls of the underlying data. By focusing on proper construction and evaluation of the AI system, we ensure the reliability of AI-generated outputs, and with the right tools, the reasoning process becomes more transparent. This approach builds user confidence and equips both technical and non-technical users to evaluate and trust the results.
Kythera is actively developing AI applications focused on healthcare data quality and real-world usability. At the heart of our work is a commitment to reducing the burden of healthcare data access and analysis. Effective metadata management is essential for helping users navigate the complexity of healthcare data. That commitment is also why we go beyond raw claims and build higher-level assets, tailored to specific use cases, to accelerate insights, lower costs, and support more confident data-driven decisions.
Healthcare data is complex, and so are the questions users need answered. Kythera is actively leveraging Databricks Genie to accelerate data exploration, and the results are encouraging, particularly with curated datasets, where Genie demonstrates strong performance.
There is more work to do to ensure data complexity is handled reliably and correctly. Our focus is ensuring that the technology can handle not just one use case but a broad range of real-world scenarios with reliability and scale.
AI is removing the technical barriers to data analysis. Instead of writing SQL or Python, users can now ask questions in natural language. Generative AI translates these into accurate, model-aware queries. With the right guardrails, AI helps:

- Translate natural-language questions into syntactically correct, data-model-accurate queries
- Apply domain- and context-specific terminology when querying the data
- Expose its reasoning so both technical and non-technical users can evaluate and trust the results
Data pipelines are evolving into intelligent, adaptive, and user-friendly systems with the integration of GenAI and other innovations. These advancements are transforming how data is processed and leveraged, empowering organizations to unlock the full potential of their data while enhancing agility and scalability.
Kythera is building this future today. Leveraging Databricks Genie, alongside multi-agent AI, remastered healthcare data assets, and deep clinical expertise, we’re enabling AI to work as a co-pilot across the data journey—from ingestion and integration to query and insight.