Life sciences teams know that the full value of real-world data is unlocked only when diverse sources can be analyzed together, consistently, and at scale. Yet claims, labs, and EHR data each speak a different dialect, making integrated evidence generation slow and error-prone. At the 2025 OHDSI Global Symposium, Kythera Labs demonstrated what becomes possible when that fragmentation is removed: multi-source real-world data unified into a single, OMOP-standardized dataset capable of powering high-fidelity, large-scale analytics and AI-driven discovery.
Kythera Labs unified large-scale real-world data (RWD) from a national medical claims source, a laboratory dataset, and two electronic health record (EHR) sources using the OMOP Common Data Model (CDM). The objective was to standardize complex, multi-source, multi-modal RWD to a single, globally recognized standard, OMOP CDM version 5.4. By aligning diverse data to a common structure, this approach eliminates the need for manual reconciliation across formats and enables standardized analytics using OMOP’s semantic framework.
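To make the idea of a common structure concrete, here is a minimal sketch of how a claims code and a local EHR code for the same diagnosis could land in one OMOP-style condition row. It is an illustration only, not Kythera's pipeline: the lookup table, the `LOCAL_EHR` vocabulary name, the helper `to_condition`, and the concept id `1001` are hypothetical placeholders, and only a handful of `condition_occurrence` fields are shown. The one convention taken from OMOP itself is that a concept id of 0 marks a source code with no standard mapping.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical toy lookup: (source vocabulary, source code) -> standard concept id.
# The ids are placeholders for illustration, not real OMOP concept ids.
SOURCE_TO_STANDARD = {
    ("ICD10CM", "E11.9"): 1001,
    ("LOCAL_EHR", "DM2"): 1001,
}

UNMAPPED = 0  # OMOP convention: concept id 0 marks an unmapped source code


@dataclass(frozen=True)
class ConditionOccurrence:
    """A small subset of the OMOP condition_occurrence fields."""
    person_id: int
    condition_concept_id: int
    condition_start_date: date
    condition_source_value: str


def to_condition(person_id: int, vocabulary: str, code: str, start: date) -> ConditionOccurrence:
    """Map one source diagnosis to an OMOP-style row, keeping the original code."""
    concept_id = SOURCE_TO_STANDARD.get((vocabulary, code), UNMAPPED)
    return ConditionOccurrence(person_id, concept_id, start, code)


claim_row = to_condition(1, "ICD10CM", "E11.9", date(2024, 3, 1))
ehr_row = to_condition(1, "LOCAL_EHR", "DM2", date(2024, 3, 4))
unknown_row = to_condition(2, "LOCAL_EHR", "XYZ", date(2024, 5, 9))
```

Once both rows carry the same standard concept id, a single query can count the diagnosis across sources, while the preserved source value keeps each row traceable to its original code.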
This work leveraged Wayfinder, built on Databricks, to execute scalable, Spark-based transformations and manage billions of records. The OMOP transformation pipeline was aligned with the FAIR (Findable, Accessible, Interoperable, Reusable) principles, enabling a searchable data catalog, role-based access controls, and end-to-end lineage and provenance tracking across the final, integrated OMOP dataset.
The pipeline employed a medallion-style architecture that streamlined Extract-Transform-Load (ETL) development, promoted logic reuse across sources, and supported rapid refresh cycles. Over 2.7 terabytes of structured RWD (~583 million source records) were transformed into OMOP, and the final OMOP dataset can be refreshed in under 90 minutes. We believe this is one of the largest known OMOP-conformed datasets in the U.S.
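The layering can be pictured with a small sketch. Plain Python stands in for Spark here, and everything in it is an assumption made for illustration: the function names `bronze`, `silver`, and `gold`, the column names, and the `person_ids` crosswalk are hypothetical and do not describe Wayfinder's actual code. The point is the reuse pattern: one shared cleaning step serves every source, and only a column map and a date format differ per source.

```python
from datetime import datetime


def bronze(raw_rows, source):
    """Bronze: land rows as-is, tagged with their source for lineage."""
    return [{**row, "_source": source} for row in raw_rows]


def silver(bronze_rows, column_map, date_format):
    """Silver: shared cleaning logic; only the column map and date format vary."""
    cleaned = []
    for row in bronze_rows:
        renamed = {target: row[src] for src, target in column_map.items()}
        renamed["event_date"] = datetime.strptime(renamed["event_date"], date_format).date()
        renamed["patient_key"] = str(renamed["patient_key"]).strip()
        renamed["_source"] = row["_source"]
        cleaned.append(renamed)
    return cleaned


def gold(silver_rows, person_ids):
    """Gold: OMOP-shaped rows keyed by a shared person_id (hypothetical crosswalk)."""
    return [
        {
            "person_id": person_ids[r["patient_key"]],
            "condition_source_value": r["code"],
            "condition_start_date": r["event_date"],
            "_source": r["_source"],
        }
        for r in silver_rows
    ]


claims = bronze([{"member": " A17 ", "dx": "E11.9", "svc_dt": "2024-03-01"}], "claims")
ehr = bronze([{"mrn": "A17", "problem": "DM2", "onset": "03/04/2024"}], "ehr_a")

conformed = (
    silver(claims, {"member": "patient_key", "dx": "code", "svc_dt": "event_date"}, "%Y-%m-%d")
    + silver(ehr, {"mrn": "patient_key", "problem": "code", "onset": "event_date"}, "%m/%d/%Y")
)
omop_rows = gold(conformed, person_ids={"A17": 1})
```

In this arrangement, adding another source means supplying a new column map and date format rather than writing a new cleaning routine, which is the kind of reuse a medallion-style design is meant to encourage.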
A key technical enabler of the transformation is the integration of standard open-source OHDSI tools, such as the Data Quality Dashboard (DQD) and Achilles, directly in Wayfinder. By containerizing these R-based tools within Kythera’s platform and leveraging SQL Warehouse capabilities, we reduced DQD execution time on billions of records to under 60 minutes. The process was not only fast but also compute-efficient.
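For readers unfamiliar with this style of data quality checking, the sketch below shows the general pattern: each check counts the rows that violate a rule and compares the violation rate with a threshold. It is written in Python for illustration and is not the DQD package's API. The `check` helper, the sample rows, the thresholds, and the placeholder concept id `1001` are all hypothetical.

```python
from datetime import date


def check(rows, name, is_violation, threshold_pct):
    """Count violating rows and compare the violation rate with a threshold."""
    violated = sum(1 for r in rows if is_violation(r))
    pct = 100.0 * violated / len(rows) if rows else 0.0
    return {
        "check": name,
        "violated": violated,
        "pct_violated": pct,
        "passed": pct <= threshold_pct,
    }


# Hypothetical sample of OMOP-style condition rows; 1001 is a placeholder id.
rows = [
    {"person_id": 1, "condition_concept_id": 1001, "condition_start_date": date(2024, 3, 1)},
    {"person_id": 1, "condition_concept_id": 0, "condition_start_date": date(2024, 3, 4)},
    {"person_id": None, "condition_concept_id": 1001, "condition_start_date": date(2024, 5, 9)},
    {"person_id": 2, "condition_concept_id": 1001, "condition_start_date": date(1890, 1, 1)},
]

results = [
    check(rows, "completeness: person_id is present",
          lambda r: r["person_id"] is None, 0.0),
    check(rows, "completeness: source code is mapped",
          lambda r: r["condition_concept_id"] == 0, 30.0),
    check(rows, "plausibility: start date not before 1950",
          lambda r: r["condition_start_date"] < date(1950, 1, 1), 1.0),
]

for result in results:
    print(result["check"], "->", "PASS" if result["passed"] else "FAIL")
```

At scale, checks of this shape are typically expressed as SQL aggregates so the database engine does the counting, which is why warehouse performance has such a direct effect on total run time.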
Kythera’s implementation highlights how a cloud-native, standards-aligned architecture can support more reproducible, scalable transformation of raw healthcare data into interoperable formats. For pharma R&D and data science teams, this approach simplifies the process of structuring, quality-assessing, and integrating RWD from disparate sources.
The 2025 OHDSI Symposium was an excellent forum for connecting with peers across industry and academia, sharing our progress, gathering feedback, and contributing to OHDSI’s mission of advancing standards-based science. We are especially thankful to our collaborators at The Hyve, who supported the OMOP vocabulary mappings as part of our initiative to map multi-source RWD to OMOP. We look forward to continued engagement with the OHDSI community in 2026.
We invite you to access our OHDSI US Study Brief and Poster. If you have any questions about our approach and implementation, or about the dataset, please get in touch.
Glynn Dennis, Jr, Ph.D., https://www.linkedin.com/in/glynnsc/
Chief Science Officer