'Where have my patients gone?': A simulation study on real-world data processing in Clinical Data Warehouses
Authors: Sonia Priou, Emmanuelle Kempf, Rémi Flicoteaux, Marija Jankovic, Gilles Chatellier, Christophe Tournigand, Christel Daniel, Guillaume Lamé
Journal: Health Policy and Technology, volume 13, issue 3, Article 100893 (published 2024-08-02)
DOI: 10.1016/j.hlpt.2024.100893
URL: https://www.sciencedirect.com/science/article/pii/S221188372400056X
Objective
To access Electronic Health Record (EHR) data, hospitals have implemented Clinical Data Warehouses (CDWs) using Extract, Transform and Load (ETL) processes. While the performance of each ETL is typically evaluated individually, our study examines the cumulative impact of ETLs on data availability.
Methods
Using a real multi-hospital CDW as a case study, we modeled EHR data processing from the software sources to the CDW's data store. We simulated a scenario where researchers aimed to reconstruct breast cancer care trajectories using EHR data. We calculated the size and characteristics of the data store population, and compared them to the original population.
Results
EHR data are recorded in various software systems depending on data category, hospital, and year, each requiring a specific series of ETLs for integration into the CDW. Despite acceptable transfer rates for each ETL (range 73 % to 100 %), cumulative losses led to study populations in the data store being up to 90 % smaller than anticipated when researchers required exhaustive data for patients. Population size decreased steeply as more data categories were required. No difference was found in population characteristics between the data store and the original cohorts.
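The compounding effect described above can be illustrated with a short calculation. This is a minimal sketch, not the authors' model: the category names, the number of ETLs per route, and every transfer rate are hypothetical values chosen within the reported 73 % to 100 % range, and losses are assumed to be independent across ETLs and across data categories.

```python
from math import prod


def route_retention(etl_rates):
    """Fraction of records surviving a chain of ETLs (independent losses assumed)."""
    return prod(etl_rates)


def exhaustive_retention(routes):
    """Fraction of patients for whom every required data category is present,
    assuming losses are independent across categories."""
    return prod(route_retention(rates) for rates in routes.values())


# Hypothetical data routes: category -> transfer rate of each ETL on its path.
routes = {
    "diagnosis": [0.95, 0.90],
    "surgery": [0.90, 0.85],
    "chemotherapy": [0.80, 0.90],
    "radiotherapy": [0.73, 0.95],
    "pathology_report": [0.85, 0.80],
}

# Add required categories one at a time to see how the population shrinks.
required = []
for category in routes:
    required.append(category)
    subset = {name: routes[name] for name in required}
    share = exhaustive_retention(subset)
    print(f"{len(required)} required categories: {share:.1%} of patients retained")
```

With these made-up rates, no single ETL falls below 73 %, yet requiring all five categories leaves roughly 22 % of the original patients. Real losses need not be independent, so the product is only a rough guide.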
Discussion & Conclusion
Researchers should scrutinize data availability in CDWs as missing data could result from outsourced care, incomplete input, or underperforming ETLs. Integrating more data sources in CDWs increases the number of data routes, necessitating time for ETL implementation and maintenance, and increases data loss risks. Though commonly perceived as a “black box”, data transformation can significantly influence the reliability of populations studied in CDWs.
Public interest Summary
To access data generated during care, researchers build Clinical Data Warehouses (CDWs). CDWs are infrastructures composed of a series of processing steps that extract the data from the data source, transform it as needed, and load it into a data store. Usually, the performance of these processing steps is evaluated one at a time. However, each data point goes through a series of processing steps before being made available for research. In this study, we aim to evaluate the impact of the entire data processing pipeline on the availability of data points in a CDW by simulating a study on breast cancer and evaluating the impact on the size and the characteristics of the final cohort. The cumulative losses of the processing steps resulted in a population up to 90 % smaller than anticipated. The characteristics of the final population did not differ from those of the original cohort.
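A patient-level simulation gives a feel for the two quantities compared here, cohort size and cohort characteristics. The sketch below is illustrative only: the cohort is synthetic, the per-category retention probabilities are hypothetical, and it assumes that whether a record is lost does not depend on the patient. That assumption is not taken from the study.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical probability that a patient's data for each category reaches the data store.
CATEGORY_RETENTION = {"diagnosis": 0.85, "surgery": 0.77, "chemotherapy": 0.72}

# Synthetic cohort: ages are invented, not drawn from the study.
cohort = [{"id": i, "age": random.gauss(60, 12)} for i in range(10_000)]


def reaches_data_store(patient):
    """True if every required category survives processing.
    Losses ignore the patient's traits (an assumption of this sketch)."""
    return all(random.random() < p for p in CATEGORY_RETENTION.values())


final_cohort = [p for p in cohort if reaches_data_store(p)]

original_age = mean(p["age"] for p in cohort)
final_age = mean(p["age"] for p in final_cohort)
print(f"original: n={len(cohort)}, mean age={original_age:.1f}")
print(f"final:    n={len(final_cohort)}, mean age={final_age:.1f}")
```

Under these assumptions the final cohort is less than half the original size while its mean age stays close to the original. If losses were tied to hospital, year, or software, the characteristics could shift, which is why comparing both cohorts is worthwhile.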
About the journal:
Health Policy and Technology (HPT) is the official journal of the Fellowship of Postgraduate Medicine (FPM). It is a cross-disciplinary journal that focuses on past, present and future health policy and the role of technology in clinical and non-clinical national and international health environments.
HPT gives the FPM a further way to continue making national and international contributions to the development of policy and practice within medicine and related disciplines. The aim of HPT is to publish relevant, timely and accessible articles and commentaries to support policy-makers, health professionals, health technology providers, patient groups and academics interested in health policy and technology.
Topics covered by HPT will include:
- Health technology, including drug discovery, diagnostics, medicines, devices, therapeutic delivery and eHealth systems
- Cross-national comparisons on health policy using evidence-based approaches
- National studies on health policy to determine the outcomes of technology-driven initiatives
- Cross-border eHealth including health tourism
- The digital divide in mobility, access and affordability of healthcare
- Health technology assessment (HTA) methods and tools for evaluating the effectiveness of clinical and non-clinical health technologies
- Health and eHealth indicators and benchmarks (measure/metrics) for understanding the adoption and diffusion of health technologies
- Health and eHealth models and frameworks to support policy-makers and other stakeholders in decision-making
- Stakeholder engagement with health technologies (clinical and patient/citizen buy-in)
- Regulation and health economics