'Where have my patients gone?': A simulation study on real-world data processing in Clinical Data Warehouses

Health Policy and Technology (IF 3.7; Zone 3, Medicine; JCR Q1, Health Policy & Services). Pub Date: 2024-08-02. DOI: 10.1016/j.hlpt.2024.100893

Sonia Priou, Emmanuelle Kempf, Rémi Flicoteaux, Marija Jankovic, Gilles Chatellier, Christophe Tournigand, Christel Daniel, Guillaume Lamé

Health Policy and Technology, volume 13, issue 3, Article 100893. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S221188372400056X

Abstract

Objective

To access Electronic Health Record (EHR) data, hospitals have implemented Clinical Data Warehouses (CDWs) using Extract, Transform, and Load (ETL) processes. While the performance of each ETL is typically evaluated individually, our study examines the cumulative impact of ETLs on data availability.

Methods

Using a real multi-hospital CDW as a case study, we modeled EHR data processing from the software sources to the CDW's data store. We simulated a scenario where researchers aimed to reconstruct breast cancer care trajectories using EHR data. We calculated the size and characteristics of the data store population, and compared them to the original population.
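The kind of simulation described here can be illustrated with a small Monte Carlo sketch. It is not the authors' model: the function `simulate_cohort`, the route names, the transfer rates, and the age distribution are all hypothetical, and each ETL is assumed to lose records independently of patient characteristics and of the other ETLs.

```python
import random
from statistics import mean


def simulate_cohort(n_patients, routes, seed=0):
    """Pass simulated patients through chains of ETLs.

    routes maps a data category to the list of per-ETL transfer rates
    (probability that a record survives that ETL). A patient reaches the
    data store only if every required category survives every ETL on its
    route. All values are illustrative assumptions.
    """
    rng = random.Random(seed)
    original, stored = [], []
    for _ in range(n_patients):
        age = rng.gauss(60, 12)  # hypothetical patient characteristic
        original.append(age)
        kept = all(
            all(rng.random() < rate for rate in etl_rates)
            for etl_rates in routes.values()
        )
        if kept:
            stored.append(age)
    return original, stored


# Hypothetical data routes: category -> transfer rate of each ETL in the chain.
example_routes = {
    "diagnoses": [0.95, 0.90],
    "pathology_reports": [0.80, 0.73],
    "chemotherapy": [0.85, 0.80],
}

original, stored = simulate_cohort(10_000, example_routes)
print(f"original cohort: {len(original)} patients, mean age {mean(original):.1f}")
print(f"data store:      {len(stored)} patients, mean age {mean(stored):.1f}")
```

Because losses in this sketch do not depend on age, the stored cohort shrinks while its mean age stays close to the original; a loss mechanism tied to patient characteristics would behave differently.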

Results

EHR data are recorded in various software depending on data category, hospital, and year, each requiring a specific series of ETLs for integration in the CDW. Despite acceptable transfer rates for each ETL (range 73 %-100 %), cumulative losses led to study populations in the data store being up to 90 % smaller than anticipated when researchers required exhaustive data for each patient. Population size decreased steeply as more data categories were required. No difference was found in population characteristics between the data store and the original cohorts.
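A short arithmetic sketch shows why individually acceptable transfer rates can compound into large losses. The rates and category names below are hypothetical (chosen within the reported 73 %-100 % range), and the calculation assumes that losses are independent across ETLs and across data categories, which the study does not state.

```python
from math import prod


def route_retention(transfer_rates):
    """Fraction of records surviving a chain of ETLs (independent losses assumed)."""
    return prod(transfer_rates)


def complete_case_retention(routes):
    """Fraction of patients retaining every required category.

    Assumes losses are independent across categories, so the per-route
    retentions multiply. This is an illustrative simplification.
    """
    return prod(route_retention(rates) for rates in routes.values())


# Hypothetical routes: category -> transfer rate of each ETL on its route.
hypothetical_routes = {
    "diagnoses": [0.95, 0.90],
    "procedures": [0.90, 0.85],
    "pathology_reports": [0.80, 0.73],
    "chemotherapy": [0.85, 0.80],
}

# Add one required category at a time and watch the usable cohort shrink.
required = {}
for category, rates in hypothetical_routes.items():
    required[category] = rates
    share = complete_case_retention(required)
    print(f"{len(required)} categories required: {share:.1%} of patients retained")
```

Under these assumptions, every additional required category multiplies the retained share by a factor below one, so the cohort shrinks quickly even though no single ETL looks alarming on its own.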

Discussion & Conclusion

Researchers should scrutinize data availability in CDWs, as missing data could result from outsourced care, incomplete input, or underperforming ETLs. Integrating more data sources in CDWs increases the number of data routes, which requires time for ETL implementation and maintenance and raises the risk of data loss. Though commonly perceived as a “black box”, data transformation can significantly influence the reliability of populations studied in CDWs.

Public Interest Summary

To access data generated during care, researchers build Clinical Data Warehouses (CDWs). CDWs are infrastructures composed of a series of processing steps that extract the data from the data source, transform it as needed, and load it into a data store. Usually, the performance of these processing steps is evaluated one at a time. However, each data point goes through a series of processing steps before being made available for research. In this study, we evaluate the impact of the entire data processing pipeline on the availability of data points in a CDW by simulating a study on breast cancer and measuring the effect on the size and characteristics of the final cohort. The cumulative losses of the processing steps resulted in a population up to 90 % smaller than anticipated. The characteristics of the final population did not differ from those of the original cohort.

Source journal

Health Policy and Technology (Medicine: Health Policy)

CiteScore: 9.20

Self-citation rate: 3.30%

Review time: 88 days

About the journal: Health Policy and Technology (HPT) is the official journal of the Fellowship of Postgraduate Medicine (FPM). It is a cross-disciplinary journal that focuses on past, present and future health policy and the role of technology in clinical and non-clinical national and international health environments. HPT provides a further excellent way for the FPM to continue to make important national and international contributions to the development of policy and practice within medicine and related disciplines. The aim of HPT is to publish relevant, timely and accessible articles and commentaries to support policy-makers, health professionals, health technology providers, patient groups and academia interested in health policy and technology. Topics covered by HPT include:

- Health technology, including drug discovery, diagnostics, medicines, devices, therapeutic delivery and eHealth systems
- Cross-national comparisons on health policy using evidence-based approaches
- National studies on health policy to determine the outcomes of technology-driven initiatives
- Cross-border eHealth including health tourism
- The digital divide in mobility, access and affordability of healthcare
- Health technology assessment (HTA) methods and tools for evaluating the effectiveness of clinical and non-clinical health technologies
- Health and eHealth indicators and benchmarks (measures/metrics) for understanding the adoption and diffusion of health technologies
- Health and eHealth models and frameworks to support policy-makers and other stakeholders in decision-making
- Stakeholder engagement with health technologies (clinical and patient/citizen buy-in)
- Regulation and health economics