{"title":"Tracking provenance in clinical data warehouses for quality management","authors":"Marco Johns, Lena Baum, Fabian Prasser","doi":"10.1016/j.ijmedinf.2024.105690","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>Data provenance, which documents the origin, history, and transformations of data, can enhance the reproducibility of processing workflows and help to address errors and quality issues. In this work, we focus on tracking and utilizing provenance information as part of quality management in Extract-Transform-Load (ETL) processes used to build clinical data warehouses.</div></div><div><h3>Methods</h3><div>We designed and implemented a framework that automatically tracks how data flows through an ETL process and detects errors and quality problems during processing. This information is then reported against an Application Programming Interface (API) that stores the issues along with contextual information on their location within the data being transformed and the overall workflow. We further designed a dashboard that supports health data engineers with inspecting the encountered issues and tracing them back to their root causes.</div></div><div><h3>Results</h3><div>The framework was implemented in Java using the Spring Framework and integrated into ETL processes for Informatics for Integrating Biology and the Bedside (i2b2). The dashboard was realized using Grafana. We evaluated our approach on three different ETL processes for real-world datasets used to integrate them into our i2b2 clinical data warehouse. Using the provenance dashboard, we were able to identify frequent error patterns and link them to specific data points from the sources as well as ETL process steps. 
Provenance tracking increased the execution times of loading processes with an impact depending on the number of identified issues.</div></div><div><h3>Conclusions</h3><div>Provenance tracking can be a valuable tool for implementing continuous quality management for ETL processes. Relevant information can be collected from existing ETL workloads using dedicated APIs and visualized through dashboards, which support the identification of frequent patterns of problems together with their root causes, providing valuable information for improvements.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"193 ","pages":"Article 105690"},"PeriodicalIF":3.7000,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1386505624003538","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract
Introduction
Data provenance, which documents the origin, history, and transformations of data, can enhance the reproducibility of processing workflows and help to address errors and quality issues. In this work, we focus on tracking and utilizing provenance information as part of quality management in Extract-Transform-Load (ETL) processes used to build clinical data warehouses.
Methods
We designed and implemented a framework that automatically tracks how data flows through an ETL process and detects errors and quality problems during processing. This information is then reported to an Application Programming Interface (API) that stores the issues along with contextual information on their location within the data being transformed and within the overall workflow. We further designed a dashboard that supports health data engineers in inspecting the encountered issues and tracing them back to their root causes.
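The following is a minimal sketch of the kind of issue tracking described above, written in Python for brevity (the framework itself is implemented in Java). All names, including `ProvenanceIssue`, `IssueReporter`, `run_step`, and the record fields, are hypothetical and not taken from the framework. The reporter keeps issues in memory, whereas the described system sends them to an API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class ProvenanceIssue:
    """One detected problem plus the context needed to trace it back."""
    workflow: str  # name of the ETL process (hypothetical field)
    step: str      # ETL step in which the issue occurred
    source: str    # source file or table the record came from
    row: int       # 1-based position of the record within the source
    message: str   # short description of the problem


@dataclass
class IssueReporter:
    """In-memory stand-in for an API that stores reported issues."""
    issues: List[ProvenanceIssue] = field(default_factory=list)

    def report(self, issue: ProvenanceIssue) -> None:
        self.issues.append(issue)


def run_step(
    workflow: str,
    step: str,
    source: str,
    records: Iterable[Dict[str, Any]],
    transform: Callable[[Dict[str, Any]], Dict[str, Any]],
    reporter: IssueReporter,
) -> List[Dict[str, Any]]:
    """Apply `transform` to each record; report failures instead of aborting."""
    results = []
    for row, record in enumerate(records, start=1):
        try:
            results.append(transform(record))
        except (KeyError, ValueError) as exc:
            message = f"{type(exc).__name__}: {exc}"
            reporter.report(ProvenanceIssue(workflow, step, source, row, message))
    return results


def parse_weight(record: Dict[str, Any]) -> Dict[str, Any]:
    """Example transformation: convert a weight given as text to a float."""
    return {"patient": record["patient"], "weight_kg": float(record["weight"])}


reporter = IssueReporter()
rows = [
    {"patient": "p1", "weight": "72.5"},
    {"patient": "p2", "weight": "n/a"},  # not a number
    {"patient": "p3"},                   # weight field missing
]
loaded = run_step("demo-etl", "parse_weight", "weights.csv", rows, parse_weight, reporter)
```

In this sketch, the valid record is loaded and the two faulty records each produce an issue that names the workflow, the step, the source, and the row, which is the contextual information needed to trace a problem back to its origin.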
Results
The framework was implemented in Java using the Spring Framework and integrated into ETL processes for Informatics for Integrating Biology and the Bedside (i2b2). The dashboard was realized using Grafana. We evaluated our approach on three different ETL processes that integrate real-world datasets into our i2b2 clinical data warehouse. Using the provenance dashboard, we were able to identify frequent error patterns and link them to specific data points from the sources as well as to ETL process steps. Provenance tracking increased the execution times of loading processes, with the size of the impact depending on the number of identified issues.
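As an illustration of how frequent error patterns can be linked to process steps and data points, the sketch below groups stored issues by step and issue type and then looks up the affected source rows. It is written in Python under stated assumptions: the issue tuples, step names, and helper functions are hypothetical, and it does not reproduce how the Grafana dashboard is built.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical stored issues as (etl_step, issue_type, source_row) tuples.
Issue = Tuple[str, str, int]

issues: List[Issue] = [
    ("parse_dates", "invalid_format", 12),
    ("parse_dates", "invalid_format", 57),
    ("map_codes", "unknown_code", 12),
    ("parse_dates", "invalid_format", 90),
    ("map_codes", "unknown_code", 33),
    ("load_facts", "duplicate_key", 57),
]


def frequent_patterns(issues: List[Issue], top: int = 2) -> List[Tuple[Tuple[str, str], int]]:
    """Count issues per (step, type) pair and return the most frequent pairs."""
    counts = Counter((step, kind) for step, kind, _ in issues)
    return counts.most_common(top)


def rows_for(issues: List[Issue], step: str, kind: str) -> List[int]:
    """Return the source rows affected by one (step, type) pattern, sorted."""
    return sorted(row for s, k, row in issues if s == step and k == kind)


patterns = frequent_patterns(issues)
top_step, top_kind = patterns[0][0]
affected_rows = rows_for(issues, top_step, top_kind)
```

Grouping first and drilling down second mirrors the workflow described above: the most frequent pattern points to a process step, and the affected rows point to the data points in the source.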
Conclusions
Provenance tracking can be a valuable tool for implementing continuous quality management for ETL processes. Relevant information can be collected from existing ETL workloads using dedicated APIs and visualized through dashboards, which support the identification of frequent patterns of problems together with their root causes, providing valuable information for improvements.
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.