Introduction
Data provenance, which documents the origin, history, and transformations of data, can enhance the reproducibility of processing workflows and help to address errors and quality issues. In this work, we focus on tracking and utilizing provenance information as part of quality management in Extract-Transform-Load (ETL) processes used to build clinical data warehouses.
Methods
We designed and implemented a framework that automatically tracks how data flows through an ETL process and detects errors and quality problems during processing. Detected issues are reported to an Application Programming Interface (API), which stores them together with contextual information on their location within the data being transformed and within the overall workflow. We further designed a dashboard that supports health data engineers in inspecting the encountered issues and tracing them back to their root causes.
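To illustrate the idea of reporting issues together with their context, the following is a minimal sketch and not the framework's actual code. It assumes a hypothetical issue model (`ProvenanceIssue`), an in-memory stand-in for the API (`IssueStore`), and a hypothetical transformation step (`parseLabValue`); all names, fields, and the example values are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical issue model: one quality problem plus its provenance context
// (which process, which step, which source record).
record ProvenanceIssue(String process, String step, String sourceRecordId, String message) {}

// Hypothetical in-memory stand-in for the API that stores reported issues.
class IssueStore {
    private final List<ProvenanceIssue> issues = new ArrayList<>();

    void report(ProvenanceIssue issue) {
        issues.add(issue);
    }

    // Count issues per ETL step, the kind of aggregate a dashboard panel might show.
    Map<String, Integer> countByStep() {
        Map<String, Integer> counts = new TreeMap<>();
        for (ProvenanceIssue issue : issues) {
            counts.merge(issue.step(), 1, Integer::sum);
        }
        return counts;
    }

    // Trace the issues of one step back to the source records that triggered them.
    List<String> sourceRecordsFor(String step) {
        List<String> ids = new ArrayList<>();
        for (ProvenanceIssue issue : issues) {
            if (issue.step().equals(step)) {
                ids.add(issue.sourceRecordId());
            }
        }
        return ids;
    }
}

class ProvenanceSketch {
    // Hypothetical transformation step: parse a numeric value and report a
    // failure with its context instead of aborting the whole load.
    static Double parseLabValue(IssueStore store, String recordId, String raw) {
        try {
            return Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            store.report(new ProvenanceIssue(
                    "lab-import", "parse-value", recordId, "not numeric: " + raw));
            return null;
        }
    }

    public static void main(String[] args) {
        IssueStore store = new IssueStore();
        parseLabValue(store, "row-1", "4.2");
        parseLabValue(store, "row-2", "n/a");
        parseLabValue(store, "row-3", "<5");
        System.out.println(store.countByStep());
        System.out.println(store.sourceRecordsFor("parse-value"));
    }
}
```

In this sketch each issue carries both the workflow position (process and step) and the affected source record, so the same stored data can be aggregated into frequent error patterns and traced back to individual data points.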
Results
The framework was implemented in Java using the Spring Framework and integrated into ETL processes for Informatics for Integrating Biology and the Bedside (i2b2). The dashboard was realized using Grafana. We evaluated our approach on three different ETL processes that integrate real-world datasets into our i2b2 clinical data warehouse. Using the provenance dashboard, we were able to identify frequent error patterns and link them to specific data points from the sources as well as to specific ETL process steps. Provenance tracking increased the execution times of the loading processes, with the overhead depending on the number of identified issues.
Conclusions
Provenance tracking can be a valuable tool for implementing continuous quality management for ETL processes. Relevant information can be collected from existing ETL workloads using dedicated APIs and visualized through dashboards, which help identify frequent problem patterns together with their root causes and thereby inform improvements.