Background: Missing data are a common challenge in electronic health record (EHR)-based prediction modeling. Traditional imputation methods may not suit prediction or machine learning models, and real-world use requires workflows that are implementable for both model development and real-time prediction.
Objective: We evaluated methods for handling missing data when using EHR data to build clinical prediction models for patients admitted to the pediatric intensive care unit (PICU).
Methods: Using EHR data containing missing values from an academic medical center PICU, we generated a synthetic complete dataset. From this, we created 300 datasets with missing data under varying mechanisms and proportions of missingness for the outcomes of (1) successful extubation (binary) and (2) blood pressure (continuous). We assessed strategies to address missing data including simple methods (eg, last observation carried forward [LOCF]), complex methods (eg, random forest multiple imputation), and native support for missing values in outcome prediction models.
Results: Across 886 patients and 1220 intubation events, 18.2% of original data were missing. LOCF had the lowest imputation error, followed by random forest imputation (average mean squared error [MSE] improvement over mean imputation: 0.41 [range: 0.30, 0.50] and 0.33 [0.21, 0.43], respectively). LOCF generally outperformed other imputation methods across outcome metrics and models (mean improvement: 1.28% [range: -0.07%, 7.2%]). Imputation methods showed more performance variability for the binary outcome (balanced accuracy coefficient of variation: 0.042) than the continuous outcome (mean squared error coefficient of variation: 0.001).
Conclusions: Traditional imputation methods for inferential statistics, such as multiple imputation, may not be optimal for prediction models. The amount of missingness influenced performance more than the missingness mechanism. In datasets with frequent measurements, LOCF and native support for missing values in machine learning models offer reasonable performance for handling missingness at minimal computational cost in predictive analyses.
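The comparison between LOCF and mean imputation reported above can be illustrated with a minimal sketch. The heart-rate series, the masked positions, and all function names are hypothetical and are not taken from the study; the sketch only shows how imputation error can be scored against a known complete series.

```python
from statistics import mean

def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def mean_impute(values):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def imputation_mse(truth, masked, imputed):
    """Mean squared error over the positions that were masked and then filled."""
    errors = [(t - i) ** 2
              for t, m, i in zip(truth, masked, imputed)
              if m is None and i is not None]
    return mean(errors)

# Hypothetical frequently measured vital sign with two values masked out.
truth = [120, 122, 121, 125, 130, 128]
masked = [120, None, 121, 125, None, 128]

locf_mse = imputation_mse(truth, masked, locf(masked))
mean_mse = imputation_mse(truth, masked, mean_impute(masked))
print(f"LOCF MSE: {locf_mse}, mean-imputation MSE: {mean_mse}")
```

In a slowly changing series such as this one, the previous value tends to be closer to the masked value than the overall mean, which is consistent with the direction of the reported result.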
Background: The unstructured data of Chinese cancer electronic health records (EHRs) contains valuable medical expertise. Accurate medical entity recognition is crucial for building a medical-assisted decision system. Named entity recognition (NER) in cancer EHRs typically uses general models designed for English medical records. There is a lack of specialized handling for cancer-specific records and limited application to Chinese medical records.
Objective: This study aims to propose a specific NER model to enhance the recognition of medical entities in Chinese cancer EHRs.
Methods: Desensitized inpatient EHRs related to breast cancer were collected from a leading hospital in Beijing. Building upon the MC Bidirectional Encoder Representations from Transformers (BERT) foundation, the study further incorporated a Chinese cancer corpus for pretraining, resulting in the construction of the ChCancerBERT pretrained model. In conjunction with dilated-gated convolutional neural networks, bidirectional long short-term memory, multihead attention mechanism, and a conditional random field, this model forms a multimodel, multilevel integrated NER approach.
Results: This approach effectively extracts medical entity features related to symptoms, signs, tests, treatments, and time in Chinese breast cancer EHRs. The entity recognition performance of the proposed model surpasses that of the baseline model and other models compared in the experiment. The F1-score reached 86.93%, precision reached 87.24%, and recall reached 86.61%. The model introduced in this study demonstrates exceptional performance on the CCKS2019 dataset, attaining a precision rate of 87.26%, a recall rate of 87.27%, and an impressive F1-score of 87.26%, surpassing that of existing models.
Conclusions: The experiments demonstrate that the approach proposed in this study exhibits excellent performance in NER within breast cancer EHRs. This advancement will further contribute to clinical decision support for cancer treatment and research. In addition, the study reveals that incorporating domain-specific corpora in clinical NER tasks can further enhance the performance of BERT models in specialized domains.
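Precision, recall, and F1-score for NER are commonly computed at the entity level, where a prediction counts only if its span and type both match the gold annotation. The sketch below assumes BIO tagging and exact-match scoring; the abstract does not state the tagging scheme or matching rule, and the tag names and example sequences are hypothetical.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, type) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def prf(gold, pred):
    """Exact-match precision, recall, and F1 over two sets of spans."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted tags for one six-token sentence.
gold_tags = ["B-SYM", "I-SYM", "O", "B-TEST", "O", "B-TIME"]
pred_tags = ["B-SYM", "I-SYM", "O", "B-TEST", "I-TEST", "O"]

gold_spans = bio_to_spans(gold_tags)
pred_spans = bio_to_spans(pred_tags)
precision, recall, f1 = prf(gold_spans, pred_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here the predicted test entity has the wrong right boundary, so it counts as both a false positive and a false negative under exact matching.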
Background: Coherence across sites in multicenter datasets is one substantial data quality dimension for reliable health data reuse, as unexpected heterogeneity in data can lead to biases in data analyses and suboptimal generalization of results.
Objective: This work aims to characterize and label the data coherence across sites in the first European multicenter dataset for cancer prevention and early detection among the homeless population in Europe: coadapting and implementing the health navigator model. This dataset was created to enable research on disparities in health and health care access that arise from barriers such as unstable housing, limited resources, and social stigma among people experiencing homelessness.
Methods: The dataset comprises 652 cases: 142 from Austria, 158 from Greece, 197 from Spain, and 155 from the United Kingdom. All participants fit classifications from the European Typology of Homelessness and Housing Exclusion. This longitudinal study collected questionnaires at baseline, after 4 weeks, and at the end of the intervention. The 180-question survey covered sociodemographic data, overall health, mental health, empowerment, and interpersonal communication. Data variability was assessed using information theory and geometric methods to analyze discrepancies in distributions and completeness across the dataset.
Results: Substantial variability was observed among the 4 pilot countries, both in the overall analysis and within specific domains. Measures of health care empowerment, quality of life, and interpersonal communication showed the greatest discrepancies among pilot sites; the health domain was the exception. Notably, Spain exhibited the most pronounced differences, characterized by a high number of missing values related to interpersonal communication and the use of health care services.
Conclusions: Health data may be comparable across the 4 countries; however, substantial differences were observed in the other questionnaires, requiring independent, country-specific analyses. This study underscores the heterogeneity among people experiencing homelessness and the critical need for data quality assessments to inform future research and policymaking in this field.
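One way to quantify between-site discrepancies in a questionnaire item with an information-theoretic measure is the Jensen-Shannon divergence between the sites' answer distributions. The abstract does not name the specific metric used, so this choice is an assumption, and the answers below are hypothetical. Treating a missing answer as its own category lets differences in completeness contribute to the distance.

```python
from collections import Counter
from math import log2

def distribution(answers, categories):
    """Relative frequency of each category; None stands for a missing answer."""
    counts = Counter(answers)
    return [counts[c] / len(answers) for c in categories]

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical answers to one yes/no item at two sites.
categories = ["yes", "no", None]
site_a = ["yes", "yes", "yes", "no"]
site_b = ["yes", None, None, "no"]

p_a = distribution(site_a, categories)
p_b = distribution(site_b, categories)
print(f"JSD between sites: {js_divergence(p_a, p_b):.3f}")
```

A value of 0 means identical distributions and 1 means the sites share no answer category at all.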
Background: As data-driven medical research advances, vast amounts of medical data are being collected, giving researchers access to important information. However, issues such as heterogeneity, complexity, and incompleteness of datasets limit their practical use. Errors and missing data negatively affect artificial intelligence-based predictive models, undermining the reliability of clinical decision-making. Thus, it is important to develop a quality management process (QMP) for clinical data.
Objective: This study aimed to develop a rules-based QMP to address errors and impute missing values in real-world data, establishing high-quality data for clinical research.
Methods: We used clinical data from 6491 patients with colorectal cancer (CRC) collected at Gachon University Gil Medical Center between 2010 and 2022, leveraging the clinical library established within the Korea Clinical Data Use Network for Research Excellence. First, we conducted a literature review on the prognostic prediction of CRC to assess whether the data met our research purposes, comparing selected variables with real-world data. A labeling process was then implemented to extract key variables, which facilitated the creation of an automatic staging library. This library, combined with a rule-based process, allowed for systematic analysis and evaluation.
Results: Theoretically, the tumor, node, metastasis (TNM) stage was identified as an important prognostic factor for CRC, but it was not selected through feature selection in real-world data. After applying the QMP, rates of missing data were reduced from 75.3% to 35.7% for TNM stage and from 24.3% to 18.5% for Surveillance, Epidemiology, and End Results (SEER) stage across 6491 cases, confirming the system's effectiveness. Variable importance analysis through feature selection revealed that TNM stage and detailed code variables, which were previously unselected, were included in the improved model.
Conclusions: In sum, we developed a rules-based QMP to address errors and impute missing values in Korea Clinical Data Use Network for Research Excellence data, enhancing data quality. The applicability of the process to real-world datasets highlights its potential for broader use in clinical studies and cancer research.
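A rule-based step of this kind can be sketched as deriving a missing stage group from T, N, and M components when they are available. The rules below are a deliberately simplified, illustrative grouping (substages and stage 0 are omitted), not the study's automatic staging library, and the record fields and function names are hypothetical.

```python
def derive_stage(t, n, m):
    """Return a coarse stage group, or None when the rules cannot decide."""
    if m == "M1":
        return "IV"   # distant metastasis decides the group on its own
    if m != "M0" or n is None:
        return None   # M or N unknown: leave the stage missing
    if n in ("N1", "N2"):
        return "III"  # nodal involvement without distant metastasis
    if n == "N0" and t in ("T1", "T2"):
        return "I"
    if n == "N0" and t in ("T3", "T4"):
        return "II"
    return None

def fill_missing_stage(record):
    """Keep a recorded stage; otherwise try to derive one and note its source."""
    if record.get("stage") is not None:
        return {**record, "stage_source": "recorded"}
    derived = derive_stage(record.get("t"), record.get("n"), record.get("m"))
    return {**record, "stage": derived,
            "stage_source": "derived" if derived else None}

def missing_rate(records):
    return sum(r.get("stage") is None for r in records) / len(records)

# Hypothetical records; only the last one has a recorded stage.
records = [
    {"t": "T3", "n": "N0", "m": "M0", "stage": None},
    {"t": None, "n": None, "m": "M1", "stage": None},
    {"t": "T2", "n": None, "m": "M0", "stage": None},
    {"t": "T4", "n": "N1", "m": "M0", "stage": "III"},
]

missing_before = missing_rate(records)
cleaned = [fill_missing_stage(r) for r in records]
missing_after = missing_rate(cleaned)
print(f"missing stage: {missing_before:.0%} -> {missing_after:.0%}")
```

Recording the source of each value keeps derived stages distinguishable from recorded ones in later analyses.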
Background: Integrated health data are foundational for secondary use, research, and policymaking. However, data quality issues, such as missing values and inconsistencies, are common due to the heterogeneity of health data sources. Existing frameworks often use static, 1-time assessments, which limit their ability to address quality issues across evolving data pipelines.
Objective: This study evaluates the AIDAVA (artificial intelligence-powered data curation and validation) data quality framework, which introduces dynamic, life cycle-based validation of health data using knowledge graph technologies and SHACL (Shapes Constraint Language)-based rules. The framework is assessed for its ability to detect and manage data quality issues, specifically completeness and consistency, during integration.
Methods: Using the MIMIC-III (Medical Information Mart for Intensive Care-III) dataset, we simulated real-world data quality challenges by introducing structured noise, including missing values and logical inconsistencies. The data were transformed into source knowledge graphs and integrated into a unified personal health knowledge graph. SHACL validation rules were applied iteratively during the integration process, and data quality was assessed under varying noise levels and integration orders.
Results: The AIDAVA framework effectively detected completeness and consistency issues across all scenarios. Completeness was shown to influence the interpretability of consistency scores, and domain-specific attributes (eg, diagnoses and procedures) were more sensitive to integration order and data gaps.
Conclusions: AIDAVA supports dynamic, rule-based validation throughout the data life cycle. By addressing both dimension-specific vulnerabilities and cross-dimensional effects, it lays the groundwork for scalable, high-quality health data integration. Future work should explore deployment in live clinical settings and expand to additional quality dimensions.
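The interaction between completeness and consistency noted above can be illustrated with a plain-Python stand-in for rule-based validation. This is not SHACL and not the AIDAVA implementation; the required fields, the single consistency rule, and the records are illustrative assumptions. The key point is that a consistency rule can only be evaluated on records that are complete enough to test it.

```python
from datetime import datetime

# Illustrative set of required fields for an admission record.
REQUIRED = ("subject_id", "admittime", "dischtime", "diagnosis")

def completeness(record):
    """Fraction of required fields that are present."""
    return sum(record.get(f) is not None for f in REQUIRED) / len(REQUIRED)

def check_consistency(record):
    """Admission must not follow discharge; None if the rule cannot be evaluated."""
    admit, disch = record.get("admittime"), record.get("dischtime")
    if admit is None or disch is None:
        return None
    return datetime.fromisoformat(admit) <= datetime.fromisoformat(disch)

def quality_report(records):
    results = [check_consistency(r) for r in records]
    evaluable = [r for r in results if r is not None]
    return {
        "completeness": sum(completeness(r) for r in records) / len(records),
        "consistency": sum(evaluable) / len(evaluable) if evaluable else None,
        "evaluable": len(evaluable),  # how many records the score rests on
    }

# Hypothetical records with injected noise: one inverted date pair, one with gaps.
records = [
    {"subject_id": 1, "admittime": "2101-10-20T19:08:00",
     "dischtime": "2101-10-31T13:58:00", "diagnosis": "SEPSIS"},
    {"subject_id": 2, "admittime": "2101-11-05T10:00:00",
     "dischtime": "2101-11-01T09:00:00", "diagnosis": "PNEUMONIA"},
    {"subject_id": 3, "admittime": "2101-12-01T08:00:00",
     "dischtime": None, "diagnosis": None},
]

report = quality_report(records)
print(report)
```

Reporting the number of evaluable records alongside the consistency score makes it visible when a high score rests on only a few complete records.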
Background: Although an increasing number of bedside medical devices are equipped with wireless connections for reliable notifications, many nonnetworked devices remain effective at detecting abnormal patient conditions and alerting medical staff through auditory alarms. Staff members, however, can miss these notifications, especially when in distant areas or other private rooms. In contrast, the signal-to-noise ratio of alarm systems for medical devices in the neonatal intensive care unit is 0 dB or higher. A feasible system for automatic sound identification with high accuracy is needed to prevent alarm sounds from being missed by the staff.
Objective: The purpose of this study was to design a method for classifying multiple alarm sounds collected with a monaural microphone in a noisy environment.
Methods: Features of 7 alarm sounds were extracted using a mel filter bank and incorporated into a classifier using convolutional and recurrent neural networks. To estimate its clinical usefulness, the classifier was evaluated with mixtures of up to 7 alarm sounds and hospital ward noise.
Results: The proposed convolutional recurrent neural network model was evaluated using a simulation dataset of 7 alarm sounds mixed with hospital ward noise. At a signal-to-noise ratio of 0 dB, the best-performing model (convolutional neural network 3+bidirectional gated recurrent unit) achieved an event-based F1-score of 0.967, with a precision of 0.944 and a recall of 0.991. When the venous foot pump class was excluded, the classwise recall of the classifier ranged from 0.990 to 1.000.
Conclusions: The proposed classifier was found to be highly accurate in detecting alarm sounds. Although the performance of the proposed classifier in a clinical environment can be improved, the classifier could be incorporated into an alarm sound detection system. The classifier, combined with network connectivity, could improve the notification of abnormal status detected by unconnected medical devices.
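Building a simulation dataset at a fixed signal-to-noise ratio amounts to scaling the noise so that the ratio of alarm power to noise power matches the target before the two are added. The sketch below shows that arithmetic with a synthetic tone and uniform random noise; the sampling rate, tone frequency, and function names are assumptions and do not reproduce the study's recordings or pipeline.

```python
import math
import random

def power(signal):
    """Mean power of a sampled signal."""
    return sum(x * x for x in signal) / len(signal)

def measure_snr_db(signal, noise):
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(power(signal) / power(noise))

def mix_at_snr(alarm, noise, target_db):
    """Scale the noise to reach target_db, then add it to the alarm."""
    target_noise_power = power(alarm) / (10 ** (target_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    mixed = [a + gain * n for a, n in zip(alarm, noise)]
    return mixed, gain

# Hypothetical 0.1 s alarm tone (1 kHz at 16 kHz sampling) and ward-noise stand-in.
rate, freq, length = 16000, 1000, 1600
alarm = [math.sin(2 * math.pi * freq * t / rate) for t in range(length)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(length)]

mixed, gain = mix_at_snr(alarm, noise, target_db=0.0)
achieved = measure_snr_db(alarm, [gain * n for n in noise])
print(f"noise gain {gain:.3f}, achieved SNR {achieved:.2f} dB")
```

At 0 dB the alarm and the scaled noise carry equal power, which is the condition under which the reported F1-score was measured.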
Background: Predicting serious hematological adverse events (SHAEs) from poly(adenosine diphosphate ribose) polymerase inhibitors (PARPis) would allow us to prioritize patients with ovarian cancer at higher risk for more intensive care, ultimately lowering morbidity and preventing premature termination of medication.
Objective: This study aimed to explore the risk factors for SHAEs in patients with ovarian cancer receiving PARPi treatment and develop a risk prediction model for such events.
Methods: Prospective clinical data were collected on patients with ovarian cancer who received PARPi treatment at the Guangxi Medical University Affiliated Tumor Hospital from December 2018 to August 2024. They were divided into a SHAE group and a no-SHAE group based on the occurrence of SHAEs. Variable differences were screened using the chi-square test or Fisher exact test. Multivariate logistic regression was used to determine independent factors influencing SHAEs in patients with ovarian cancer. A predictive model for serious blood-related complications in ovarian cancer treatment was developed from identified independent risk factors using the R software. The model's clinical utility was assessed through decision curve analysis (net benefit), calibration (calibration curve), and discrimination (receiver operating characteristic curve).
Results: A total of 70 patients with ovarian cancer receiving PARPi treatment were included in this study. Of these 70 patients, 16 (23%) experienced SHAEs, with decreases in red blood cell (RBC) count and hemoglobin levels being the most common. Multiple logistic regression analysis identified 4 independent predictors of PARPi-associated SHAEs in patients with ovarian cancer: lymph node metastasis (odds ratio [OR] 6.733, 95% CI 1.197-37.873; P=.03), creatinine clearance rate of ≤60 mL per minute (OR 23.722, 95% CI 3.121-180.303; P=.002), RBC count of ≤3.3×10¹² per liter (OR 4.847, 95% CI 1.020-23.041; P=.047), and combination therapy with vascular endothelial growth factor inhibitors (OR 6.749, 95% CI 1.313-34.689; P=.02). The internal validation yielded an area under the curve of 0.874 (95% CI 0.793-0.955), indicating moderate clinical utility and accuracy for the risk prediction model incorporating these predictors.
Conclusions: Lymph node metastasis, creatinine clearance rate of ≤60 mL per minute, RBC count of ≤3.3×10¹² per liter, and combination therapy with vascular endothelial growth factor inhibitors are independent risk factors for PARPi SHAEs in patients with ovarian cancer. The risk prediction model established based on these factors demonstrated moderate predictive value.
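The area under the receiver operating characteristic curve used to assess discrimination can be read as the probability that a randomly chosen patient with an SHAE receives a higher predicted risk than a randomly chosen patient without one. The sketch below computes it by pairwise comparison; the study used R, and the predicted risks and labels here are hypothetical rather than study data.

```python
def auc(scores, labels):
    """AUC by pairwise comparison; ties between a case and a non-case count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks; label 1 marks a patient who had an SHAE.
scores = [0.90, 0.80, 0.40, 0.35, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

print(f"AUC = {auc(scores, labels):.3f}")
```

The pairwise form is quadratic in the number of patients, which is acceptable for a cohort of this size; larger datasets would typically use a rank-based computation.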

