Problem: Pancreatic ductal adenocarcinoma (PDAC) is a highly lethal cancer, largely because it is usually diagnosed at an advanced stage. The five-year survival rate after diagnosis is less than 10%, but it can reach up to 70% when the disease is detected early. Early diagnosis of PDAC therefore enables timely intervention and improves survival. The challenge is to develop a reliable, data privacy-aware machine learning approach that can accurately diagnose pancreatic cancer from biomarkers.
Aim: The study aims to diagnose pancreatic cancer while preserving the confidentiality of patient records. In addition, it aims to guide researchers and clinicians in developing innovative methods for diagnosing pancreatic cancer.
Methods: Machine learning, a branch of artificial intelligence, can identify patterns by analyzing large datasets. The study pre-processed a dataset of urine biomarkers with operations such as filling in missing values, cleaning outliers, and feature selection. The data was encrypted using the Fernet encryption algorithm to ensure confidentiality. Ten separate machine learning models were applied to predict individuals with PDAC, and performance metrics such as F1 score, recall, precision, and accuracy were used to evaluate them.
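The abstract names the preprocessing steps but not their exact rules, and the Fernet step relies on the third-party `cryptography` package, so the sketch below covers only the cleaning stage. It uses median imputation for missing values and an interquartile-range fence for outliers, which is one common choice and an assumption here, not the paper's published procedure:

```python
import statistics

def preprocess(values):
    """Illustrative cleaning for one biomarker column; None marks a missing entry.

    Assumed rules (the paper does not publish its exact ones):
    1) fill missing entries with the column median,
    2) drop points outside the IQR fence [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    """
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in values]

    q1, _, q3 = statistics.quantiles(filled, n=4)  # quartiles of the imputed column
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in filled if lo <= v <= hi]

# One missing value imputed, one implausible spike (120.0) removed by the fence.
cleaned = preprocess([4.1, None, 3.9, 4.4, 120.0, 4.0])
```

The same two passes would be applied per biomarker column before feature selection and model training.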
Results: Among the 590 clinical records analyzed, 199 (33.7%) belonged to patients with pancreatic cancer, 208 (35.3%) to patients with non-cancerous pancreatic disorders (such as benign hepatobiliary disease), and 183 (31%) to healthy individuals. The LGBM algorithm was the most effective, achieving an accuracy of 98.8%; the accuracy of the other algorithms ranged from 86% to 98%. To understand which features the model relies on, a feature importance analysis was performed, showing that plasma_CA19_9, REG1A, TFF1, and LYVE1 have high importance levels. LIME analysis further examined which features drive the model's decision-making process.
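The abstract reports the trained model's feature importances plus LIME explanations; neither is reproduced here. As a model-agnostic stand-in, the sketch below implements permutation importance in pure Python: shuffle one feature column at a time and measure the drop in accuracy. The threshold classifier, biomarker values, and cutoff are all hypothetical illustrations, not the paper's model:

```python
import random

def accuracy(model, X, y):
    """Fraction of rows the model labels correctly."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Drop in accuracy when one feature column is shuffled; a bigger drop
    means the model leans more heavily on that feature."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        scores.append(base - accuracy(model, X_perm, y))
    return scores

# Toy stand-in model: predicts cancer (1) when the first biomarker exceeds a
# cutoff; the second feature is pure noise, so shuffling it cannot hurt accuracy.
model = lambda row: 1 if row[0] > 37.0 else 0
X = [[55.0, 0.2], [12.0, 0.9], [80.0, 0.5], [5.0, 0.1], [60.0, 0.7], [20.0, 0.3]]
y = [1, 0, 1, 0, 1, 0]
importances = permutation_importance(model, X, y, n_features=2)
```

In practice one would read `feature_importances_` from the fitted LightGBM model and pass it to LIME's tabular explainer for per-patient explanations; the permutation approach above is simply the library-free analogue.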
Conclusions: This research outlines a data privacy-aware machine learning tool for predicting PDAC. The results indicate a promising approach for clinical application. Future research should expand the dataset and validate the approach in diverse populations.
Background: Artificial intelligence (AI) is increasingly used for prevention, diagnosis, monitoring, and treatment of cardiovascular diseases. Despite the potential for AI to improve care, ethical concerns and mistrust in AI-enabled healthcare exist among the public and medical community. Given the rapid, transformative growth of AI in cardiovascular care, we conducted a literature review to identify key ethical and trust barriers and facilitators from the perspectives of patients and healthcare providers, with the aim of informing practice guidelines and regulatory policies that facilitate ethical and trustworthy use of AI in medicine.
Methods: In this rapid literature review, we searched six bibliographic databases to identify publications discussing transparency, trust, or ethical concerns (outcomes of interest) associated with AI-based medical devices (interventions of interest) in the context of cardiovascular care from patients', caregivers', or healthcare providers' perspectives. The search was completed on May 24, 2022, and was not limited by date or study design.
Results: After reviewing 7,925 papers from six databases and 3,603 papers identified through citation chasing, 145 articles were included. Key ethical concerns included privacy, security, or confidentiality issues (n = 59, 40.7%); risk of healthcare inequity or disparity (n = 36, 24.8%); risk of patient harm (n = 24, 16.6%); accountability and responsibility concerns (n = 19, 13.1%); problematic informed consent and potential loss of patient autonomy (n = 17, 11.7%); and issues related to data ownership (n = 11, 7.6%). Major trust barriers included data privacy and security concerns, potential risk of patient harm, perceived lack of transparency about AI-enabled medical devices, concerns about AI replacing human aspects of care, concerns about prioritizing profits over patients' interests, and lack of robust evidence related to the accuracy and limitations of AI-based medical devices. Ethical and trust facilitators included ensuring data privacy and data validation, conducting clinical trials in diverse cohorts, providing appropriate training and resources to patients and healthcare providers and improving their engagement in different phases of AI implementation, and establishing further regulatory oversights.
Conclusion: This review revealed key ethical concerns as well as barriers to and facilitators of trust in AI-enabled medical devices from patients' and healthcare providers' perspectives. Successful integration of AI into cardiovascular care necessitates implementation of mitigation strategies. These strategies should focus on enhanced regulatory oversight of the use of patient data and on promoting transparency around the use of AI in patient care.
Background: The integrity of clinical research and machine learning models in healthcare heavily relies on the quality of underlying clinical laboratory data. However, the preprocessing of this data to ensure its reliability and accuracy remains a significant challenge due to variations in data recording and reporting standards.
Methods: We developed lab2clean, a novel algorithm aimed at automating and standardizing the cleaning of retrospective clinical laboratory results data. lab2clean was implemented as two R functions specifically designed to enhance data conformance and plausibility by standardizing result formats and validating result values. The functionality and performance of the algorithm were evaluated using two extensive electronic medical record (EMR) databases, encompassing various clinical settings.
Results: lab2clean effectively reduced the variability of laboratory results and identified potentially erroneous records. Upon deployment, it standardized and validated large volumes of laboratory records quickly and effectively. The evaluation highlighted significant improvements in the conformance and plausibility of lab results, confirming the algorithm's efficacy on large-scale datasets.
Conclusions: lab2clean addresses the challenge of preprocessing and cleaning clinical laboratory data, a critical step in ensuring high-quality data for research outcomes. It offers a straightforward, efficient tool for researchers, improving the quality of clinical laboratory data, a major portion of healthcare data, thereby enhancing the reliability and reproducibility of clinical research outcomes and clinical machine learning models. Future developments aim to broaden its functionality and accessibility, solidifying its vital role in healthcare data management.
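lab2clean itself is a pair of R functions whose exact rules are not reproduced in the abstract. As a language-neutral illustration of its two ideas, the Python sketch below shows a conformance step (coercing free-text results such as decimal-comma numbers or comparator-prefixed values like "<5" to a canonical form) and a plausibility step (range validation). The regex and the range bounds are assumptions for illustration, not lab2clean's actual logic:

```python
import re

def standardize_result(raw):
    """Conformance step: coerce a free-text lab result to a canonical numeric form.

    Handles decimal commas, stray whitespace, and comparator prefixes like '<5'.
    Returns (comparator, value), or None if the string is not numeric.
    (Illustrative only; the lab2clean R functions implement richer rules.)
    """
    s = raw.strip().replace(",", ".")
    m = re.fullmatch(r"(<=?|>=?)?\s*(\d+(?:\.\d+)?)", s)
    if not m:
        return None
    return m.group(1) or "=", float(m.group(2))

def is_plausible(value, low, high):
    """Plausibility step: flag values outside a plausible range for the test."""
    return low <= value <= high

# '7,2' (decimal comma) and ' <5 ' both normalize; a text result does not.
print(standardize_result("7,2"))        # ('=', 7.2)
print(standardize_result(" <5 "))       # ('<', 5.0)
print(standardize_result("hemolyzed"))  # None
```

Records that fail either step would be routed to review rather than silently dropped, which matches the goal of identifying potentially erroneous records.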
Background: The worldwide prevalence of type 2 diabetes mellitus in adults is increasing rapidly. This study aimed to identify the factors affecting the survival of prediabetic patients by comparing the Cox proportional hazards (CPH) model and the random survival forest (RSF) model.
Method: This prospective cohort study was performed on 746 prediabetic individuals in southwest Iran. The demographic, lifestyle, and clinical data of the participants were recorded. The CPH and RSF models were used to model the patients' survival, and the concordance index (C-index) and time-dependent receiver operating characteristic (ROC) curve were employed to compare their performance.
Results: The 5-year cumulative T2DM incidence was 12.73%. In the CPH model, NAFLD (HR = 1.74, 95% CI: 1.06, 2.85), FBS (HR = 1.008, 95% CI: 1.005, 1.012), and increased abdominal fat (HR = 1.02, 95% CI: 1.01, 1.04) were directly associated with diabetes occurrence in prediabetic patients. The RSF model identified FBS, waist circumference, depression, NAFLD, afternoon sleep, and female gender as the most important predictors of diabetes. The C-index indicated that the RSF model had higher concordance than the CPH model, and by the weighted Brier score the RSF model had less error than the Kaplan-Meier and CPH models.
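Because a Cox model is linear on the log-hazard scale, a per-unit hazard ratio such as the reported FBS estimate compounds multiplicatively over larger increments. A minimal sketch (the 10-unit increment is an arbitrary illustration, not a quantity from the study):

```python
import math

def scale_hazard_ratio(hr_per_unit, k):
    """HR for a k-unit covariate increase.

    The Cox log-hazard is linear in the covariate, so a k-unit increase adds
    k * beta on the log scale, i.e. the HR is hr_per_unit ** k.
    """
    return math.exp(k * math.log(hr_per_unit))

# The reported per-unit FBS hazard ratio of 1.008, scaled to a 10-unit increase:
hr_10 = scale_hazard_ratio(1.008, 10)
print(round(hr_10, 3))  # 1.083
```

So a seemingly small per-unit HR of 1.008 still implies roughly an 8% higher hazard per 10-unit rise in FBS, which is why such coefficients remain clinically meaningful.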
Conclusion: Our findings show that the incidence of diabetes was alarmingly high in Iran. The results suggested that several demographic and clinical factors are associated with diabetes occurrence in prediabetic patients. The high-risk population needs special measures for screening and care programs.
Background: Predictive modeling based on multi-omics data, which incorporates several types of omics data for the same patients, has shown potential to outperform single-omics predictive modeling. Most research in this domain focuses on incorporating numerous data types, despite the complexity and cost of acquiring them. The prevailing assumption is that increasing the number of data types necessarily improves predictive performance. However, the integration of less informative or redundant data types could potentially hinder this performance. Therefore, identifying the most effective combinations of omics data types that enhance predictive performance is critical for cost-effective and accurate predictions.
Methods: In this study, we systematically evaluated the predictive performance of all 31 possible combinations including at least one of five genomic data types (mRNA, miRNA, methylation, DNAseq, and copy number variation) using 14 cancer datasets with right-censored survival outcomes, publicly available from the TCGA database. We employed various prediction methods and up-weighted clinical data in every model to leverage their predictive importance. Harrell's C-index and the integrated Brier Score were used as performance measures. To assess the robustness of our findings, we performed a bootstrap analysis at the level of the included datasets. Statistical testing was conducted for key results, limiting the number of tests to ensure a low risk of false positives.
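Harrell's C-index used here has a simple definition for right-censored data: among comparable pairs (those where the earlier time is an observed event), count how often the higher predicted risk accompanies the shorter survival, with risk ties counted as one half. A minimal pure-Python sketch with toy data (the risk scores and times are illustrative, not from the study):

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored survival data.

    A pair (i, j) is comparable when the subject with the earlier time had an
    observed event (events[i] == 1). It is concordant when the higher predicted
    risk goes with the shorter survival time; ties in risk count 1/2.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ranked toy cohort: higher predicted risk, earlier observed event.
c = harrell_c_index(times=[2, 5, 8, 11], events=[1, 1, 0, 1],
                    risk_scores=[0.9, 0.6, 0.4, 0.1])
print(c)  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the mRNA-only and multi-omics models are compared.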
Results: Contrary to expectations, we found that using only mRNA data or a combination of mRNA and miRNA data was sufficient for most cancer types. For some cancer types, the additional inclusion of methylation data led to improved prediction results. Far from enhancing performance, the introduction of more data types most often resulted in a decline in performance, which varied between the two performance measures.
Conclusions: Our findings challenge the prevailing notion that combining multiple omics data types in multi-omics survival prediction improves predictive performance. Thus, the widespread approach in multi-omics prediction of incorporating as many data types as possible should be reconsidered to avoid suboptimal prediction results and unnecessary expenditure.