Pub Date : 2024-06-26 DOI: 10.1186/s13040-024-00372-2
Yunfei Yin, Zheng Yuan, Islam Md Tanvir, Xianjian Bao
Missing data in electronic medical records seriously limit the practical use of biomedical data, so imputing these values effectively is a meaningful research problem. Current state-of-the-art methods use Generative Adversarial Networks (GANs) to fill the missing values of electronic medical records and have achieved breakthrough progress. However, on datasets with high missing rates, the imputation accuracy of these methods decreases sharply. This motivates us to explore the uncertainty of GANs and improve GAN-based imputation methods. In this paper, we propose the GRUD (Gate Recurrent Unit Decay) network and the UGAN (Uncertainty Generative Adversarial Network) and combine them into a single model, UGAN-GRUD, which uses the GAN to generate imputed values and then uses GRUD to compensate them. The UGAN iteratively learns the distribution pattern and uncertainty of the data through its Generator and Discriminator. The GRUD network compensates the UGAN output using a time decay factor, which lets it learn the specific temporal relations in electronic medical records. Experiments on publicly available biomedical datasets show that UGAN-GRUD outperforms current state-of-the-art methods, with average improvements of 13% in RMSE (Root Mean Squared Error) and 24.5% in MAPE (Mean Absolute Percentage Error).
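The two reported metrics can be illustrated with a minimal sketch. It assumes the common evaluation setup in which known values are artificially masked, imputed, and then compared with the held-out truth; the abstract does not state the exact protocol, and the function names and toy numbers below are hypothetical.

```python
import math


def rmse(truth, imputed):
    """Root mean squared error over paired values."""
    squared = [(t - p) ** 2 for t, p in zip(truth, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def mape(truth, imputed):
    """Mean absolute percentage error in percent; truth values must be non-zero."""
    ratios = [abs((t - p) / t) for t, p in zip(truth, imputed)]
    return 100.0 * sum(ratios) / len(ratios)


def score_imputation(complete, imputed, held_out):
    """Score only the positions that were masked before imputation (assumed protocol)."""
    truth = [complete[i] for i in held_out]
    guess = [imputed[i] for i in held_out]
    return rmse(truth, guess), mape(truth, guess)


# Toy values, invented for illustration only.
complete = [100.0, 80.0, 120.0, 90.0]
imputed = [100.0, 84.0, 120.0, 81.0]
held_out = [1, 3]  # indices that were hidden from the imputation model

rmse_value, mape_value = score_imputation(complete, imputed, held_out)
print(f"RMSE={rmse_value:.3f}  MAPE={mape_value:.1f}%")
```

Scoring only the masked positions keeps observed values, which any method reproduces trivially, from inflating the result.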
Title: Electronic medical records imputation by temporal Generative Adversarial Network. Journal: Biodata Mining 17(1):19. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11202349/pdf/
Pub Date : 2024-06-22 DOI: 10.1186/s13040-024-00370-4
Yusuf Brima, Marcellin Atemkeng
Deep learning shows great promise for medical image analysis but often lacks explainability, hindering its adoption in healthcare. Attribution techniques that explain model reasoning can potentially increase trust in deep learning among clinical stakeholders. In the literature, much of the research on attribution in medical imaging focuses on visual inspection rather than statistical quantitative analysis. In this paper, we propose an image-based saliency framework to enhance the explainability of deep learning models in medical image analysis. We use adaptive path-based gradient integration, gradient-free techniques, and class activation mapping along with its derivatives to attribute predictions from brain tumor MRI and COVID-19 chest X-ray datasets made by recent deep convolutional neural network models. The proposed framework integrates qualitative and statistical quantitative assessments, employing Accuracy Information Curves (AICs) and Softmax Information Curves (SICs) to measure the effectiveness of saliency methods in retaining critical image information and their correlation with model predictions. Visual inspections indicate that methods such as ScoreCAM, XRAI, GradCAM, and GradCAM++ consistently produce focused and clinically interpretable attribution maps. These methods highlighted possible biomarkers, exposed model biases, and offered insights into the links between input features and predictions, demonstrating their ability to elucidate model reasoning on these datasets. Empirical evaluations reveal that ScoreCAM and XRAI are particularly effective in retaining relevant image regions, as reflected in their higher AUC values. However, SICs highlight variability, with instances of random saliency masks outperforming established methods, emphasizing the need to combine visual and empirical metrics for a comprehensive evaluation. The results underscore the importance of selecting appropriate saliency methods for specific medical imaging tasks and suggest that combining qualitative and quantitative approaches can enhance the transparency, trustworthiness, and clinical adoption of deep learning models in healthcare. This study advances model explainability to increase trust in deep learning among healthcare stakeholders by revealing the rationale behind predictions. Future research should refine empirical metrics for stability and reliability, include more diverse imaging modalities, and focus on improving model explainability to support clinical decision-making.
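To make the class activation mapping family concrete, the sketch below shows the standard Grad-CAM combination step. It is not the authors' pipeline. It assumes the feature maps of one convolutional layer and the gradients of the class score with respect to them are already available as nested lists; the final rescaling to [0, 1] is a common visualization convention rather than part of the definition, and the toy arrays are invented.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps of shape (H, W) into one Grad-CAM heatmap.

    activations[k][i][j]: value of feature map k at pixel (i, j).
    gradients[k][i][j]:   gradient of the class score with respect to that value.
    """
    n_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weight alpha_k: spatial average of the gradients of map k.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_maps)
    ]

    # Weighted sum of the feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Optional rescaling to [0, 1] for display (assumed convention).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: two 2x2 feature maps, one with positive and one with negative gradients.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The ReLU keeps only regions that push the class score up, which is why the pixel supported by the negatively weighted map drops out of the toy heatmap.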
Title: Saliency-driven explainable deep learning in medical imaging: bridging visual explainability and statistical quantitative analysis. Journal: Biodata Mining 17(1):18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11193223/pdf/
Pub Date : 2024-06-18 DOI: 10.1186/s13040-024-00371-3
Zhiping Paul Wang, Priyanka Bhandary, Yizhou Wang, Jason H Moore
GPT-4, as the most advanced version of OpenAI's large language models, has attracted widespread attention, rapidly becoming an indispensable AI tool across various areas. This includes its exploration by scientists for diverse applications. Our study focused on assessing GPT-4's capabilities in generating text, tables, and diagrams for biomedical review papers. We also assessed the consistency in text generation by GPT-4, along with potential plagiarism issues when employing this model for the composition of scientific review papers. Based on the results, we suggest the development of enhanced functionalities in ChatGPT, aiming to meet the needs of the scientific community more effectively. This includes enhancements in uploaded document processing for reference materials, a deeper grasp of intricate biomedical concepts, more precise and efficient information distillation for table generation, and a further refined model specifically tailored for scientific diagram creation.
Title: Using GPT-4 to write a scientific review article: a pilot evaluation study. Journal: Biodata Mining 17(1):16. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11184879/pdf/
Pub Date : 2024-06-11 DOI: 10.1186/s13040-024-00368-y
Carolina Del-Valle-Soto, Ramon A Briseño, Leonardo J Valdivia, Juan Arturo Nolazco-Flores
The development of neuroscientific techniques enabling the recording of brain and peripheral nervous system activity has fueled research in cognitive science. Recent technological advancements offer new possibilities for inducing behavioral change, particularly through cost-effective Internet-based interventions. However, limitations in laboratory equipment volume have hindered the generalization of results to real-life contexts. The advent of Internet of Things (IoT) devices, such as wearables, equipped with sensors and microchips, has ushered in a new era in behavior change techniques. Wearables, including smartwatches, electronic tattoos, and more, are poised for massive adoption, with an expected annual growth rate of 55% over the next five years. These devices enable personalized instructions, leading to increased productivity and efficiency, particularly in industrial production. Additionally, the healthcare sector has seen a significant demand for wearables, with over 80% of global consumers willing to use them for health monitoring. This research explores the primary biometric applications of wearables and their impact on users' well-being, focusing on the integration of behavior change techniques facilitated by IoT devices. Wearables have revolutionized health monitoring by providing real-time feedback, personalized interventions, and gamification. They encourage positive behavior changes by delivering immediate feedback, tailored recommendations, and gamified experiences, leading to sustained improvements in health. Furthermore, wearables seamlessly integrate with digital platforms, enhancing their impact through social support and connectivity. However, privacy and data security concerns must be addressed to maintain users' trust. As technology continues to advance, the refinement of IoT devices' design and functionality is crucial for promoting behavior change and improving health outcomes. 
This study aims to investigate the effects of behavior change techniques facilitated by wearables on individuals' health outcomes and the role of wearables in promoting a healthier lifestyle.
Title: Unveiling wearables: exploring the global landscape of biometric applications and vital signs and behavioral impact. Journal: Biodata Mining 17(1):15. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11165804/pdf/
A knowledge graph can effectively showcase the essential characteristics of data and is increasingly emerging as a significant means of integrating information in the field of artificial intelligence. Coronary artery plaque represents a significant etiology of cardiovascular events, posing a diagnostic challenge for clinicians who are confronted with a multitude of nonspecific symptoms. To visualize the hierarchical relationship network graph of the molecular mechanisms underlying plaque properties and symptom phenotypes, patient symptomatology was extracted from electronic health record data from real-world clinical settings. Phenotypic networks were constructed utilizing clinical data and protein‒protein interaction networks. Machine learning techniques, including convolutional neural networks, Dijkstra's algorithm, and gene ontology semantic similarity, were employed to quantify clinical and biological features within the network. The resulting features were then utilized to train a K-nearest neighbor model, yielding 23 symptoms, 41 association rules, and 61 hub genes across the three types of plaques studied, achieving an area under the curve of 92.5%. Weighted correlation network analysis and pathway enrichment were subsequently utilized to identify lipid status-related genes and inflammation-associated pathways that could help explain the differences in plaque properties. To confirm the validity of the network graph model, we conducted coexpression analysis of the hub genes to evaluate their potential diagnostic value. Additionally, we investigated immune cell infiltration, examined the correlations between hub genes and immune cells, and validated the reliability of the identified biological pathways. By integrating clinical data and molecular network information, this biomedical knowledge graph model effectively elucidated the potential molecular mechanisms that link symptoms, diseases, and molecules.
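As an illustration of how Dijkstra's algorithm can turn a network into a numeric feature, the sketch below computes shortest-path distances from a symptom node to gene nodes in a small weighted graph. The graph, node names, and edge weights are hypothetical, and the abstract does not say how the authors weighted or used these distances.

```python
import heapq


def dijkstra(graph, source):
    """Shortest-path distance from source to every reachable node.

    graph maps each node to a dict {neighbor: non-negative edge weight}.
    """
    distances = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > distances.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances


# Hypothetical symptom-gene network, invented for illustration only.
toy_graph = {
    "symptom_A": {"gene_1": 1.0, "gene_2": 4.0},
    "gene_1": {"symptom_A": 1.0, "gene_2": 2.0, "gene_3": 6.0},
    "gene_2": {"symptom_A": 4.0, "gene_1": 2.0, "gene_3": 1.0},
    "gene_3": {"gene_1": 6.0, "gene_2": 1.0},
}
distances = dijkstra(toy_graph, "symptom_A")
print(distances)
```

A distance vector like this could serve as one input column for a downstream classifier such as the K-nearest neighbor model mentioned above.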
Pub Date : 2024-05-21 DOI: 10.1186/s13040-024-00365-1
Jia-Ming Huan, Xiao-Jie Wang, Yuan Li, Shi-Jun Zhang, Yuan-Long Hu, Yun-Lun Li
Title: The biomedical knowledge graph of symptom phenotype in coronary artery plaque: machine learning-based analysis of real-world clinical data. Journal: Biodata Mining 17(1):13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11110203/pdf/
Recent research has found a strong correlation between the triglyceride-glucose (TyG) index or the atherogenic index of plasma (AIP) and cardiovascular disease (CVD) risk. However, there is a lack of research on non-invasive and rapid prediction of cardiovascular risk. We aimed to develop and validate a machine-learning model for predicting cardiovascular risk based on variables encompassing clinical questionnaires and oculomics. We collected data from the Korean National Health and Nutrition Examination Survey (KNHANES). The training dataset (80% of the 2008 to 2011 KNHANES data) was used for machine learning model development, with internal validation using the remaining 20%. An external validation dataset from the year 2012 assessed the model’s predictive capacity for TyG-index or AIP in new cases. We included 32,122 participants in the final dataset. Machine learning models using 25 algorithms were trained on oculomics measurements and clinical questionnaires to predict the range of TyG-index and AIP. The area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, and F1 score were used to evaluate the performance of our machine learning models. Based on large-scale cohort studies, we determined TyG-index cut-off points at 8.0, 8.75 (upper one-third values), 8.93 (upper one-fourth values), and AIP cut-offs at 0.318, 0.34. Values surpassing these thresholds indicated elevated cardiovascular risk. For the TyG-index cut-offs at 8.0, 8.75, and 8.93, the best-performing algorithm achieved internal validation AUCs of 0.812, 0.873, and 0.911, respectively. External validation AUCs were 0.809, 0.863, and 0.901. For AIP at 0.34, internal and external validation achieved similar AUCs of 0.849 and 0.842. Slightly lower performance was seen for the 0.318 cut-off, with AUCs of 0.844 and 0.836. Significant gender-based variations were noted for TyG-index at 8 (male AUC=0.832, female AUC=0.790) and 8.75 (male AUC=0.874, female AUC=0.862) and AIP at 0.318 (male AUC=0.853, female AUC=0.825) and 0.34 (male AUC=0.858, female AUC=0.831). Gender similarity in AUC (male AUC=0.907 versus female AUC=0.906) was observed only when the TyG-index cut-off point equals 8.93. We have established a simple and effective non-invasive machine learning model that has good clinical value for predicting cardiovascular risk in the general population.
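The prediction targets can be illustrated by computing the two indices and comparing them with the cut-offs quoted above. The abstract does not give the formulas, so this sketch assumes the commonly used definitions: TyG = ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2) and AIP = log10(triglycerides / HDL cholesterol) with both in mmol/L. The function names and the patient values are hypothetical.

```python
import math

# Cut-offs quoted in the abstract.
TYG_CUTOFFS = (8.0, 8.75, 8.93)
AIP_CUTOFFS = (0.318, 0.34)


def tyg_index(triglycerides_mg_dl, glucose_mg_dl):
    """TyG index under the commonly used definition (assumed, not from the abstract)."""
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2.0)


def atherogenic_index(triglycerides_mmol_l, hdl_mmol_l):
    """AIP under the commonly used definition (assumed, not from the abstract)."""
    return math.log10(triglycerides_mmol_l / hdl_mmol_l)


def exceeds(value, cutoffs):
    """Map each cut-off to True when the value surpasses it (elevated-risk label)."""
    return {cutoff: value > cutoff for cutoff in cutoffs}


# Invented example values.
tyg = tyg_index(triglycerides_mg_dl=150.0, glucose_mg_dl=100.0)
aip = atherogenic_index(triglycerides_mmol_l=2.0, hdl_mmol_l=1.0)
tyg_flags = exceeds(tyg, TYG_CUTOFFS)
aip_flags = exceeds(aip, AIP_CUTOFFS)
print(f"TyG={tyg:.3f} {tyg_flags}")
print(f"AIP={aip:.3f} {aip_flags}")
```

Each cut-off defines a separate binary label, which matches the abstract reporting one AUC per cut-off.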
Pub Date : 2024-04-22 DOI: 10.1186/s13040-024-00363-3
Yuqi Zhang, Sijin Li, Weijie Wu, Yanqing Zhao, Jintao Han, Chao Tong, Niansang Luo, Kun Zhang
Title: Machine-learning-based models to predict cardiovascular risk using oculomics and clinic variables in KNHANES. Journal: Biodata Mining.
Pub Date : 2024-04-17 DOI: 10.1186/s13040-024-00362-4
Selcen Ari Yuka, Alper Yilmaz
Competing endogenous RNAs play key roles in cellular molecular mechanisms through cross-talk in post-transcriptional interactions. Studies on ceRNA cross-talk, which is particularly dependent on the abundance of free transcripts, generally involve large- and small-scale studies involving the integration of transcriptomic data from tissues and correlation analyses. This abundance-dependent nature of ceRNA interactions suggests that tissue- and condition-specific ceRNA dynamics may fluctuate. However, there are no comprehensive studies investigating the ceRNA interactions in normal tissue, ceRNAs that are lost and/or appear in cancerous tissues or their interactions. In this study, we comprehensively analyzed the tumor-specific ceRNA fluctuations observed in the three highest-incidence cancers, LUAD, PRAD, and BRCA, compared to healthy lung, prostate, and breast tissues, respectively. Our observations pertaining to tumor-specific competing endogenous RNA (ceRNA) interactions revealed that, in the cases of lung adenocarcinoma (LUAD), prostate adenocarcinoma (PRAD), and breast invasive carcinoma (BRCA), 3,204, 1,233, and 406 ceRNAs, respectively, engage in post-transcriptional intercommunication within tumor tissues, in contrast to their absence in corresponding healthy samples. We also found that 90 ceRNAs are shared by the three cancer types and that these ceRNAs participate in ceRNA interactions in tumor tissues compared to those in normal tissues. Among the 90 ceRNAs that directly interact with miRNAs, we uncovered a core network of 165 miRNAs and 63 ceRNAs that should be considered in RNA-targeted and RNA-mediated approaches in future studies and could be used in these three aggressive cancer types. More specifically, in this core interaction network, ceRNAs such as GALNT7, KLF9, and DAB2 and miRNAs like miR-106a/b-5p, miR-20a-5p, and miR-519d-3p may have potential as common targets in the three critical cancers. 
In contrast to conventional methods that construct ceRNA networks using differentially expressed genes compared to normal tissues, our proposed approach identifies ceRNA players by considering their context within the ceRNA:miRNA interactions. Our results have the potential to reveal distinct and common ceRNA interactions in cancer types and to pinpoint critical RNAs, thereby paving the way for RNA-based strategies in the battle against cancer.
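The comparison described in this abstract (ceRNAs that interact in tumor tissue but not in the matching healthy tissue, then the subset shared by all three cancers) can be illustrated with plain set operations. This is only a minimal sketch on made-up toy data, not the authors' pipeline: the function name `tumor_specific`, the `GENE_*` placeholders, and the tissue memberships are all hypothetical.

```python
# Toy, hypothetical data: ceRNAs found in miRNA:ceRNA interactions per tissue.
# The GENE_* names are placeholders; memberships are invented for illustration.
tumor = {
    "LUAD": {"GALNT7", "KLF9", "DAB2", "GENE_A", "GENE_B"},
    "PRAD": {"GALNT7", "KLF9", "DAB2", "GENE_C"},
    "BRCA": {"GALNT7", "KLF9", "DAB2", "GENE_B"},
}
normal = {
    "LUAD": {"GENE_A"},
    "PRAD": {"GENE_D"},
    "BRCA": {"GENE_E"},
}


def tumor_specific(tumor_sets, normal_sets):
    """Return, per cancer type, ceRNAs present in tumor but absent in normal."""
    return {cancer: tumor_sets[cancer] - normal_sets[cancer] for cancer in tumor_sets}


specific = tumor_specific(tumor, normal)

# ceRNAs that are tumor-specific in every cancer type (the "shared" set).
shared = set.intersection(*specific.values())

for cancer, genes in specific.items():
    print(cancer, len(genes), sorted(genes))
print("shared:", sorted(shared))
```

On real data the per-tissue sets would come from the inferred interaction networks; the set difference and intersection steps stay the same.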
Title: Decoding dynamic miRNA:ceRNA interactions unveils therapeutic insights and targets across predominant cancer landscapes. Journal: Biodata Mining (IF 4.5). https://doi.org/10.1186/s13040-024-00362-4
Pub Date : 2024-04-16 DOI: 10.1186/s13040-024-00361-5
Jianchang Hu, Silke Szymczak
Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF in which the network information is summarized into a sampling probability for the predictor variables, which is then used in the construction of the RF. Our simulation results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent of the genes in the given network, using network information can produce spurious gene selection results, especially for hub genes. Our empirical analysis of two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better-connected module of identified genes. Gene networks can provide additional information to aid gene expression analysis for disease module and pathway identification. However, they need to be used with caution, and the results need to be validated to guard against spurious gene selection. More robust approaches to incorporating such information into RF construction also warrant further study.
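The idea of summarizing a gene network into a sampling probability for predictor variables can be sketched as follows. The abstract does not state how the probabilities are derived, so this example assumes they are proportional to node degree plus a smoothing constant; that weighting, the function names, and the toy network are all hypothetical and are not the authors' method.

```python
import random


def sampling_probabilities(edges, genes, smoothing=1.0):
    """Map each gene to a sampling probability derived from the network.

    Assumption (not from the paper): probability is proportional to
    node degree + smoothing, so isolated genes can still be drawn.
    """
    degree = {gene: 0 for gene in genes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    weights = {gene: degree[gene] + smoothing for gene in genes}
    total = sum(weights.values())
    return {gene: weight / total for gene, weight in weights.items()}


def draw_candidate_genes(probabilities, n_candidates, rng):
    """Draw distinct candidate genes for one tree, weighted by the network."""
    remaining = dict(probabilities)
    chosen = []
    for _ in range(n_candidates):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Toy undirected network: one hub gene linked to three others, one isolated gene.
genes = ["HUB", "G1", "G2", "G3", "ISOLATED"]
edges = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3")]

probs = sampling_probabilities(edges, genes)
candidates = draw_candidate_genes(probs, n_candidates=3, rng=random.Random(0))
print(probs)
print(candidates)
```

Under this assumed weighting the hub gene is drawn most often regardless of its association with disease status, which is one way to picture the spurious selection of hub genes that the abstract warns about.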
Title: Evaluation of network-guided random forest for disease gene discovery. Journal: Biodata Mining (IF 4.5). https://doi.org/10.1186/s13040-024-00361-5
Pub Date : 2024-03-05 DOI: 10.1186/s13040-024-00360-6
Xiaohui Yao, Xiaohan Jiang, Haoran Luo, Hong Liang, Xiufen Ye, Yanhui Wei, Shan Cong
Integrating multi-omics data is emerging as a critical approach to enhancing our understanding of complex diseases. Innovative computational methods capable of managing high-dimensional and heterogeneous datasets are required to unlock the full potential of such rich and diverse data. We propose a Multi-Omics integration framework with auxiliary Classifiers-enhanced AuToencoders (MOCAT) to utilize intra- and inter-omics information comprehensively. Additionally, attention mechanisms with confidence learning are incorporated for enhanced feature representation and trustworthy prediction. Extensive experiments were conducted on four benchmark datasets (BRCA, ROSMAP, LGG, and KIPAN) to evaluate the effectiveness of our proposed model. Our model significantly improved most evaluation measurements and consistently surpassed the state-of-the-art methods. Ablation studies showed that the auxiliary classifiers significantly boosted classification accuracy on the ROSMAP and LGG datasets. Moreover, the attention mechanisms and confidence evaluation block contributed to improvements in the predictive accuracy and generalizability of our model. The proposed framework exhibits superior performance in disease classification and biomarker discovery, establishing itself as a robust and versatile tool for analyzing multi-layer biological data. This study highlights the significance of carefully designed deep learning methodologies in dissecting complex disease phenotypes and improving the accuracy of disease predictions.
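One common way to attach an auxiliary classifier to an autoencoder is to train both with a joint objective: a reconstruction term plus a weighted classification term. The abstract does not give MOCAT's loss, so the sketch below is an assumption for illustration only; the function names, the mean-squared-error and cross-entropy choices, and the `aux_weight` parameter are hypothetical.

```python
import math


def mse(x, x_hat):
    """Mean squared reconstruction error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(x, x_hat, logits, label, aux_weight=0.5):
    """Assumed joint objective: reconstruction loss + aux_weight * classifier loss."""
    return mse(x, x_hat) + aux_weight * cross_entropy(logits, label)


# Toy values for a single omics sample (hypothetical numbers).
x = [0.2, 0.8, 0.5]       # input features
x_hat = [0.1, 0.9, 0.5]   # autoencoder reconstruction
logits = [2.0, 0.5]       # auxiliary classifier scores for two classes
print(joint_loss(x, x_hat, logits, label=0))
```

In a framework with one autoencoder per omics layer, a term like this would be summed over layers; the weighting between the two terms is a tuning choice.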
Title: MOCAT: multi-omics integration with auxiliary classifiers enhanced autoencoder. Journal: Biodata Mining (IF 4.5). https://doi.org/10.1186/s13040-024-00360-6
Background: Breast cancer is the most common malignancy among women worldwide. Despite advances in treating breast cancer over the past decades, drug resistance and adverse effects remain challenging. Recent therapeutic progress has shifted toward using drug combinations for better treatment efficiency. However, with a growing number of potential small-molecule cancer inhibitors, in silico strategies to predict pharmacological synergy before experimental trials are required to compensate for time and cost restrictions. Many deep learning models have been previously proposed to predict the synergistic effects of drug combinations with high performance. However, these models heavily relied on a large number of drug chemical structural fingerprints as their main features, which made model interpretation a challenge.
Results: This study developed a deep neural network model that predicts synergy between small-molecule pairs based on their inhibitory activities against 13 selected key proteins. The synergy prediction model achieved a Pearson correlation coefficient of 0.63 between model predictions and experimental data across five breast cancer cell lines. When individual cell lines were considered, BT-549 and MCF-7 achieved the highest correlation, 0.67. Although the correlation is moderate compared with previous deep learning models, our model offers a distinctive advantage in terms of interpretability. Using the inhibitory activities against key protein targets as the main features allowed a straightforward interpretation of the model, since the individual features had direct biological meaning. By tracing the synergistic interactions of compounds through their target proteins, we gained insights into the patterns our model recognized as indicative of synergistic effects.
Conclusions: The framework employed in the present study lays the groundwork for future advancements, especially in model interpretation. By combining deep learning techniques and target-specific models, this study shed light on potential patterns of target-protein inhibition profiles that could be exploited in breast cancer treatment.
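The evaluation metric reported in the Results, the Pearson correlation coefficient between predicted and experimental synergy, can be computed as in the sketch below. The function name `pearson_r` and the synergy scores are hypothetical toy values, not data from the study.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)


# Hypothetical synergy scores for a handful of drug pairs in one cell line.
predicted_synergy = [4.1, -2.0, 7.5, 0.3, 3.2]
experimental_synergy = [5.0, -1.2, 6.1, 2.4, 1.0]
print(round(pearson_r(predicted_synergy, experimental_synergy), 3))
```

Computing the coefficient separately per cell line and once over the pooled pairs mirrors the two kinds of figures quoted in the abstract.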
Title: Interpreting drug synergy in breast cancer with deep learning using target-protein inhibition profiles.
Pub Date : 2024-02-29 DOI: 10.1186/s13040-024-00359-z
Thanyawee Srithanyarat, Kittisak Taoma, Thana Sutthibutpong, Marasri Ruengjitchatchawalya, Monrudee Liangruksa, Teeraphan Laomettachit
Journal: Biodata Mining 17(1):8 (IF 4.5). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10905801/pdf/