
Biodata Mining: Latest Articles

MOCAT: multi-omics integration with auxiliary classifiers enhanced autoencoder
IF 4.5 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology | Pub Date: 2024-03-05 | DOI: 10.1186/s13040-024-00360-6
Xiaohui Yao, Xiaohan Jiang, Haoran Luo, Hong Liang, Xiufen Ye, Yanhui Wei, Shan Cong
Integrating multi-omics data is emerging as a critical approach to enhancing our understanding of complex diseases. Innovative computational methods capable of managing high-dimensional and heterogeneous datasets are required to unlock the full potential of such rich and diverse data. We propose a Multi-Omics integration framework with auxiliary Classifiers-enhanced AuToencoders (MOCAT) to utilize intra- and inter-omics information comprehensively. Additionally, attention mechanisms with confidence learning are incorporated for enhanced feature representation and trustworthy prediction. Extensive experiments were conducted on four benchmark datasets (BRCA, ROSMAP, LGG, and KIPAN) to evaluate the effectiveness of our proposed model. Our model significantly improved most evaluation measurements and consistently surpassed the state-of-the-art methods. Ablation studies showed that the auxiliary classifiers significantly boosted classification accuracy in the ROSMAP and LGG datasets. Moreover, the attention mechanisms and confidence evaluation block contributed to improvements in the predictive accuracy and generalizability of our model. The proposed framework exhibits superior performance in disease classification and biomarker discovery, establishing itself as a robust and versatile tool for analyzing multi-layer biological data. This study highlights the significance of elaborately designed deep learning methodologies in dissecting complex disease phenotypes and improving the accuracy of disease predictions.
Citations: 0
Interpreting drug synergy in breast cancer with deep learning using target-protein inhibition profiles.
IF 4.5 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology | Pub Date: 2024-02-29 | DOI: 10.1186/s13040-024-00359-z
Thanyawee Srithanyarat, Kittisak Taoma, Thana Sutthibutpong, Marasri Ruengjitchatchawalya, Monrudee Liangruksa, Teeraphan Laomettachit

Background: Breast cancer is the most common malignancy among women worldwide. Despite advances in treating breast cancer over the past decades, drug resistance and adverse effects remain challenging. Recent therapeutic progress has shifted toward using drug combinations for better treatment efficiency. However, with a growing number of potential small-molecule cancer inhibitors, in silico strategies to predict pharmacological synergy before experimental trials are required to compensate for time and cost restrictions. Many deep learning models have been previously proposed to predict the synergistic effects of drug combinations with high performance. However, these models heavily relied on a large number of drug chemical structural fingerprints as their main features, which made model interpretation a challenge.

Results: This study developed a deep neural network model that predicts synergy between small-molecule pairs based on their inhibitory activities against 13 selected key proteins. The synergy prediction model achieved a Pearson correlation coefficient between model predictions and experimental data of 0.63 across five breast cancer cell lines. BT-549 and MCF-7 achieved the highest correlation of 0.67 when considering individual cell lines. Despite achieving a moderate correlation compared to previous deep learning models, our model offers a distinctive advantage in terms of interpretability. Using the inhibitory activities against key protein targets as the main features allowed a straightforward interpretation of the model since the individual features had direct biological meaning. By tracing the synergistic interactions of compounds through their target proteins, we gained insights into the patterns our model recognized as indicative of synergistic effects.

Conclusions: The framework employed in the present study lays the groundwork for future advancements, especially in model interpretation. By combining deep learning techniques and target-specific models, this study shed light on potential patterns of target-protein inhibition profiles that could be exploited in breast cancer treatment.
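
The Results above report model quality as a Pearson correlation coefficient between predicted and experimental synergy. As a minimal sketch of how such a coefficient is computed, the snippet below implements the standard formula in plain Python. The function name `pearson_r` and the example scores are hypothetical and invented for illustration; they are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical synergy scores, invented for illustration only.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(predicted, measured))
```

A value near 1 indicates that predictions rise and fall with the measurements; the reported 0.63 would correspond to a moderate positive linear relationship.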

Citations: 0
Interaction models matter: an efficient, flexible computational framework for model-specific investigation of epistasis.
IF 4 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology | Pub Date: 2024-02-28 | DOI: 10.1186/s13040-024-00358-0
Sandra Batista, Vered Senderovich Madar, Philip J Freda, Priyanka Bhandary, Attri Ghosh, Nicholas Matsumoto, Apurva S Chitre, Abraham A Palmer, Jason H Moore

Purpose: Epistasis, the interaction between two or more genes, is integral to the study of genetics and is present throughout nature. Yet, it is seldom fully explored as most approaches primarily focus on single-locus effects, partly because analyzing all pairwise and higher-order interactions requires significant computational resources. Furthermore, existing methods for epistasis detection only consider a Cartesian (multiplicative) model for interaction terms. This is likely limiting as epistatic interactions can evolve to produce varied relationships between genetic loci, some complex and not linearly separable.

Methods: We present new algorithms for the interaction coefficients for standard regression models for epistasis that permit many varied models for the interaction terms for loci and efficient memory usage. The algorithms are given for two-way and three-way epistasis and may be generalized to higher order epistasis. Statistical tests for the interaction coefficients are also provided. We also present an efficient matrix based algorithm for permutation testing for two-way epistasis. We offer a proof and experimental evidence that methods that look for epistasis only at loci that have main effects may not be justified. Given the computational efficiency of the algorithm, we applied the method to a rat data set and mouse data set, with at least 10,000 loci and 1,000 samples each, using the standard Cartesian model and the XOR model to explore body mass index.

Results: This study reveals that although many of the loci found to exhibit significant statistical epistasis overlap between models in rats, the pairs are mostly distinct. Further, the XOR model found greater evidence for statistical epistasis in many more pairs of loci in both data sets with almost all significant epistasis in mice identified using XOR. In the rat data set, loci involved in epistasis under the XOR model are enriched for biologically relevant pathways.

Conclusion: Our results in both species show that many biologically relevant epistatic relationships would have been undetected if only one interaction model was applied, providing evidence that varied interaction models should be implemented to explore epistatic interactions that occur in living systems.
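
To make the contrast between the Cartesian (multiplicative) and XOR interaction models concrete, here is a small sketch that tabulates both interaction terms for two binary-coded loci. The binary coding and the function names are assumptions made for illustration; the abstract does not specify how genotypes are encoded in the authors' framework.

```python
from itertools import product


def cartesian_term(a, b):
    """Multiplicative (Cartesian) interaction term for two coded loci."""
    return a * b


def xor_term(a, b):
    """XOR interaction term for two binary-coded loci (illustrative coding)."""
    return a ^ b


# Enumerate all four combinations of two binary-coded loci.
rows = [(a, b, cartesian_term(a, b), xor_term(a, b))
        for a, b in product((0, 1), repeat=2)]
for a, b, cart, xor in rows:
    print(f"a={a} b={b} cartesian={cart} xor={xor}")
```

The two terms disagree on three of the four combinations, and no weighted sum of `a` and `b` with a threshold reproduces the XOR column, which is the sense in which such a relationship is not linearly separable.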

Citations: 0
Assessment of the causal relationship between gut microbiota and cardiovascular diseases: a bidirectional Mendelian randomization analysis.
IF 4.5 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology | Pub Date: 2024-02-26 | DOI: 10.1186/s13040-024-00356-2
Xiao-Ce Dai, Yi Yu, Si-Yu Zhou, Shuo Yu, Mei-Xiang Xiang, Hong Ma

Background: Previous studies have shown an association between gut microbiota and cardiovascular diseases (CVDs). However, the underlying causal relationship remains unclear. This study aims to elucidate the causal relationship between gut microbiota and CVDs and to explore the pathogenic role of gut microbiota in CVDs.

Methods: In this two-sample Mendelian randomization study, we used genetic instruments from publicly available genome-wide association studies, including single-nucleotide polymorphisms (SNPs) associated with gut microbiota (n = 14,306) and CVDs (n = 2,207,591). We employed multiple statistical analysis methods, including inverse variance weighting, MR Egger, weighted median, MR pleiotropic residuals and outliers, and the leave-one-out method, to estimate the causal relationship between gut microbiota and CVDs. Additionally, we conducted multiple analyses to assess horizontal pleiotropy and heterogeneity.

Results: GWAS summary data were available from a pooled sample of 2,221,897 adult and adolescent participants. Our findings indicated that specific gut microbiota had either protective or detrimental effects on CVDs. Notably, Howardella (OR = 0.955, 95% CI: 0.913-0.999, P = .05), Intestinibacter (OR = 0.908, 95% CI: 0.831-0.993, P = .03), Lachnospiraceae (NK4A136 group) (OR = 0.904, 95% CI: 0.841-0.973, P = .007), Turicibacter (OR = 0.904, 95% CI: 0.838-0.976, P = .01), Holdemania (OR = 0.898, 95% CI: 0.810-0.995, P = .04) and Odoribacter (OR = 0.835, 95% CI: 0.710-0.993, P = .04) exhibited a protective causal effect on atrial fibrillation, while other microbiota had adverse causal effects. Similar effects were observed with respect to coronary artery disease, myocardial infarction, ischemic stroke, and hypertension. Furthermore, reverse Mendelian randomization analyses revealed that atrial fibrillation and ischemic stroke had causal effects on certain gut microbiota.

Conclusion: Our study underscored the importance of gut microbiota in the context of CVDs and lent support to the hypothesis that increasing the abundance of probiotics or decreasing the abundance of harmful bacterial populations may offer protection against specific CVDs. Nevertheless, further research is essential to translate these findings into clinical practice.
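
Inverse variance weighting, the first estimator named in the Methods, combines per-SNP effects into a single causal estimate that is then reported as an odds ratio with a confidence interval. The sketch below shows the common fixed-effect form under stated assumptions: the function names are hypothetical, the summary statistics are invented for illustration, and the abstract does not say whether the authors used a fixed-effect or random-effects variant.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted causal estimate and its standard error."""
    # Each SNP is weighted by beta_exposure^2 / se_outcome^2.
    total_weight = sum(bx * bx / (se * se)
                       for bx, se in zip(beta_exposure, se_outcome))
    numerator = sum(bx * by / (se * se)
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    return numerator / total_weight, math.sqrt(1.0 / total_weight)


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds estimate into an odds ratio with an approximate 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical SNP summary statistics, invented for illustration only.
bx = [0.1, 0.2]        # SNP effects on the exposure (a microbial taxon)
by = [-0.01, -0.02]    # SNP effects on the outcome (log odds of disease)
se = [0.01, 0.01]      # standard errors of the outcome effects

beta, beta_se = ivw_estimate(bx, by, se)
print(odds_ratio_with_ci(beta, beta_se))
```

An odds ratio below 1 whose interval excludes 1 is what the Results describe as a protective causal effect.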

Citations: 0
A network-based drug prioritization and combination analysis for the MEK5/ERK5 pathway in breast cancer.
IF 4.5 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology | Pub Date: 2024-02-21 | DOI: 10.1186/s13040-024-00357-1
Regan Odongo, Asuman Demiroglu-Zergeroglu, Tunahan Çakır

Background: Prioritizing candidate drugs based on genome-wide expression data is an emerging approach in systems pharmacology due to its holistic perspective for preclinical drug evaluation. In the current study, a network-based approach was proposed and applied to prioritize plant polyphenols and identify potential drug combinations in breast cancer. We focused on MEK5/ERK5 signalling pathway genes, a recently identified potential drug target in cancer with roles spanning major carcinogenesis processes.

Results: By constructing and identifying perturbed protein-protein interaction networks for luminal A breast cancer, plant polyphenols and drugs from transcriptome data, we first demonstrated their systemic effects on the MEK5/ERK5 signalling pathway. Subsequently, we applied a pathway-specific network pharmacology pipeline to prioritize plant polyphenols and potential drug combinations for use in breast cancer. Our analysis prioritized genistein among plant polyphenols. Drug combination simulations predicted several FDA-approved drugs in breast cancer with well-established pharmacology as candidates for target network synergistic combination with genistein. This study also highlights the concept of target network enhancer drugs, with drugs previously not well characterised in breast cancer being prioritized for use in the MEK5/ERK5 pathway in breast cancer.

Conclusion: This study proposes a computational framework for drug prioritization and combination with the MEK5/ERK5 signaling pathway in breast cancer. The method is flexible and provides the scientific community with a robust method that can be applied to other complex diseases.

Citations: 0
m1A-Ensem: accurate identification of 1-methyladenosine sites through ensemble models.
IF 4.5 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology | Pub Date: 2024-02-15 | DOI: 10.1186/s13040-023-00353-x
Muhammad Taseer Suleman, Fahad Alturise, Tamim Alkhalifah, Yaser Daanial Khan

Background: 1-methyladenosine (m1A) is a variant of methyladenosine that carries a methyl substituent at position 1 and has a prominent role in RNA stability and human metabolites.

Objective: Traditional approaches, such as mass spectrometry and site-directed mutagenesis, proved to be time-consuming and complicated.

Methodology: The present research focused on the identification of m1A sites within RNA sequences using novel feature development mechanisms. The obtained features were used to train the ensemble models, including blending, boosting, and bagging. Independent testing and k-fold cross validation were then performed on the trained ensemble models.

Results: The proposed model outperformed the preexisting predictors and revealed optimized scores based on major accuracy metrics.

Conclusion: For research purposes, a user-friendly webserver of the proposed model can be accessed through https://taseersuleman-m1a-ensem1.streamlit.app/.
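
The Methodology mentions k-fold cross validation of the trained ensemble models. As a rough sketch of the splitting step only, the generator below partitions sample indices into k contiguous folds. The function name is hypothetical, and a real evaluation would normally shuffle or stratify the samples first; the abstract does not describe how the authors built their folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every prediction used for scoring comes from a model that did not see that sample during training.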

Citations: 0
Revealing third-order interactions through the integration of machine learning and entropy methods in genomic studies
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-01-30, DOI: 10.1186/s13040-024-00355-3
Burcu Yaldız, Onur Erdoğan, Sevda Rafatov, Cem Iyigün, Yeşim Aydın Son
Non-linear relationships at the genotype level are essential to understanding the genetic interactions underlying complex disease traits. Genome-wide association studies (GWAS) have revealed statistical associations of SNPs with many complex diseases. Because GWAS results could not fully reveal the genetic background of these disorders, genome-wide interaction studies have started to gain importance. In recent years, various statistical approaches, such as entropy-based methods, have been suggested for revealing these non-additive interactions between variants. This study presents a novel prioritization workflow that integrates two-step Random Forest (RF) modeling and entropy analysis after PLINK filtering. The PLINK-RF-RF workflow is followed by an entropy-based three-way interaction information (3WII) method to capture the hidden patterns resulting from non-linear relationships between genotypes in Late-Onset Alzheimer Disease (LOAD) and to discover early and differential diagnosis markers. Three models are developed from different datasets by integrating PLINK-RF-RF analysis with the 3WII calculation, which enables the detection of third-order interactions that are not primarily considered in epistatic interaction studies. For all three datasets, a reduced SNP set is selected by PLINK filtering, SNP prioritization with RF-RF modeling, and 3WII analysis, which is promising as a model minimization approach. Among the SNPs revealed by 3WII, 4 of 19 from GenADA, 1 of 27 from ADNI, and 4 of 106 from NCRAD map to genes directly associated with Alzheimer Disease. Additionally, several SNPs are associated with other neurological disorders. The genes to which the variants map in all datasets are also significantly enriched in calcium ion binding, extracellular matrix, external encapsulating structure, and the "RUNX1 regulates estrogen receptor-mediated transcription" pathway. Therefore, these functional pathways are proposed for further examination for a possible LOAD association. In addition, all 3WII variants are proposed as candidate biomarkers for genotyping-based LOAD diagnosis. The entropy approach used in this study reveals complex genetic interactions that contribute significantly to LOAD risk. We used the entropy-based 3WII as a model minimization step and determined the significant three-way interactions between the SNPs prioritized by PLINK-RF-RF. This framework is a promising approach for disease association studies and can be adapted by integrating other machine learning and entropy-based interaction methods.
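The abstract does not spell out how 3WII is computed, so the following is only a minimal sketch of the standard entropy-based interaction information for three discrete variables, I(A;B|C) - I(A;B), expressed through joint entropies. The sign convention (positive values indicating synergy), the function names, and the toy genotype vectors are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Shannon entropy in bits of the joint distribution of the given columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((count / n) * log2(count / n) for count in Counter(joint).values())


def interaction_information(a, b, c):
    """Return I(A;B|C) - I(A;B), expanded into joint entropies.

    Assumed convention: a positive value means the three variables carry
    information jointly that no pair carries alone (synergy); a negative
    value means redundancy.
    """
    return (entropy(a, b) + entropy(a, c) + entropy(b, c)
            - entropy(a) - entropy(b) - entropy(c)
            - entropy(a, b, c))


# Hypothetical toy data: the third variable is the XOR of the first two,
# so no single pair is informative but the triple is fully determined.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
snp_c = [0, 1, 1, 0]
xor_score = interaction_information(snp_a, snp_b, snp_c)

# Three identical variables are purely redundant.
redundant_score = interaction_information(snp_b, snp_b, snp_b)
```

In a workflow like the one described, a score of this kind would be evaluated only on the SNPs that survive the PLINK and RF-RF steps, since the number of triplets grows cubically with the number of variants.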
Citations: 0
Antibody selection strategies and their impact in predicting clinical malaria based on multi-sera data.
IF 4, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-01-25, DOI: 10.1186/s13040-024-00354-4
André Fonseca, Mikolaj Spytek, Przemysław Biecek, Clara Cordeiro, Nuno Sepúlveda

Background: The chance of discovering the best antibody candidates for predicting clinical malaria has notably increased due to the availability of multi-sera data. The analysis of these data is typically divided into a feature selection phase followed by a predictive phase in which several models are constructed to predict the outcome of interest. A key question in the analysis is which antibodies should be included in the predictive stage and whether they should be included on the original or a transformed scale (i.e. binary/dichotomized).

Methods: To answer this question, we developed three approaches for antibody selection in the context of predicting clinical malaria: (i) a basic and simple approach based on selecting antibodies via the nonparametric Mann-Whitney-Wilcoxon test; (ii) an optimal dichotomization approach where each antibody was selected according to the optimal cut-off via maximization of the chi-squared (χ2) statistic for two-way tables; (iii) a hybrid parametric/non-parametric approach that integrates Box-Cox transformation followed by a t-test, together with the use of finite mixture models and the Mann-Whitney-Wilcoxon test as a last resort. We illustrated the application of these three approaches with published serological data of 36 Plasmodium falciparum antigens for predicting clinical malaria in 121 Kenyan children. The predictive analysis was based on a Super Learner where predictions from multiple classifiers including the Random Forest were pooled together.

Results: The three approaches yielded broadly similar areas under the Receiver Operating Characteristic curve of 0.72 (95% CI = [0.62, 0.82]), 0.80 (95% CI = [0.71, 0.89]), and 0.79 (95% CI = [0.70, 0.88]) for the simple, dichotomization and hybrid approaches, respectively. These approaches were based on 6, 20, and 16 antibodies, respectively.

Conclusions: The three feature selection strategies provided a better predictive performance of the outcome when compared to the previous results relying on Random Forest including all the 36 antibodies (AUC = 0.68, 95% CI = [0.57, 0.79]). Given the similar predictive performance, we recommend that the three strategies be used in conjunction on the same data set and selected according to their complexity.
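Approach (ii) in the Methods can be illustrated with a small sketch: for one antibody, try each candidate cut-off, build the 2x2 table of dichotomized antibody level against clinical outcome, and keep the cut-off with the largest Pearson chi-squared statistic. The candidate grid (midpoints between sorted unique values), the absence of a continuity correction, and the toy data are assumptions made here; the abstract does not state these details.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def best_cutoff(values, outcomes):
    """Return (cutoff, chi2) maximizing chi-squared over candidate cut-offs.

    Candidates are midpoints between consecutive sorted unique values
    (an assumed grid); outcomes are coded 1 for cases and 0 for controls.
    """
    levels = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(levels, levels[1:]):
        cut = (low + high) / 2
        pairs = list(zip(values, outcomes))
        a = sum(1 for v, y in pairs if v > cut and y == 1)   # high level, case
        b = sum(1 for v, y in pairs if v > cut and y == 0)   # high level, control
        c = sum(1 for v, y in pairs if v <= cut and y == 1)  # low level, case
        d = sum(1 for v, y in pairs if v <= cut and y == 0)  # low level, control
        stat = chi2_2x2(a, b, c, d)
        if stat > best[1]:
            best = (cut, stat)
    return best


# Hypothetical antibody levels and outcomes, perfectly separated at 3.5.
levels = [1, 2, 3, 4, 5, 6]
malaria = [0, 0, 0, 1, 1, 1]
cutoff, statistic = best_cutoff(levels, malaria)
```

Because the statistic is maximized over many cut-offs, its nominal p-value is optimistic, so any significance threshold applied to it would need to account for the search.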

Citations: 0
Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-01-05, DOI: 10.1186/s13040-023-00352-y
Chih-Wei Chung, Seng-Cho Chou, Tzu-Hung Hsiao, Grace Joyce Zhang, Yu-Fang Chung, Yi-Ming Chen

Background: Although the 2019 EULAR/ACR classification criteria for systemic lupus erythematosus (SLE) require at least one positive anti-nuclear antibody (ANA) titer (≥ 1:80), it remains challenging for clinicians to identify patients with SLE. This study aimed to develop a machine learning (ML) approach to assist in the detection of SLE patients using genomic data and electronic health records.

Methods: Participants with a positive ANA (≥ 1:80) were enrolled from the Taiwan Precision Medicine Initiative cohort. The Taiwan Biobank version 2 array was used to detect single nucleotide polymorphism (SNP) data. Six ML models, Logistic Regression, Random Forest (RF), Support Vector Machine, Light Gradient Boosting Machine, Gradient Tree Boosting, and Extreme Gradient Boosting (XGB), were used to identify SLE patients. The importance of the clinical and genetic features was determined by Shapley Additive Explanation (SHAP) values. A logistic regression model was applied to identify genetic variations associated with SLE in the subset of patients with an ANA equal to or exceeding 1:640.

Results: A total of 946 SLE patients and 1,892 non-SLE controls were included in this analysis. Among the six ML models, RF and XGB demonstrated superior performance in differentiating SLE from non-SLE. The leading features in the SHAP diagram were anti-double-stranded DNA antibodies, ANA titers, AC4 ANA pattern, polygenic risk scores, complement levels, and SNPs. Additionally, in the subgroup with a high ANA titer (≥ 1:640), six SNPs positively associated with SLE and five SNPs negatively correlated with SLE were discovered.

Conclusions: ML approaches offer the potential to assist in diagnosing SLE and uncovering novel SNPs in a group of patients with autoimmunity.

Citations: 0
Optimizing age-related hearing risk predictions: an advanced machine learning integration with HHIE-S
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-12-14, DOI: 10.1186/s13040-023-00351-z
Tzong-Hann Yang, Yu-Fu Chen, Yen-Fu Cheng, Jue-Ni Huang, Chuan-Song Wu, Yuan-Chia Chu
Age-related hearing loss (ARHL) disproportionately affects the elderly. Although the Hearing Handicap Inventory for the Elderly Screening version (HHIE-S) is a well-known tool for ARHL evaluation, it has traditionally been used only for direct screening based on self-reported outcomes. This work uses a novel integration of machine learning approaches to improve the predictive accuracy of the HHIE-S tool for ARHL in older adults. We used a dataset gathered between 2016 and 2018 that included 1,526 senior citizens from several Taipei City Hospital branches. 80% of the data were used for training (n = 1220) and 20% for testing (n = 356). Machine learning models including XGBoost, Gradient Boosting, and LightGBM were fitted and assessed on the training set only. To prevent data leakage and overfitting, the Light Gradient Boosting Machine (LGBM) model, which had the highest AUC of 0.83 (95% CI 0.81–0.85), was only then applied to the holdout testing data. On the testing set, the LGBM model showed a strong AUC of 0.82 (95% CI 0.79–0.86), far outperforming conventional techniques. Notably, several HHIE-S items and age were found to be important features. In contrast to traditional HHIE research, which concentrates on the psychological effects of hearing loss, this study combines machine learning techniques (specifically, the LGBM classifier) with the HHIE-S tool. The incorporation of SHAP values enhances the interpretability of the model's predictions and provides a fuller understanding of the importance of each feature. Our methodology highlights the potential of combining machine learning with validated hearing evaluation instruments such as the HHIE-S. With this integration, healthcare practitioners can anticipate ARHL more accurately, which makes it easier to intervene quickly and precisely.
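The AUC values with 95% confidence intervals reported above can be illustrated with a short sketch. The abstract does not say how the intervals were obtained, so the percentile bootstrap used here is an assumption, as are the function names and the toy scores; the AUC itself is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
import random


def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical held-out predictions: model scores and true labels (1 = hearing loss).
test_scores = [0.1, 0.4, 0.35, 0.8]
test_labels = [0, 0, 1, 1]
point_estimate = auc(test_scores, test_labels)
ci_low, ci_high = bootstrap_ci(test_scores, test_labels, n_boot=500)
```

Keeping the test split untouched until a single model has been chosen, as the abstract describes, is what allows the held-out AUC to be read as an estimate of performance on new patients.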
Citations: 0