
Biodata Mining: latest articles

Interaction models matter: an efficient, flexible computational framework for model-specific investigation of epistasis.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-02-28 | DOI: 10.1186/s13040-024-00358-0
Sandra Batista, Vered Senderovich Madar, Philip J Freda, Priyanka Bhandary, Attri Ghosh, Nicholas Matsumoto, Apurva S Chitre, Abraham A Palmer, Jason H Moore

Purpose: Epistasis, the interaction between two or more genes, is integral to the study of genetics and is present throughout nature. Yet, it is seldom fully explored as most approaches primarily focus on single-locus effects, partly because analyzing all pairwise and higher-order interactions requires significant computational resources. Furthermore, existing methods for epistasis detection only consider a Cartesian (multiplicative) model for interaction terms. This is likely limiting as epistatic interactions can evolve to produce varied relationships between genetic loci, some complex and not linearly separable.

Methods: We present new algorithms for the interaction coefficients for standard regression models for epistasis that permit many varied models for the interaction terms for loci and efficient memory usage. The algorithms are given for two-way and three-way epistasis and may be generalized to higher-order epistasis. Statistical tests for the interaction coefficients are also provided. We also present an efficient matrix-based algorithm for permutation testing for two-way epistasis. We offer a proof and experimental evidence that methods that look for epistasis only at loci that have main effects may not be justified. Given the computational efficiency of the algorithm, we applied the method to a rat data set and mouse data set, with at least 10,000 loci and 1,000 samples each, using the standard Cartesian model and the XOR model to explore body mass index.

Results: This study reveals that although many of the loci found to exhibit significant statistical epistasis overlap between models in rats, the pairs are mostly distinct. Further, the XOR model found greater evidence for statistical epistasis in many more pairs of loci in both data sets with almost all significant epistasis in mice identified using XOR. In the rat data set, loci involved in epistasis under the XOR model are enriched for biologically relevant pathways.

Conclusion: Our results in both species show that many biologically relevant epistatic relationships would have been undetected if only one interaction model was applied, providing evidence that varied interaction models should be implemented to explore epistatic interactions that occur in living systems.
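
The contrast between the Cartesian and XOR interaction models named in the Methods can be illustrated with a toy regression. The sketch below is not the authors' algorithm. It assumes genotypes coded 0/1/2, uses a parity-style XOR encoding `(g1 % 2) ^ (g2 % 2)` as one plausible reading of "XOR model", and fits ordinary least squares through the normal equations. The helper names (`cartesian`, `xor_term`, `fit_interaction`) and the noise-free phenotype are hypothetical.

```python
from itertools import product


def cartesian(g1, g2):
    """Cartesian (multiplicative) interaction term."""
    return g1 * g2


def xor_term(g1, g2):
    """Assumed XOR encoding: 1 when exactly one locus is heterozygous."""
    return (g1 % 2) ^ (g2 % 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_interaction(g1, g2, y, interaction):
    """Fit y = b0 + b1*g1 + b2*g2 + b3*interaction(g1, g2) by least squares.

    Returns the coefficient list [b0, b1, b2, b3] and the residual sum of squares.
    """
    X = [[1.0, a, b, interaction(a, b)] for a, b in zip(g1, g2)]
    p = 4
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    rss = sum(
        (yi - sum(bk * xk for bk, xk in zip(beta, r))) ** 2
        for r, yi in zip(X, y)
    )
    return beta, rss


# Toy data: every genotype combination once, phenotype driven purely by XOR.
pairs = list(product(range(3), repeat=2))
g1 = [a for a, _ in pairs]
g2 = [b for _, b in pairs]
y = [float(xor_term(a, b)) for a, b in pairs]

beta_cart, rss_cart = fit_interaction(g1, g2, y, cartesian)
beta_xor, rss_xor = fit_interaction(g1, g2, y, xor_term)

print("Cartesian interaction coefficient:", round(beta_cart[3], 6))
print("XOR interaction coefficient:", round(beta_xor[3], 6))
```

On this toy grid the Cartesian interaction coefficient is essentially zero while the XOR coefficient is recovered, which mirrors the abstract's point that a single interaction model can miss relationships that another model detects.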

Citations: 0
Assessment of the causal relationship between gut microbiota and cardiovascular diseases: a bidirectional Mendelian randomization analysis.
IF 4.5 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-02-26 | DOI: 10.1186/s13040-024-00356-2
Xiao-Ce Dai, Yi Yu, Si-Yu Zhou, Shuo Yu, Mei-Xiang Xiang, Hong Ma

Background: Previous studies have shown an association between gut microbiota and cardiovascular diseases (CVDs). However, the underlying causal relationship remains unclear. This study aims to elucidate the causal relationship between gut microbiota and CVDs and to explore the pathogenic role of gut microbiota in CVDs.

Methods: In this two-sample Mendelian randomization study, we used genetic instruments from publicly available genome-wide association studies, including single-nucleotide polymorphisms (SNPs) associated with gut microbiota (n = 14,306) and CVDs (n = 2,207,591). We employed multiple statistical analysis methods, including inverse variance weighting, MR-Egger, weighted median, MR pleiotropy residual sum and outlier (MR-PRESSO), and the leave-one-out method, to estimate the causal relationship between gut microbiota and CVDs. Additionally, we conducted multiple analyses to assess horizontal pleiotropy and heterogeneity.

Results: GWAS summary data were available from a pooled sample of 2,221,897 adult and adolescent participants. Our findings indicated that specific gut microbiota had either protective or detrimental effects on CVDs. Notably, Howardella (OR = 0.955, 95% CI: 0.913-0.999, P = .05), Intestinibacter (OR = 0.908, 95% CI: 0.831-0.993, P = .03), Lachnospiraceae (NK4A136 group) (OR = 0.904, 95% CI: 0.841-0.973, P = .007), Turicibacter (OR = 0.904, 95% CI: 0.838-0.976, P = .01), Holdemania (OR = 0.898, 95% CI: 0.810-0.995, P = .04) and Odoribacter (OR = 0.835, 95% CI: 0.710-0.993, P = .04) exhibited a protective causal effect on atrial fibrillation, while other microbiota had adverse causal effects. Similar effects were observed with respect to coronary artery disease, myocardial infarction, ischemic stroke, and hypertension. Furthermore, reverse Mendelian randomization analyses revealed that atrial fibrillation and ischemic stroke had causal effects on certain gut microbiota.

Conclusion: Our study underscored the importance of gut microbiota in the context of CVDs and lent support to the hypothesis that increasing the abundance of probiotics or decreasing the abundance of harmful bacterial populations may offer protection against specific CVDs. Nevertheless, further research is essential to translate these findings into clinical practice.
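
Of the estimators listed in the Methods, inverse variance weighting is the simplest to write down. The sketch below shows a fixed-effect IVW estimate built from per-SNP Wald ratios, with first-order weights. It is an illustration under stated assumptions, not the study's pipeline: the function name `ivw_estimate` and the three SNP effect sizes are hypothetical, and the numbers are not taken from the paper.

```python
from math import exp, sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted causal estimate.

    Each SNP contributes a Wald ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2 (first-order weights).
    Returns the pooled log-odds estimate and its standard error.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    estimate = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    std_err = sqrt(1.0 / sum(weights))
    return estimate, std_err


# Hypothetical summary statistics for three instruments (not from the paper).
bx = [0.10, 0.20, 0.15]          # SNP effect on the exposure (e.g. a taxon's abundance)
by = [-0.005, -0.010, -0.0075]   # SNP effect on the outcome (log odds)
se = [0.010, 0.020, 0.015]       # standard error of the outcome effect

estimate, std_err = ivw_estimate(bx, by, se)
odds_ratio = exp(estimate)
ci_low = exp(estimate - 1.96 * std_err)
ci_high = exp(estimate + 1.96 * std_err)

print(f"OR = {odds_ratio:.3f}, 95% CI: {ci_low:.3f}-{ci_high:.3f}")
```

An odds ratio below 1 in this form corresponds to what the abstract calls a protective causal effect. MR-Egger, the weighted median, and MR-PRESSO relax different assumptions about pleiotropy and are not covered by this sketch.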

Citations: 0
A network-based drug prioritization and combination analysis for the MEK5/ERK5 pathway in breast cancer.
IF 4.5 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-02-21 | DOI: 10.1186/s13040-024-00357-1
Regan Odongo, Asuman Demiroglu-Zergeroglu, Tunahan Çakır

Background: Prioritizing candidate drugs based on genome-wide expression data is an emerging approach in systems pharmacology due to its holistic perspective for preclinical drug evaluation. In the current study, a network-based approach was proposed and applied to prioritize plant polyphenols and identify potential drug combinations in breast cancer. We focused on MEK5/ERK5 signalling pathway genes, a recently identified potential drug target in cancer with roles spanning major carcinogenesis processes.

Results: By constructing and identifying perturbed protein-protein interaction networks for luminal A breast cancer, plant polyphenols and drugs from transcriptome data, we first demonstrated their systemic effects on the MEK5/ERK5 signalling pathway. Subsequently, we applied a pathway-specific network pharmacology pipeline to prioritize plant polyphenols and potential drug combinations for use in breast cancer. Our analysis prioritized genistein among plant polyphenols. Drug combination simulations predicted several FDA-approved drugs in breast cancer with well-established pharmacology as candidates for target network synergistic combination with genistein. This study also highlights the concept of target network enhancer drugs, with drugs previously not well characterised in breast cancer being prioritized for use in the MEK5/ERK5 pathway in breast cancer.

Conclusion: This study proposes a computational framework for drug prioritization and combination with the MEK5/ERK5 signaling pathway in breast cancer. The method is flexible and provides the scientific community with a robust method that can be applied to other complex diseases.

Citations: 0
m1A-Ensem: accurate identification of 1-methyladenosine sites through ensemble models.
IF 4.5 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-02-15 | DOI: 10.1186/s13040-023-00353-x
Muhammad Taseer Suleman, Fahad Alturise, Tamim Alkhalifah, Yaser Daanial Khan

Background: 1-methyladenosine (m1A) is a variant of methyladenosine that carries a methyl substituent at the 1st position and has a prominent role in RNA stability and human metabolites.

Objective: Traditional approaches, such as mass spectrometry and site-directed mutagenesis, proved to be time-consuming and complicated.

Methodology: The present research focused on the identification of m1A sites within RNA sequences using novel feature development mechanisms. The obtained features were used to train the ensemble models, including blending, boosting, and bagging. Independent testing and k-fold cross validation were then performed on the trained ensemble models.

Results: The proposed model outperformed the preexisting predictors and revealed optimized scores based on major accuracy metrics.

Conclusion: For research purposes, a user-friendly webserver of the proposed model can be accessed at https://taseersuleman-m1a-ensem1.streamlit.app/.

Citations: 0
Revealing third-order interactions through the integration of machine learning and entropy methods in genomic studies
IF 4.5 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-01-30 | DOI: 10.1186/s13040-024-00355-3
Burcu Yaldız, Onur Erdoğan, Sevda Rafatov, Cem Iyigün, Yeşim Aydın Son
Non-linear relationships at the genotype level are essential in understanding the genetic interactions of complex disease traits. Genome-wide association studies (GWAS) have revealed statistical associations of SNPs in many complex diseases. As GWAS results could not thoroughly reveal the genetic background of these disorders, genome-wide interaction studies have started to gain importance. In recent years, various statistical approaches, such as entropy-based methods, have been suggested for revealing these non-additive interactions between variants. This study presents a novel prioritization workflow integrating two-step Random Forest (RF) modeling and entropy analysis after PLINK filtering. The PLINK-RF-RF workflow is followed by an entropy-based 3-way interaction information (3WII) method to capture the hidden patterns resulting from non-linear relationships between genotypes in Late-Onset Alzheimer Disease (LOAD) to discover early and differential diagnosis markers. Three models from different datasets are developed by integrating PLINK-RF-RF analysis and the entropy-based 3WII calculation method, which enables the detection of third-order interactions, which are not primarily considered in epistatic interaction studies. A reduced SNP set is selected for all three datasets by 3WII analysis after PLINK filtering and prioritization of SNPs with RF-RF modeling, which is promising as a model minimization approach. Among SNPs revealed by 3WII, 4 SNPs out of 19 from GenADA, 1 SNP out of 27 from ADNI, and 4 SNPs out of 106 from NCRAD are mapped to genes directly associated with Alzheimer Disease. Additionally, several SNPs are associated with other neurological disorders. Also, the genes the variants mapped to in all datasets are significantly enriched in calcium ion binding, extracellular matrix, external encapsulating structure, and RUNX1 regulates estrogen receptor-mediated transcription pathways. 
Therefore, these functional pathways are proposed for further examination for a possible LOAD association. Besides, all 3WII variants are proposed as candidate biomarkers for the genotyping-based LOAD diagnosis. The entropy approach performed in this study reveals the complex genetic interactions that significantly contribute to LOAD risk. We benefited from the entropy-based 3WII as a model minimization step and determined the significant 3-way interactions between the prioritized SNPs by PLINK-RF-RF. This framework is a promising approach for disease association studies, which can also be modified by integrating other machine learning and entropy-based interaction methods.
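
The entropy-based three-way interaction information mentioned in the abstract can be sketched for three discrete variables as follows. Several sign conventions exist in the literature; this sketch assumes the one in which a positive value indicates synergy, and whether the study uses exactly this convention, or which variables (SNPs, phenotype) enter the calculation, is an assumption. The function names `entropy` and `interaction_information` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def interaction_information(a, b, c):
    """Three-way interaction information, positive for synergy.

    Computed as H(A,B) + H(A,C) + H(B,C) - H(A) - H(B) - H(C) - H(A,B,C),
    which equals I(A,B;C) - I(A;C) - I(B;C).
    """
    return (
        entropy(a, b) + entropy(a, c) + entropy(b, c)
        - entropy(a) - entropy(b) - entropy(c)
        - entropy(a, b, c)
    )


# Synergistic toy case: c is the XOR of a and b, so neither a nor b alone
# carries information about c, but together they determine it.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]
ii_xor = interaction_information(a, b, c)

# Redundant toy case: all three variables are copies of each other.
ii_redundant = interaction_information(a, a, a)

print("XOR (synergy):", ii_xor)
print("Copies (redundancy):", ii_redundant)
```

Under this convention the XOR triple scores one bit of synergy and the redundant triple scores minus one bit, which is the kind of non-additive pattern that single-variant or purely additive analyses would not flag.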
Citations: 0
Antibody selection strategies and their impact in predicting clinical malaria based on multi-sera data.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-01-25 | DOI: 10.1186/s13040-024-00354-4
André Fonseca, Mikolaj Spytek, Przemysław Biecek, Clara Cordeiro, Nuno Sepúlveda

Background: Nowadays, the chance of discovering the best antibody candidates for predicting clinical malaria has notably increased due to the availability of multi-sera data. The analysis of these data is typically divided into a feature selection phase followed by a predictive one where several models are constructed for predicting the outcome of interest. A key question in the analysis is to determine which antibodies should be included in the predictive stage and whether they should be included in the original or a transformed scale (i.e. binary/dichotomized).

Methods: To answer this question, we developed three approaches for antibody selection in the context of predicting clinical malaria: (i) a basic and simple approach based on selecting antibodies via the nonparametric Mann-Whitney-Wilcoxon test; (ii) an optimal dichotomization approach where each antibody was selected according to the optimal cut-off via maximization of the chi-squared (χ²) statistic for two-way tables; (iii) a hybrid parametric/non-parametric approach that integrates Box-Cox transformation followed by a t-test, together with the use of finite mixture models and the Mann-Whitney-Wilcoxon test as a last resort. We illustrated the application of these three approaches with published serological data of 36 Plasmodium falciparum antigens for predicting clinical malaria in 121 Kenyan children. The predictive analysis was based on a Super Learner where predictions from multiple classifiers including the Random Forest were pooled together.

Results: Our results led to almost similar areas under the Receiver Operating Characteristic curves of 0.72 (95% CI = [0.62, 0.82]), 0.80 (95% CI = [0.71, 0.89]), 0.79 (95% CI = [0.7, 0.88]) for the simple, dichotomization and hybrid approaches, respectively. These approaches were based on 6, 20, and 16 antibodies, respectively.

Conclusions: The three feature selection strategies provided a better predictive performance of the outcome when compared to the previous results relying on Random Forest including all 36 antibodies (AUC = 0.68, 95% CI = [0.57, 0.79]). Given the similar predictive performance, we recommend that the three strategies be used in conjunction on the same data set and selected according to their complexity.
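
Approach (ii) in the Methods, choosing a cut-off by maximizing the chi-squared statistic of a two-way table, can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the Pearson statistic without continuity correction, takes candidate cut-offs at midpoints between adjacent distinct values, and uses hypothetical antibody levels and names (`chi2_2x2`, `best_cutoff`).

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def best_cutoff(values, outcome):
    """Return (statistic, cutoff) maximizing chi-squared over candidate cut-offs.

    values  : antibody measurements, one per individual
    outcome : 1 for clinical malaria, 0 otherwise
    """
    uniq = sorted(set(values))
    best_stat, best_cut = 0.0, None
    for lo, hi in zip(uniq, uniq[1:]):
        cut = (lo + hi) / 2
        a = sum(1 for v, y in zip(values, outcome) if v > cut and y == 1)
        b = sum(1 for v, y in zip(values, outcome) if v > cut and y == 0)
        c = sum(1 for v, y in zip(values, outcome) if v <= cut and y == 1)
        d = sum(1 for v, y in zip(values, outcome) if v <= cut and y == 0)
        stat = chi2_2x2(a, b, c, d)
        if stat > best_stat:
            best_stat, best_cut = stat, cut
    return best_stat, best_cut


# Hypothetical antibody levels: low responders develop clinical malaria.
levels = [0.2, 0.4, 0.5, 0.9, 1.1, 1.3, 1.6, 2.0]
malaria = [1, 1, 1, 1, 0, 0, 0, 0]

stat, cut = best_cutoff(levels, malaria)
print(f"best cut-off = {cut}, chi-squared = {stat}")
```

Because the cut-off is chosen to maximize the statistic, the resulting value is optimistically biased and should not be compared directly against the usual one-degree-of-freedom chi-squared reference distribution; how the study accounts for this is not stated in the abstract.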

Citations: 0
Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-01-05 DOI: 10.1186/s13040-023-00352-y
Chih-Wei Chung, Seng-Cho Chou, Tzu-Hung Hsiao, Grace Joyce Zhang, Yu-Fang Chung, Yi-Ming Chen

Background: Although the 2019 EULAR/ACR classification criteria for systemic lupus erythematosus (SLE) require at least a positive anti-nuclear antibody (ANA) titer (≥ 1:80), it remains challenging for clinicians to identify patients with SLE. This study aimed to develop a machine learning (ML) approach to assist in the detection of SLE patients using genomic data and electronic health records.

Methods: Participants with a positive ANA (≥ 1:80) were enrolled from the Taiwan Precision Medicine Initiative cohort. The Taiwan Biobank version 2 array was used to detect single nucleotide polymorphism (SNP) data. Six ML models, Logistic Regression, Random Forest (RF), Support Vector Machine, Light Gradient Boosting Machine, Gradient Tree Boosting, and Extreme Gradient Boosting (XGB), were used to identify SLE patients. The importance of the clinical and genetic features was determined by Shapley Additive Explanation (SHAP) values. A logistic regression model was applied to identify genetic variations associated with SLE in the subset of patients with an ANA equal to or exceeding 1:640.

Results: A total of 946 SLE and 1,892 non-SLE controls were included in this analysis. Among the six ML models, RF and XGB demonstrated superior performance in the differentiation of SLE from non-SLE. The leading features in the SHAP diagram were anti-double strand DNA antibodies, ANA titers, AC4 ANA pattern, polygenic risk scores, complement levels, and SNPs. Additionally, in the subgroup with a high ANA titer (≥ 1:640), six SNPs positively associated with SLE and five SNPs negatively correlated with SLE were discovered.

Conclusions: ML approaches offer the potential to assist in diagnosing SLE and uncovering novel SNPs in a group of patients with autoimmunity.
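The feature importances mentioned in the Methods are SHAP values, which are built on Shapley values from cooperative game theory. The sketch below computes exact Shapley values for a tiny hypothetical model by enumerating feature subsets, with absent features replaced by a baseline value. It is an illustration of the underlying definition only: the model, feature names, and baseline are made up, and practical SHAP implementations use faster, model-specific approximations rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their observed value, the rest use the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                phi[i] += weight * (value(set(s) | {i}) - value(set(s)))
    return phi


# Hypothetical risk score over three made-up features (not the study's model).
def toy_model(z):
    anti_dsdna, ana_titer, prs = z
    return 2.0 * anti_dsdna + 1.0 * ana_titer + 0.5 * anti_dsdna * prs


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features involved.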

Citations: 0
Optimizing age-related hearing risk predictions: an advanced machine learning integration with HHIE-S
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-12-14 DOI: 10.1186/s13040-023-00351-z
Tzong-Hann Yang, Yu-Fu Chen, Yen-Fu Cheng, Jue-Ni Huang, Chuan-Song Wu, Yuan-Chia Chu
The elderly are disproportionately affected by age-related hearing loss (ARHL). Although the Hearing Handicap Inventory for the Elderly Screening version (HHIE-S) is a well-known tool for ARHL evaluation, it has traditionally been used only for direct screening based on self-reported outcomes. This work uses a novel integration of machine learning approaches to improve the predictive accuracy of the HHIE-S tool for ARHL in older adults. We employed a dataset gathered between 2016 and 2018 that included 1,526 senior citizens from several Taipei City Hospital branches. 80% of the data were used for training (n = 1220) and 20% were used for testing (n = 356). Several machine learning models, including XGBoost, Gradient Boosting, and LightGBM, were fitted and assessed on the training set only. To prevent data leakage and overfitting, the Light Gradient Boosting Machine (LGBM) model, which had the greatest AUC of 0.83 (95% CI 0.81–0.85), was then applied to the holdout testing data only. On the testing set, the LGBM model showed a strong AUC of 0.82 (95% CI 0.79–0.86), far outperforming conventional techniques. Notably, several HHIE-S items and age were found to be significant features. In contrast to traditional HHIE research, which concentrates on the psychological effects of hearing loss, this study combines cutting-edge machine learning techniques, specifically the LGBM classifier, with the HHIE-S tool. The incorporation of SHAP values enhances the interpretability of the model's predictions and provides a fuller understanding of the importance of individual features. Our methodology highlights the great potential that arises from combining machine learning with validated hearing evaluation instruments such as the HHIE-S. Healthcare practitioners can anticipate ARHL more accurately thanks to this integration, which makes it easier to intervene quickly and precisely.
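The AUC reported on the holdout set can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition on toy data. The labels and scores are hypothetical, and the study's own evaluation code and confidence-interval method are not described in the abstract.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels (1 = hearing loss) and model scores.
y_test = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_test, scores)
print(round(auc, 3))
```

This pairwise form is quadratic in the sample size; it is meant to show the definition, and rank-based formulas give the same value more efficiently.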
Citations: 0
6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-11-27 DOI: 10.1186/s13040-023-00348-8
Guohua Huang, Xiaohong Huang, Wei Luo

DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulatory role in cellular processes. Precisely recognizing 6mA sites is important for further exploring its biological functions. Although many computational methods for 6mA site prediction have been developed over the past decades, there is still much room for improvement. We presented a cross-validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. 6mA-StackingCV is a type of meta-learning algorithm that uses the output of cross validation as input to the final classifier. It reached state-of-the-art performance in the Rosaceae independent test. Extensive tests demonstrated the stability and flexibility of 6mA-StackingCV. We implemented 6mA-StackingCV as a user-friendly web application, which allows one to restrictively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/ . The source code and experimental data are available at https://github.com/Xiaohong-source/6mA-stackingCV .
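The idea of feeding cross-validation output into a final classifier can be sketched as follows. Each base learner produces out-of-fold predictions, so every training row is scored by a model that never saw it, and the final classifier is trained on those predictions. Everything here is an assumption for illustration: the nearest-centroid learner, the two feature "views" standing in for sequence representations, and the toy data are not the learners, encodings, or data used by 6mA-StackingCV.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    for fold in folds:
        held = set(fold)
        yield [i for i in range(n) if i not in held], fold


def fit_centroid(X, y):
    """Nearest-centroid classifier; assumes both classes are present in (X, y)."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu1 = mean([x for x, t in zip(X, y) if t == 1])
    mu0 = mean([x for x, t in zip(X, y) if t == 0])

    def predict(x):
        d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))
        d0 = sum((a - b) ** 2 for a, b in zip(x, mu0))
        return 1.0 if d1 < d0 else 0.0
    return predict


def stack_fit(X, y, views, k=3):
    """Train base learners per view, then a meta-classifier on out-of-fold predictions."""
    n = len(X)
    oof = [[0.0] * len(views) for _ in range(n)]
    for train_idx, val_idx in kfold_indices(n, k):
        for j, view in enumerate(views):
            model = fit_centroid([view(X[i]) for i in train_idx],
                                 [y[i] for i in train_idx])
            for i in val_idx:
                oof[i][j] = model(view(X[i]))  # row i was not used to fit this model
    meta = fit_centroid(oof, y)  # final classifier sees only cross-validated outputs
    bases = [fit_centroid([view(x) for x in X], y) for view in views]  # refit on all rows

    def predict(x):
        return meta([base(view(x)) for base, view in zip(bases, views)])
    return predict, oof


# Toy data: two hypothetical feature views of six samples, labels interleaved.
X = [[0.1, 5.0], [0.9, 1.0], [0.2, 4.0], [0.8, 2.0], [0.0, 4.5], [1.0, 1.5]]
y = [0, 1, 0, 1, 0, 1]
views = [lambda row: [row[0]], lambda row: [row[1]]]
predict, oof = stack_fit(X, y, views, k=3)
print(predict([0.85, 1.2]), predict([0.15, 4.8]))
```

Using out-of-fold rather than in-sample base predictions keeps the meta-classifier from learning on outputs that the base learners have already overfitted.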

Citations: 0
Endoscopy-based IBD identification by a quantized deep learning pipeline.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-11-25 DOI: 10.1186/s13040-023-00350-0
Massimiliano Datres, Elisa Paolazzi, Marco Chierici, Matteo Pozzi, Antonio Colangelo, Marcello Dorian Donzella, Giuseppe Jurman

Background: Discrimination between patients affected by inflammatory bowel diseases and healthy controls on the basis of endoscopic imaging is a challenging problem for machine learning models. This task is used here as the testbed for a novel deep learning classification pipeline, powered by a set of solutions enhancing characterising elements such as reproducibility, interpretability, reduced computational workload, bias-free modeling and careful image preprocessing.

Results: First, an automatic preprocessing procedure is devised to remove artifacts from clinical data; the resulting images are then fed to an aggregated per-patient model that mimics the clinicians' decision process. The predictions are based on multiple snapshots obtained through resampling, and the risk of misleading outcomes is reduced by removing low-confidence predictions. Each patient's outcome is explained by returning the images the prediction is based upon, supporting clinicians in verifying diagnoses without the need to evaluate the full set of endoscopic images. As a major theoretical contribution, quantization is employed to reduce the complexity and the computational cost of the model, allowing its deployment on low-power devices with an almost negligible 3% performance degradation. This quantization procedure is relevant not only in the context of per-patient models but also for assessing the feasibility of providing real-time support to clinicians even in low-resource environments. The pipeline is demonstrated on a private dataset of endoscopic images of 758 IBD patients and 601 healthy controls, achieving a Matthews Correlation Coefficient of 0.9 as top performance on the test set.
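The per-patient aggregation and the reported metric can be illustrated with a small sketch. It drops low-confidence image predictions, averages the rest into a patient-level label, returns the indices of the images used (so they can be shown to a clinician), and computes the Matthews Correlation Coefficient from a confusion matrix. The 0.8 confidence threshold, the mean-probability rule, and all data are assumptions made for illustration; the abstract does not specify how the pipeline aggregates or filters predictions.

```python
from math import sqrt


def patient_prediction(image_probs, min_confidence=0.8):
    """Aggregate per-image P(IBD) into one label, ignoring low-confidence images.

    Returns (label, indices_used); label is None if no image is confident enough.
    """
    used = [i for i, p in enumerate(image_probs)
            if max(p, 1.0 - p) >= min_confidence]
    if not used:
        return None, used
    mean_p = sum(image_probs[i] for i in used) / len(used)
    return (1 if mean_p >= 0.5 else 0), used


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical per-image probabilities for one patient.
label, used = patient_prediction([0.95, 0.55, 0.9, 0.1])
print(label, used)

# Hypothetical patient-level labels and predictions.
score = mcc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0])
print(round(score, 4))
```

MCC uses all four cells of the confusion matrix, which makes it less sensitive than accuracy to the imbalance between the 758 patients and 601 controls.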

Conclusion: We highlighted how a comprehensive pre-processing pipeline plays a crucial role in identifying and removing artifacts from data, solving one of the principal challenges encountered when working with clinical data. Furthermore, we constructively showed how it is possible to emulate the clinicians' decision process and how doing so offers significant advantages, particularly in terms of explainability and trust within the healthcare context. Last but not least, we showed that quantization can be a useful tool to reduce time and resource consumption with an acceptable degradation of model performance. The quantization study proposed in this work points to the potential development of real-time quantized algorithms as valuable tools to support clinicians during endoscopy procedures.
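As a rough illustration of what quantization does to a model's parameters, the sketch below maps floating-point weights to 8-bit integers with an affine scale and zero point, then maps them back to show the rounding error. This is a generic post-training quantization scheme chosen for illustration; the abstract does not say which quantization method, bit width, or framework the pipeline uses, and the weights are made up.

```python
def quantize(weights, num_bits=8):
    """Affine quantization of floats to unsigned ints; returns (q, scale, zero_point)."""
    qmax = 2 ** num_bits - 1
    # The representable range is widened to include zero so that 0.0 maps to an integer.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map integer codes back to approximate floating-point values."""
    return [(v - zero_point) * scale for v in q]


# Hypothetical layer weights.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each weight is stored in one byte instead of four or eight, at the cost of an error bounded by roughly one quantization step; whether that translates into a small accuracy loss has to be measured on the task, as the study does.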

Citations: 0