Objectives: This study aims to develop an innovative approach for monitoring and assessing labor pain through ECG waveform analysis, utilizing machine learning techniques to monitor pain resulting from uterine contractions.
Methods: The study was conducted at National Taiwan University Hospital between January and July 2020. We collected a dataset of 6010 ECG samples from women preparing for natural spontaneous delivery (NSD). These ECG data were used to develop a waveform-based Nociception Monitoring Index (NoM). The dataset was divided into training (80%) and validation (20%) sets. Multiple machine learning models, including LightGBM, XGBoost, SnapLogisticRegression, and SnapDecisionTree, were developed and evaluated. Hyperparameter optimization was performed using grid search and five-fold cross-validation to enhance model performance.
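The hyperparameter search described above (grid search combined with five-fold cross-validation) can be sketched in plain Python. This is an illustrative toy, not the study's pipeline: a single-threshold classifier stands in for LightGBM, and the data, parameter grid, and function names are all hypothetical.

```python
import itertools
import random

def five_fold_indices(n, seed=0):
    """Shuffle sample indices and split them into five folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

def fold_accuracy(model_fn, params, X, y, train_idx, val_idx):
    """Fit the model factory on the training fold and score the validation fold."""
    predict = model_fn(params, [X[i] for i in train_idx], [y[i] for i in train_idx])
    hits = sum(predict(X[i]) == y[i] for i in val_idx)
    return hits / len(val_idx)

def grid_search_cv(model_fn, grid, X, y):
    """Evaluate every parameter combination with five-fold CV; keep the best."""
    folds = five_fold_indices(len(X))
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        scores = []
        for k in range(5):
            val_idx = folds[k]
            train_idx = [i for j in range(5) if j != k for i in folds[j]]
            scores.append(fold_accuracy(model_fn, params, X, y, train_idx, val_idx))
        mean = sum(scores) / len(scores)
        if best is None or mean > best[1]:
            best = (params, mean)
    return best

# Toy stand-in for a gradient-boosting learner: a fixed threshold on one
# feature (it ignores the training fold, which a real learner would not).
def threshold_model(params, X_train, y_train):
    t = params["threshold"]
    return lambda x: int(x[0] > t)

X = [[v / 10] for v in range(20)]
y = [int(v >= 10) for v in range(20)]
params, score = grid_search_cv(threshold_model, {"threshold": [0.3, 0.95, 1.5]}, X, y)
```

The same loop structure generalizes to any model factory; in practice a library routine would also refit the winning configuration on the full training set.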
Results: The LightGBM model demonstrated superior performance with an AUC of 0.96 and an accuracy of 90%, making it the optimal model for monitoring labor pain based on ECG data. Other models, such as XGBoost and SnapLogisticRegression, also showed strong performance, with AUC values ranging from 0.88 to 0.95.
Conclusions: This study demonstrates that the integration of machine learning algorithms with ECG data significantly enhances the accuracy and reliability of labor pain monitoring. Specifically, the LightGBM model exhibits exceptional precision and robustness in continuous pain monitoring during labor, with potential applicability extending to broader healthcare settings.
Trial registration: ClinicalTrials.gov Identifier: NCT04461704.
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Objective: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.
Methods: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness.
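Most of the evaluation metrics listed above can be computed directly from a binary confusion matrix. The following is a minimal stdlib-Python sketch (AUC is omitted because it needs predicted scores rather than hard labels; the example labels are invented, and the toy assumes both classes occur in the predictions so no denominator is zero):

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute common imbalanced-classification metrics from 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn)            # sensitivity / true-positive rate
    specificity = tn / (tn + fp)       # true-negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)  # balances both class rates
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision,
            "F1-Score": f1, "G-mean": g_mean}

# Invented example: 3 positives among 8 samples (an imbalanced split).
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
```

The G-mean is the metric most sensitive to imbalance here, since a model that ignores the minority class drives recall, and hence the geometric mean, toward zero even while accuracy stays high.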
Results: The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.
Conclusions: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
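The oversampling idea behind SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours, can be sketched in plain Python. This is a toy illustration of the principle, not the implementation evaluated in the study, and the data points are invented:

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate synthetic minority samples by interpolating from a random
    minority point toward one of its k nearest minority neighbours
    (the core idea behind SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

# Four invented minority samples in 2-D feature space.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=10)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the minority region rather than being naive duplicates; ADASYN extends this idea by generating more points near the class boundary.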
Background: Identifying critical genes is important for understanding the pathogenesis of complex diseases. Traditional studies typically compare changes in biomolecules between normal and disease samples, or detect important vertices in a single static biomolecular network, and thus often overlook the dynamic changes that occur between different disease stages. However, investigating temporal changes in biomolecular networks and identifying critical genes is essential for understanding the occurrence and development of diseases.
Methods: A novel method called Quantifying Importance of Genes with Tensor Decomposition (QIGTD) was proposed in this study. It first constructs a time-series network by integrating both intra- and inter-temporal network information, preserving connections between networks at adjacent stages according to local similarities. A tensor is employed to describe the connections of this time-series network, and a 3-order tensor decomposition method was proposed to capture both the topological information of each network snapshot and the time-series characteristics of the whole network. QIGTD is also learning-free and efficient, and can be applied to datasets with a small number of samples.
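The idea of encoding a time-series network as a 3-order tensor (stage × gene × gene) can be illustrated in plain Python. This sketch is not the QIGTD decomposition itself: the importance score below is a simplified analogue of the T-degree baseline mentioned later, and the edge lists are invented.

```python
def build_network_tensor(snapshots, n_genes):
    """Stack per-stage weighted adjacency matrices into a 3-order tensor
    (stage x gene x gene), a common encoding for a time-series network."""
    tensor = []
    for edges in snapshots:
        adj = [[0.0] * n_genes for _ in range(n_genes)]
        for i, j, w in edges:
            adj[i][j] = adj[j][i] = w  # undirected, weighted edge
        tensor.append(adj)
    return tensor

def temporal_degree(tensor):
    """Toy importance score: each gene's total edge weight summed over all
    stages (a simplified analogue of the T-degree benchmark)."""
    n = len(tensor[0])
    return [sum(sum(adj[g]) for adj in tensor) for g in range(n)]

# Two invented stages over four genes; gene 1 stays highly connected in both.
snapshots = [
    [(0, 1, 1.0), (1, 2, 0.5)],
    [(1, 2, 1.0), (1, 3, 0.8)],
]
tensor = build_network_tensor(snapshots, n_genes=4)
scores = temporal_degree(tensor)
```

A decomposition-based method such as QIGTD goes further than this per-slice summation: it factorizes the whole tensor so that a gene's score reflects both its topology within each snapshot and its consistency across stages.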
Results: The effectiveness of QIGTD was evaluated on lung adenocarcinoma (LUAD) datasets, with three state-of-the-art methods, T-degree, T-closeness, and T-betweenness, employed as benchmarks. Numerical experiments demonstrate that QIGTD outperforms these methods in terms of both precision and mean average precision (mAP). Notably, of the top 50 genes, 29 have been verified as highly related to LUAD according to the DisGeNET database, and 36 are significantly enriched in LUAD-related Gene Ontology (GO) terms, including nuclear division, mitotic nuclear division, chromosome segregation, organelle fission, and mitotic sister chromatid segregation.
Conclusion: In conclusion, QIGTD effectively captures the temporal changes in gene networks and identifies critical genes. It provides a valuable tool for studying temporal dynamics in biological networks and can aid in understanding the underlying mechanisms of diseases such as LUAD.
Pangenomics is a relatively new scientific field that investigates the union of all the genomes of a clade. The word pan means "all" in ancient Greek; the term pangenomics originally applied to genomes of bacteria and was later extended to human genomes as well. Modern bioinformatics offers several tools to analyze pangenomic data, paving the way to an emerging field that we can call computational pangenomics. The computational power now available to the bioinformatics community has made computational pangenomic analyses easy to perform, but this greater accessibility also increases the chances of making mistakes and producing misleading or inflated results, especially for beginners. To address this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses, with a focus on bacterial pangenomics, describing common mistakes to avoid and experienced best practices to follow in this field. We believe our recommendations can help readers perform more robust and sound pangenomic analyses and generate more reliable results.
Cardiovascular diseases are the leading cause of death worldwide, and cardiovascular imaging techniques are the mainstay of noninvasive diagnosis. Aortic stenosis is a lethal cardiac disease preceded by aortic valve calcification for several years. Data-driven tools developed with Deep Learning (DL) algorithms can process and categorize medical image data, providing fast and reliable diagnoses to improve healthcare effectiveness. A systematic review of DL applications on medical images for pathologic calcium detection concluded that there are established techniques in this field, relying primarily on CT scans at the expense of radiation exposure. Echocardiography is an unexplored alternative for detecting calcium, but it still requires technological development. In this article, a fully automated method based on Convolutional Neural Networks (CNNs) was developed to detect aortic calcification in echocardiography images, consisting of two essential processes: (1) an object detector to locate the aortic valve, achieving 95% precision and 100% recall; and (2) a classifier to identify calcium structures in the valve, achieving 92% precision and 100% recall. The outcome of this work is the possibility of automating the detection of aortic valve calcification, a lethal and prevalent disease, with echocardiography.
Background: In recent years, significant morbidity and mortality in patients with severe inflammatory bowel disease (IBD) and cytomegalovirus (CMV) infection have drawn considerable attention to the status of CMV infection in the intestinal mucosa of IBD patients and its role in disease progression. However, there are currently no high-throughput sequencing data for ulcerative colitis patients with CMV infection (CMV + UC), and the immune microenvironment of CMV + UC patients has yet to be explored.
Method: The xCell algorithm was used to evaluate the immune microenvironment of CMV + UC patients. WGCNA was then performed to obtain co-expression modules linking abnormal immune cells to the gene and protein levels. Next, three machine learning approaches, Random Forest, SVM-RFE, and LASSO, were used to filter candidate biomarkers. Finally, a best subset selection algorithm was applied to construct the diagnostic model.
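Best subset selection, in its brute-force form, scores a model on every candidate feature subset and keeps the best one. The stdlib-Python sketch below illustrates that search; the additive scorer and its penalty are invented stand-ins for a model's cross-validated performance, with only the biomarker names taken from this study.

```python
import itertools

def best_subset(features, score_fn, max_size=None):
    """Exhaustively score every non-empty feature subset and keep the best
    (the brute-force idea behind best subset selection)."""
    max_size = max_size or len(features)
    best_set, best_score = (), float("-inf")
    for r in range(1, max_size + 1):
        for subset in itertools.combinations(features, r):
            s = score_fn(subset)
            if s > best_score:
                best_set, best_score = subset, s
    return best_set, best_score

# Invented per-feature usefulness values for four of the study's biomarkers.
usefulness = {"PPP1R12B": 0.4, "CIRBP": 0.35, "CSNK2A2": 0.1, "DNAJB11": 0.05}

def score(subset):
    """Toy criterion: total usefulness minus a complexity penalty per feature,
    standing in for cross-validated model performance."""
    return sum(usefulness[f] for f in subset) - 0.08 * len(subset)

subset, s = best_subset(list(usefulness), score)
```

Because the search is exhaustive (2^p - 1 subsets for p features), it is only feasible after upstream filters such as Random Forest, SVM-RFE, and LASSO have reduced the candidate pool to a handful of features.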
Results: In this study, we performed transcriptomic and proteomic sequencing on CMV + UC patients to establish a comprehensive immune microenvironment profile and found 11 specific abnormal immune cells in CMV + UC group. After using multi-omics integration algorithms, we identified seven co-expression gene modules and five co-expression protein modules. Subsequently, we utilized various machine learning algorithms to identify key biomarkers with diagnostic efficacy and constructed an early diagnostic model. We identified a total of eight biomarkers (PPP1R12B, CIRBP, CSNK2A2, DNAJB11, PIK3R4, RRBP1, STX5, TMEM214) that play crucial roles in the immune microenvironment of CMV + UC and exhibit superior diagnostic performance for CMV + UC.
Conclusion: This eight-biomarker model offers a new paradigm for the diagnosis and treatment of IBD patients after CMV infection. Further research into this model will be significant for understanding the changes in the host immune microenvironment following CMV infection.
Pharmacokinetic (PK) studies can provide essential information on the abuse liability of nicotine and tobacco products, but they are intrusive and must be conducted in a clinical environment. The objective of this study was to explore whether changes in plasma nicotine levels following use of an e-cigarette can be predicted from real-time monitoring of physiological parameters and mouth-level exposure (MLE) to nicotine before, during, and after e-cigarette vaping, using wearable devices. Such an approach would allow an effective pre-screening process, reducing the number of clinical studies, the number of products to be tested, and the number of blood draws required in a clinical PK study. Establishing such a prediction model might facilitate the longitudinal collection of data on product use and nicotine expression among consumers using nicotine products in their normal environments, thereby reducing the need for intrusive clinical studies while generating PK data related to real-world product use. An exploratory machine learning model was developed to predict changes in plasma nicotine levels following use of an e-cigarette from real-time monitoring of physiological parameters and MLE to nicotine before, during, and after e-cigarette vaping. This preliminary study identified key parameters, such as heart rate (HR), heart rate variability (HRV), and physiological stress (PS), that may act as predictors of an individual's plasma nicotine response (PK curve). Relative to baseline measurements (per participant), HR showed a significant increase for nicotine-containing e-liquids and was consistent across sessions (intra-participant). Imputing missing values and training the model on all data yielded a 57% improvement over the original 'learning' data and achieved a median validation R2 of 0.70. The study is in its exploratory phase, with limitations including a small, non-diverse sample size and reliance on data from a single e-cigarette product.
These findings necessitate further research for validation and to enhance the model's generalisability and applicability in real-world settings. This study serves as a foundational step towards developing non-intrusive PK models for nicotine product use.