
Biodata Mining: Latest Articles

EnSCAN: ENsemble Scoring for prioritizing CAusative variaNts across multiplatform GWASs for late-onset Alzheimer's disease.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-04 | DOI: 10.1186/s13040-025-00436-x
Onur Erdogan, Cem Iyigun, Yeşim Aydın Son

Late-onset Alzheimer's disease (LOAD) is a progressive and complex neurodegenerative disorder of the aging population. LOAD is characterized by cognitive decline, such as deterioration of memory, loss of intellectual abilities, and other cognitive domains resulting from traumatic brain injuries. Alzheimer's Disease (AD) presents a complex genetic etiology that is still unclear, which limits its early or differential diagnosis. Genome-Wide Association Studies (GWAS) enable the exploration of individual variants' statistical interactions at candidate loci, but univariate analysis overlooks interactions between variants. Machine learning (ML) algorithms can capture hidden, novel, and significant patterns while considering nonlinear interactions between variants to understand the genetic predisposition for complex genetic disorders. When working on different platforms, majority voting cannot be applied because the attributes differ. Hence, a new post-ML ensemble approach was developed to select significant SNVs via multiple genotyping platforms. We proposed the EnSCAN framework using a new algorithm to ensemble selected variants even from different platforms to prioritize candidate causative loci, which consequently helps improve ML results by combining the prior information captured from each dataset. The proposed ensemble algorithm utilizes the chromosomal locations of SNVs by mapping to cytogenetic bands, along with the proximities between pairs and multimodel Random Forest (RF) validations to prioritize SNVs and candidate causative genes for LOAD. The scoring method is scalable and can be applied to any multiplatform genotyping study. We present how the proposed EnSCAN scoring algorithm prioritizes candidate causative variants related to LOAD among three GWAS datasets.
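
The abstract notes that majority voting cannot be applied when platforms genotype different SNVs, and that the ensemble instead uses chromosomal locations mapped to cytogenetic bands. A minimal Python sketch of that one idea is given here. It is not the EnSCAN scoring algorithm, which according to the abstract also uses pairwise proximities and multimodel Random Forest validation. The band table, chromosome name, platform names, and positions are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical band table: chromosome -> sorted list of (band_start, band_name).
# Real cytogenetic band coordinates would come from a genome annotation file.
TOY_BANDS = {"chrT": [(0, "p1"), (1000, "q1"), (5000, "q2")]}


def band_of(chrom, pos):
    """Return the label of the band that contains position `pos`."""
    bands = TOY_BANDS[chrom]
    starts = [start for start, _ in bands]
    idx = bisect_right(starts, pos) - 1  # last band starting at or before pos
    return f"{chrom}:{bands[idx][1]}"


def band_support(selected_by_platform):
    """Count how many platforms selected at least one SNV in each band.

    `selected_by_platform` maps a platform name to a list of (chrom, pos)
    pairs, so platforms with non-overlapping SNV sets can still be compared.
    """
    support = defaultdict(set)
    for platform, snvs in selected_by_platform.items():
        for chrom, pos in snvs:
            support[band_of(chrom, pos)].add(platform)
    return {band: len(platforms) for band, platforms in support.items()}


# Three hypothetical platforms that share no SNV position.
selected = {
    "platform_A": [("chrT", 1200)],
    "platform_B": [("chrT", 1500), ("chrT", 6000)],
    "platform_C": [("chrT", 4999)],
}
support = band_support(selected)
```

In this toy input no position is shared between platforms, so per-SNV voting would find no agreement, while the band-level count shows all three platforms pointing at the same region.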

Biodata Mining 18(1):20. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11881353/pdf/
Citations: 0
Analysis of global trends and hotspots of skin microbiome in acne: a bibliometric perspective.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-03 | DOI: 10.1186/s13040-025-00433-0
Lanfang Zhang, Yuan Cai, Lin Li, Jie Hu, Changsha Jia, Xu Kuang, Yi Zhou, Zhiai Lan, Chunyan Liu, Feng Jiang, Nana Sun, Ni Zeng

Background: Acne is a chronic inflammatory condition affecting the hair follicles and sebaceous glands. Recent research has revealed significant advances in the study of the acne skin microbiome. Systematic analysis of research trends and hotspots in the acne skin microbiome is lacking. This study utilized bibliometric methods to conduct in-depth research on the recognition structure of the acne skin microbiome, identifying hot trends and emerging topics.

Methods: We performed a topic search to retrieve articles about skin microbiome in acne from the Web of Science Core Collection. Bibliometric research was conducted using CiteSpace, VOSviewer, and R language.

Results: This study analyzed 757 articles from 1362 institutions in 68 countries, the United States leading the research efforts. Notably, Brigitte Dréno from the University of Nantes emerged as the most prolific author in this field, with 19 papers and 334 co-citations. The research output on the skin microbiome of acne continues to increase, with Experimental Dermatology being the journal with the highest number of published articles. The primary focus is investigating the skin microbiome's mechanisms in acne development and exploring treatment strategies. These findings have important implications for developing microbiome-targeted therapies, which could provide new, personalized treatment options for patients with acne. Emerging research hotspots include skincare, gut microbiome, and treatment.

Conclusion: The study's findings indicate a thriving research interest in the skin microbiome and its relationship to acne, focusing on acne treatment through the regulation of the skin microbiome balance. Currently, the development of skincare products targeting the regulation of the skin microbiome represents a research hotspot, reflecting the transition from basic scientific research to clinical practice.

Biodata Mining 18(1):19. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11874858/pdf/
Citations: 0
Inferring protein from transcript abundances using convolutional neural networks.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-02-27 | DOI: 10.1186/s13040-025-00434-z
Patrick Maximilian Schwehn, Pascal Falter-Braun

Background: Although transcript abundance is often used as a proxy for protein abundance, it is an unreliable predictor. As proteins execute biological functions and their expression levels influence phenotypic outcomes, we developed a convolutional neural network (CNN) to predict protein abundances from mRNA abundances, protein sequence, and mRNA sequence in Homo sapiens (H. sapiens) and the reference plant Arabidopsis thaliana (A. thaliana).

Results: After hyperparameter optimization and initial data exploration, we implemented distinct training modules for value-based and sequence-based data. By analyzing the learned weights, we revealed common and organism-specific sequence features that influence protein-to-mRNA ratios (PTRs), including known and putative sequence motifs. Adding condition-specific protein interaction information identified genes correlated with many PTRs but did not improve predictions, likely due to insufficient data. The integrated model predicted protein abundance on unseen genes with a coefficient of determination (r²) of 0.30 in H. sapiens and 0.32 in A. thaliana.

Conclusions: For H. sapiens, our model improves prediction performance by nearly 50% compared to previous sequence-based approaches, and for A. thaliana it represents the first model of its kind. The model's learned motifs recapitulate known regulatory elements, supporting its utility in systems-level and hypothesis-driven research approaches related to protein regulation.
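
The results above are reported as a coefficient of determination. The abstract does not say whether this is the squared Pearson correlation or the regression form 1 - SS_res/SS_tot, and the two can differ substantially when predictions are biased or mis-scaled. The following Python sketch computes both on made-up numbers; the function names and data are illustrative assumptions, not taken from the paper.

```python
def r2_regression(y_true, y_pred):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_pearson(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


# Made-up example: predictions are perfectly correlated with the truth
# but off by a factor of two.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]

pearson_value = r2_pearson(measured, predicted)        # perfect correlation
regression_value = r2_regression(measured, predicted)  # penalizes the scale error
```

With these inputs the squared correlation is 1 while the regression form is negative, which is why it helps to know which definition a reported value uses.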

Biodata Mining 18(1):18. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11866710/pdf/
Citations: 0
AI as an accelerator for defining new problems that transcends boundaries.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-02-18 | DOI: 10.1186/s13040-025-00429-w
Tayo Obafemi-Ajayi, Steven F Jennings, Yu Zhang, Kara Li Liu, Joan Peckham, Jason H Moore

Interdisciplinary, transdisciplinary, convergence, and No-Boundary Thinking (NBT) research are methodology- and technology-agnostic approaches to problem solving. The focus is on defining problems informed by access to multiple knowledge sources and expert perspectives across different domains, with the goal of accessing all available knowledge sources and perspectives. While access to all available knowledge sources and perspectives could be seen as a difficult-to-attain objective, with the recent rise of AI we might be closer to approaching this goal. We review several examples of methodologies and technologies that have been used to put these strategies into action, but the primary focus of this paper is on how recent advances in AI now enable a quantum leap forward in defining new problems. By leveraging the capacity of AI to synthesize knowledge from multiple domains, these tools can be used to propose multiple candidate problem definitions. AI is uniquely able to draw upon many more knowledge sources than any individual, or even a very large team, could. Coupled with human intelligence, better problems can be defined to address complex scholarly or societal challenges.

Biodata Mining 18(1):17. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11837601/pdf/
Citations: 0
Machine learning models for reinjury risk prediction using cardiopulmonary exercise testing (CPET) data: optimizing athlete recovery.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-02-17 | DOI: 10.1186/s13040-025-00431-2
Arezoo Abasi, Ahmad Nazari, Azar Moezy, Seyed Ali Fatemi Aghda

Background: Cardiopulmonary Exercise Testing (CPET) provides detailed insights into athletes' cardiovascular and pulmonary function, making it a valuable tool in assessing recovery and injury risks. However, traditional statistical models often fail to leverage the full potential of CPET data in predicting reinjury. Machine learning (ML) algorithms offer promising capabilities in uncovering complex patterns within this data, allowing for more accurate injury risk assessment.

Objective: This study aimed to develop machine learning models to predict reinjury risk among elite soccer players using CPET data. Specifically, we sought to identify key physiological and performance variables that correlate with reinjury and to evaluate the performance of various ML algorithms in generating accurate predictions.

Methods: A dataset of 256 elite soccer players from 16 national and top-tier teams in Iran was analyzed, incorporating physiological variables and categorical data. Several machine learning models, including CatBoost, SVM, Random Forest, and XGBoost, were employed to predict reinjury risk. Model performance was assessed using metrics such as accuracy, precision, recall, F1-score, AUC, and SHAP values to ensure robust evaluation and interpretability.

Results: CatBoost and SVM exhibited the best performance, with CatBoost achieving the highest accuracy (0.9138) and F1-score (0.9148), and SVM achieving the highest AUC (0.9725). A significant association was found between a history of concussion and reinjury risk (χ² = 13.0360, p = 0.0015), highlighting the importance of neurological recovery in preventing future injuries. Heart rate metrics, particularly HRmax and HR2, were also significantly lower in players who experienced reinjury, indicating reduced cardiovascular capacity in this group.

Conclusion: Machine learning models, particularly CatBoost and SVM, provide promising tools for predicting reinjury risk using CPET data. These models offer clinicians more precise, data-driven insights into athlete recovery and risk management. Future research should explore the integration of external factors such as training load and psychological readiness to further refine these predictions and enhance injury prevention protocols.
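
For readers less familiar with the metrics quoted in the Results, the Python sketch below shows how accuracy, precision, recall, and F1-score follow from a confusion matrix, and how a chi-square statistic converts to a p-value. The confusion counts are invented for illustration and are not the study's data. The p-value helper assumes 2 degrees of freedom, which the abstract does not state; under that assumption the reported χ² = 13.0360 reproduces the reported p = 0.0015.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that none of the denominators is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def chi2_p_value_2df(statistic):
    """Upper-tail p-value of a chi-square statistic with exactly 2 degrees
    of freedom, where the survival function reduces to exp(-x / 2)."""
    return math.exp(-statistic / 2)


# Invented confusion counts, for illustration only.
toy_metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# Statistic quoted in the abstract; the 2 degrees of freedom are an assumption.
p_value = chi2_p_value_2df(13.0360)
```

Other degrees of freedom require the general chi-square survival function, for example from a statistics library.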

Biodata Mining 18(1):16. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11834553/pdf/
Citations: 0
Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-02-15 | DOI: 10.1186/s13040-025-00430-3
Christel Sirocchi, Martin Urschler, Bastian Pfeifer

Explainable and interpretable machine learning has emerged as essential in leveraging artificial intelligence within high-stakes domains such as healthcare to ensure transparency and trustworthiness. Feature importance analysis plays a crucial role in improving model interpretability by pinpointing the most relevant input features, particularly in disease subtyping applications, aimed at stratifying patients based on a small set of signature genes and biomarkers. While clustering methods, including unsupervised random forests, have demonstrated good performance, approaches for evaluating feature contributions in an unsupervised regime are notably scarce. To address this gap, we introduce a novel methodology to enhance the interpretability of unsupervised random forests by elucidating feature contributions through the construction of feature graphs, both over the entire dataset and individual clusters, that leverage parent-child node splits within the trees. Feature selection strategies to derive effective feature combinations from these graphs are presented and extensively evaluated on synthetic and benchmark datasets against state-of-the-art methods, standing out for performance, computational efficiency, reliability, versatility and ability to provide cluster-specific insights. In a disease subtyping application, clustering kidney cancer gene expression data over a feature subset selected with our approach reveals three patient groups with different survival outcomes. Cluster-specific analysis identifies distinctive feature contributions and interactions, essential for devising targeted interventions, conducting personalised risk assessments, and enhancing our understanding of the underlying molecular complexities.
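
The abstract describes feature graphs built from parent-child node splits within the trees. One way to picture this is the Python sketch below, which counts how often one feature's split is the direct parent of another feature's split across a small forest, then ranks features by weighted degree. The tree encoding, the edge weighting by raw counts, the centrality measure, and the feature names are all assumptions made for illustration; the paper's actual construction may differ.

```python
from collections import Counter

# A tree is either None (a leaf) or a tuple (feature_name, left, right).


def count_split_edges(tree, edges):
    """Add one count per (parent_feature, child_feature) pair in `tree`."""
    if tree is None:
        return
    feature, left, right = tree
    for child in (left, right):
        if child is not None:
            edges[(feature, child[0])] += 1
            count_split_edges(child, edges)


def feature_graph(forest):
    """Directed, weighted edge list aggregated over all trees."""
    edges = Counter()
    for tree in forest:
        count_split_edges(tree, edges)
    return edges


def weighted_degree(edges):
    """Sum of incident edge weights per feature, ignoring direction."""
    degree = Counter()
    for (parent, child), weight in edges.items():
        degree[parent] += weight
        degree[child] += weight
    return degree


# Hypothetical two-tree forest over features g1, g2, g3.
toy_forest = [
    ("g1", ("g2", None, None), ("g3", None, None)),
    ("g1", ("g2", ("g3", None, None), None), None),
]
graph = feature_graph(toy_forest)
centrality = weighted_degree(graph)
```

Restricting the count to the tree paths followed by the samples of one cluster would give a cluster-specific graph, in the spirit of the per-cluster analysis the abstract mentions.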

Biodata Mining 18(1):15. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829558/pdf/
Citations: 0
Agenda setting for health equity assessment through the lenses of social determinants of health using machine learning approach: a framework and preliminary pilot study.
IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-02-10 | DOI: 10.1186/s13040-025-00428-x
Maryam Ramezani, Mohammadreza Mobinizadeh, Ahad Bakhtiari, Hamid R Rabiee, Maryam Ramezani, Hakimeh Mostafavi, Alireza Olyaeemanesh, Ali Akbar Fazaeli, Alireza Atashi, Saharnaz Sazgarnejad, Efat Mohamadi, Amirhossein Takian

Introduction: The integration of Artificial Intelligence (AI) and Machine Learning (ML) is transforming public health by enhancing the assessment and mitigation of health inequities. As the use of AI tools, especially ML techniques, rises, they play a pivotal role in informing policies that promote a more equitable society. This study aims to develop a framework utilizing ML to analyze health system data and set agendas for health equity interventions, focusing on social determinants of health (SDH).

Method: This study used the CRISP-ML(Q) model to introduce a platform for health equity assessment and to facilitate its design and implementation in health systems. First, a conceptual model was developed through a comprehensive literature review and document analysis. A pilot implementation then tested the feasibility and effectiveness of using ML algorithms to assess health equity. Life expectancy was chosen as the health outcome for this pilot; data from 2000 to 2020 with 140 features were cleaned, transformed, and prepared for modeling. Multiple ML models were developed and evaluated using SPSS Modeler software version 18.0.

Results: ML algorithms effectively identified key SDH influencing life expectancy. Among the classification models, the Linear Discriminant algorithm was selected as the best because of its high accuracy in both the training and testing phases, its strong performance in identifying key features, and its good generalizability to new data. Among the numeric models, CHAID was the best at predicting the actual value of life expectancy from the various features. These models highlighted the importance of features such as current health expenditure, domestic general government health expenditure, and GDP in predicting life expectancy.

Conclusion: The findings underscore the significance of employing innovative methods like CRISP-ML(Q) and ML algorithms to enhance health equity. Integrating this platform into health systems can help countries better prioritize and address health inequities. The pilot implementation demonstrated these methods' practical applicability and effectiveness, aiding policymakers in making informed decisions to improve health equity.
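The models in this pilot were built in SPSS Modeler, so the following is only an illustrative sketch of the idea behind a linear discriminant classifier and of the train-versus-test accuracy comparison the abstract uses as evidence of generalizability. It assumes a two-class Fisher discriminant on two synthetic features; the data, function names, and class centers are hypothetical and are not taken from the study.

```python
import random

def mean_vec(rows):
    """Component-wise mean of a list of 2-D points."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)

def scatter(rows, mu):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mu."""
    sxx = sxy = syy = 0.0
    for x, y in rows:
        dx, dy = x - mu[0], y - mu[1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    return sxx, sxy, syy

def fit_lda(class0, class1):
    """Fisher discriminant: w = Sw^-1 (mu1 - mu0), threshold at the midpoint."""
    mu0, mu1 = mean_vec(class0), mean_vec(class1)
    a0, b0, c0 = scatter(class0, mu0)
    a1, b1, c1 = scatter(class1, mu1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1      # pooled within-class scatter
    det = a * c - b * b
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)
    mid = ((mu0[0] + mu1[0]) / 2, (mu0[1] + mu1[1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]

def predict(model, point):
    w, threshold = model
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0

def accuracy(model, class0, class1):
    hits = sum(predict(model, p) == 0 for p in class0)
    hits += sum(predict(model, p) == 1 for p in class1)
    return hits / (len(class0) + len(class1))

def sample(rng, center, n):
    """Synthetic points drawn around (center, center) with unit variance."""
    return [(rng.gauss(center, 1.0), rng.gauss(center, 1.0)) for _ in range(n)]

rng = random.Random(0)                        # fixed seed for repeatability
train0, train1 = sample(rng, 0.0, 100), sample(rng, 3.0, 100)
test0, test1 = sample(rng, 0.0, 50), sample(rng, 3.0, 50)

model = fit_lda(train0, train1)
train_acc = accuracy(model, train0, train1)
test_acc = accuracy(model, test0, test1)
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

A small gap between the two printed accuracies is the kind of signal the abstract describes as good generalizability; a large gap would instead suggest overfitting.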

Citations: 0
Immune cell profiles and predictive modeling in osteoporotic vertebral fractures using XGBoost machine learning algorithms.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-02-04 DOI: 10.1186/s13040-025-00427-y
Yi-Chou Chen, Hui-Chen Su, Shih-Ming Huang, Ching-Hsiao Yu, Jen-Huei Chang, Yi-Lin Chiu

Background: Osteoporosis significantly increases the risk of vertebral fractures, particularly among postmenopausal women, decreasing their quality of life. These fractures, often undiagnosed, can lead to severe health consequences and are influenced by bone mineral density and abnormal loads. Management strategies range from non-surgical interventions to surgical treatments. Moreover, the interaction between immune cells and bone cells plays a crucial role in bone repair processes, highlighting the importance of osteoimmunology in understanding and treating bone pathologies.

Methods: This study aims to investigate the xCell signature-based immune cell profiles in osteoporotic patients with and without vertebral fractures, utilizing advanced predictive modeling through the XGBoost algorithm.

Results: Our findings reveal an increased presence of CD4+ naïve T cells and central memory T cells in patients with vertebral fractures (VF), indicating distinct adaptive immune responses. The XGBoost model identified Th1 cells, CD4 memory T cells, and hematopoietic stem cells as key predictors of VF. Notably, VF patients exhibited a reduction in Th1 cells and an enrichment of Th17 cells, which promote osteoclastogenesis and bone resorption. Gene expression analysis further showed an upregulation of osteoclast-related genes and a downregulation of osteoblast-related genes in VF patients, pointing to a disrupted balance between bone formation and resorption. These findings underscore the critical role of immune cells in the pathogenesis of osteoporotic fractures and highlight the potential of XGBoost in identifying key biomarkers and therapeutic targets for mitigating fracture risk in osteoporotic patients.
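To give a sense of how a gradient-boosted tree model can single out "key predictors", the sketch below computes the standard XGBoost split gain for the first tree of a logistic-loss model and ranks features by their best gain. It is a hand-rolled illustration under stated assumptions, not the study's pipeline and not the xgboost library: the feature names and values are toy numbers, and a real model aggregates gains over many trees.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost structure-score gain for one candidate split."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

def best_gain(values, labels, base_prob=0.5, lam=1.0):
    """Best split gain of one feature, starting from a constant prediction."""
    grads = [base_prob - y for y in labels]            # logistic loss gradient
    hess = [base_prob * (1.0 - base_prob)] * len(labels)
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    g_left = h_left = 0.0
    best = 0.0
    for pos in range(len(order) - 1):
        i = order[pos]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[order[pos + 1]]:
            continue                                   # no threshold between ties
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left, lam)
        best = max(best, gain)
    return best

# Toy data (hypothetical): label 1 = fracture, 0 = no fracture.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "Th1": [0.9, 0.8, 0.7, 0.85, 0.2, 0.3, 0.1, 0.25],   # lower in fracture cases
    "noise": [0.5, 0.1, 0.9, 0.3, 0.6, 0.2, 0.8, 0.4],   # unrelated to the label
}

ranking = sorted(
    ((name, best_gain(vals, labels)) for name, vals in features.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, gain in ranking:
    print(f"{name}: best split gain = {gain:.3f}")
```

The feature that separates the two groups cleanly receives a much larger gain than the unrelated one, which is the intuition behind gain-based importance rankings.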

Citations: 0
XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-02-03 DOI: 10.1186/s13040-024-00415-8
Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A AlQahtani, Nijad Ahmad

Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, significantly affecting gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important because of their links to Parkinson's and Alzheimer's disease. This study introduces XGBoost-Sumo, a robust model that predicts sumoylation sites by integrating protein structure and sequence data. The model uses a transformer-based attention mechanism to encode peptides and extracts evolutionary features through the PsePSSM-DWT approach. After fusing word embeddings with evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for feature selection and uses eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model's reliability and high performance make it a valuable resource for researchers, with strong potential for applications in pharmaceutical development.
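The 10-fold cross-validation mentioned above can be sketched as follows. This is a generic, library-free illustration of the protocol (shuffle once, split into ten disjoint folds, hold each fold out in turn); the function names and the `evaluate` callback are hypothetical, and the study's actual splitting procedure (for example, whether folds were stratified) is not stated in the abstract.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices once and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score of `evaluate(train_idx, test_idx)` over k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for held_out, test_idx in enumerate(folds):
        # Training set = every fold except the one currently held out.
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# Usage with a placeholder scorer that reports the held-out fraction.
folds = k_fold_indices(103)
mean_test_fraction = cross_validate(
    103, lambda train, test: len(test) / (len(train) + len(test))
)
print(len(folds), [len(f) for f in folds], round(mean_test_fraction, 3))
```

In practice `evaluate` would train a classifier on `train_idx` and return its accuracy on `test_idx`; the independent test set reported in the abstract is kept entirely outside this loop.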

Citations: 0
MiCML: a causal machine learning cloud platform for the analysis of treatment effects using microbiome profiles.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-01-30 DOI: 10.1186/s13040-025-00422-3
Hyunwook Koh, Jihun Kim, Hyojung Jang

Background: Treatment effects are heterogeneous across patients because of differences in their microbiomes, which in turn implies that the treatment effect can be enhanced by manipulating a patient's microbiome profile. Accordingly, the coadministration of microbiome-based dietary supplements or therapeutics alongside the primary treatment has been the subject of intensive investigation. To do this, however, we first need to understand which microbes help (or hinder) the treatment in curing the patient's disease.

Results: In this paper, we introduce a cloud platform, named microbiome causal machine learning (MiCML), for the analysis of treatment effects using microbiome profiles in a user-friendly web environment. MiCML is distinguished by two up-to-date features: (i) batch effect correction, which mitigates systematic variation in pooled large-scale microbiome data arising from differences between the underlying batches, and (ii) causal machine learning, which estimates treatment effects consistently and then identifies microbial taxa that enhance (or lower) the efficacy of the primary treatment. We also stress that MiCML can handle data from either randomized controlled trials or observational studies.

Conclusion: We describe MiCML as a useful analytic tool for microbiome-based personalized medicine. MiCML is freely available on our web server (http://micml.micloud.kr). MiCML can also be implemented locally on the user's computer through our GitHub repository (https://github.com/hk1785/micml).
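The notion of a treatment effect that differs by microbiome profile can be illustrated with a deliberately simple estimator. The sketch below is not MiCML's method: it assumes data from a randomized trial, where a plain difference in mean outcomes between treated and control patients estimates the average treatment effect, and it compares that difference between patients who carry a given taxon and those who do not. The records, the taxon name `taxon_a`, and the function names are hypothetical.

```python
from statistics import mean

def average_treatment_effect(records):
    """Difference in mean outcome, treated minus control (valid under randomization)."""
    treated = [r["outcome"] for r in records if r["treated"]]
    control = [r["outcome"] for r in records if not r["treated"]]
    return mean(treated) - mean(control)

def effect_by_taxon(records, taxon):
    """Treatment effect among carriers and non-carriers of one taxon."""
    present = [r for r in records if r["taxa"][taxon] > 0]
    absent = [r for r in records if r["taxa"][taxon] == 0]
    return average_treatment_effect(present), average_treatment_effect(absent)

def make_record(treated, abundance, outcome):
    return {"treated": treated, "taxa": {"taxon_a": abundance}, "outcome": outcome}

# Toy trial (hypothetical numbers): the treatment helps more when taxon_a is present.
trial = [
    make_record(True, 0.30, 8), make_record(True, 0.10, 9),
    make_record(False, 0.20, 4), make_record(False, 0.40, 5),
    make_record(True, 0.0, 5), make_record(True, 0.0, 6),
    make_record(False, 0.0, 4), make_record(False, 0.0, 5),
]

overall = average_treatment_effect(trial)
with_taxon, without_taxon = effect_by_taxon(trial, "taxon_a")
print(f"overall={overall}, with taxon_a={with_taxon}, without taxon_a={without_taxon}")
```

For observational data this naive contrast is confounded, which is one reason dedicated causal machine learning estimators of the kind the platform describes are needed.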

Citations: 0