
Latest publications in Biodata Mining

Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-04-11 | DOI: 10.1186/s13040-025-00439-8
Amr Eledkawy, Taher Hamza, Sara El-Metwally

Background: Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.

Results: The proposed system employs a multi-stage binary classification framework in which each stage is customized for a specific cancer type. Features are selected by majority vote across six feature selectors: Information Value, Chi-Square, Random Forest feature importance, Extra Trees feature importance, Recursive Feature Elimination, and L1 regularization. After feature selection, classifiers (including eXtreme Gradient Boosting, Random Forest, Extra Trees, and Quadratic Discriminant Analysis) are customized for each cancer type individually or combined in an ensemble soft-voting setup to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility, the trained models and the dataset used in this study are publicly available via the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).

Conclusion: The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.
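As a rough illustration of the Results above, the sketch below combines a majority-vote feature selection step with a soft-voting ensemble on synthetic data. It is not the authors' code: Information Value is omitted, scikit-learn's GradientBoostingClassifier stands in for eXtreme Gradient Boosting, and the data, vote threshold, and feature counts are made up.

```python
# Hypothetical sketch (not the authors' code): majority-vote feature selection
# across five selectors, followed by a soft-voting ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_selection import RFE, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=400, n_features=40, n_informative=8, random_state=0)
X = MinMaxScaler().fit_transform(X)      # chi2 requires non-negative features
k = 10                                   # number of features each selector keeps
votes = np.zeros(X.shape[1], dtype=int)

votes[np.argsort(chi2(X, y)[0])[-k:]] += 1                                   # Chi-Square
votes[np.argsort(RandomForestClassifier(random_state=0)
                 .fit(X, y).feature_importances_)[-k:]] += 1                 # RF importance
votes[np.argsort(ExtraTreesClassifier(random_state=0)
                 .fit(X, y).feature_importances_)[-k:]] += 1                 # Extra Trees importance
votes[RFE(LogisticRegression(max_iter=1000), n_features_to_select=k)
      .fit(X, y).support_] += 1                                              # RFE
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
votes[np.argsort(np.abs(l1.coef_[0]))[-k:]] += 1                             # L1 regularization

X_sel = X[:, votes >= 3]                 # keep features chosen by a majority of selectors
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0)),
                ("qda", QuadraticDiscriminantAnalysis())],
    voting="soft")
ensemble.fit(X_tr, y_tr)
print("held-out AUC:", roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1]))
```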

Citations: 0
Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-04-04 | DOI: 10.1186/s13040-025-00443-y
Di Zhao, Wenxuan Mu, Xiangxing Jia, Shuang Liu, Yonghe Chu, Jiana Meng, Hongfei Lin

Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, matching the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales. To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and to enhance feature representation based on PubMedBERT. We evaluated the method on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE), and it outperformed current state-of-the-art models, including mainstream large language models such as ChatGPT, in most few-shot scenarios. The results confirm the effectiveness of the proposed method in data augmentation and model generalization.
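A toy PyTorch sketch of the multi-scale idea: parallel convolutions of different widths over token embeddings, concatenated before a per-token tagging head. The embedding layer stands in for PubMedBERT, plain Conv1d layers stand in for the paper's dynamic convolution, and every dimension is illustrative.

```python
# Illustrative multi-scale token tagger (not the paper's model): parallel Conv1d
# branches with different kernel sizes are concatenated before a BIO tag head.
import torch
import torch.nn as nn

class MultiScaleTagger(nn.Module):
    def __init__(self, vocab_size=30522, emb_dim=128, n_tags=3, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)            # stand-in for PubMedBERT
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, 64, k, padding=k // 2) for k in kernel_sizes]
        )
        self.head = nn.Linear(64 * len(kernel_sizes), n_tags)   # e.g. B / I / O tags

    def forward(self, token_ids):                               # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)                 # (batch, emb_dim, seq_len)
        feats = [torch.relu(conv(x)) for conv in self.convs]    # one scale per branch
        x = torch.cat(feats, dim=1).transpose(1, 2)             # (batch, seq_len, 64 * n_scales)
        return self.head(x)                                     # per-token tag logits

logits = MultiScaleTagger()(torch.randint(0, 30522, (2, 16)))
print(logits.shape)                                             # torch.Size([2, 16, 3])
```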

Citations: 0
Multivariate longitudinal clustering reveals neuropsychological factors as dementia predictors in an Alzheimer's disease progression study.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-28 | DOI: 10.1186/s13040-025-00441-0
Patrizia Ribino, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini

Dementia due to Alzheimer's disease (AD) is a multifaceted neurodegenerative disorder characterized by various factors of cognitive and behavioral decline. In this work, we propose an extension of traditional k-means clustering to multivariate time series data that clusters the joint trajectories of different features describing progression over time. The algorithm enables the joint analysis of various longitudinal features to explore co-occurring trajectory factors among markers of cognitive decline in individuals participating in an AD progression study. By examining how multiple variables co-vary and evolve together, we identify distinct subgroups within the cohort based on their longitudinal trajectories. Our clustering method enhances the understanding of individual development across multiple dimensions and provides deeper medical insights into the trajectories of cognitive decline. In addition, the proposed algorithm selects the features that are most significant for separating clusters by considering trajectories over time. This process, together with preliminary pre-processing of the OASIS-3 dataset, reveals an important role for some neuropsychological factors. In particular, the proposed method identified a profile compatible with a syndrome known as Mild Behavioral Impairment (MBI), in which behavioral manifestations may precede the cognitive symptoms typically observed in AD patients. The findings underscore the importance of considering multiple longitudinal features in clinical modeling, ultimately supporting more effective and individualized patient management strategies.
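A much-simplified stand-in for the proposed method: z-score the multivariate trajectories, flatten each subject's trajectory into a single vector, and run ordinary k-means, so that subjects are grouped by how their features jointly evolve. The data below are simulated and the distance is plain Euclidean, unlike the paper's extension.

```python
# Simplified illustration: cluster joint multivariate trajectories with plain k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_subjects, n_visits, n_features = 120, 6, 4           # e.g. cognitive scores over six visits
trajectories = rng.normal(size=(n_subjects, n_visits, n_features))
trajectories[:60] += np.linspace(0, 2, n_visits)[None, :, None]   # one subgroup declines faster

# z-score each feature over all subjects and visits so no single scale dominates
mean = trajectories.mean(axis=(0, 1), keepdims=True)
std = trajectories.std(axis=(0, 1), keepdims=True)
flat = ((trajectories - mean) / std).reshape(n_subjects, -1)      # one joint-trajectory vector per subject

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(flat)
print("cluster sizes:", np.bincount(labels))
```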

Citations: 0
Network-based multi-omics integrative analysis methods in drug discovery: a systematic review.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-28 | DOI: 10.1186/s13040-025-00442-z
Wei Jiang, Weicai Ye, Xiaoming Tan, Yun-Juan Bao

The integration of multi-omics data from diverse high-throughput technologies has revolutionized drug discovery. While various network-based methods have been developed to integrate multi-omics data, systematic evaluation and comparison of these methods remain challenging. This review aims to analyze network-based approaches for multi-omics integration and evaluate their applications in drug discovery. We conducted a comprehensive review of the literature (2015-2024) on network-based multi-omics integration methods in drug discovery and categorized the methods into four primary types: network propagation/diffusion, similarity-based approaches, graph neural networks, and network inference models. We also discussed the applications of these methods in three scenarios of drug discovery, including drug target identification, drug response prediction, and drug repurposing, and finally evaluated their performance, highlighting their advantages and limitations in specific applications. While network-based multi-omics integration has shown promise in drug discovery, challenges remain in computational scalability, data integration, and biological interpretation. Future developments should focus on incorporating temporal and spatial dynamics, improving model interpretability, and establishing standardized evaluation frameworks.
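Of the four method families, network propagation/diffusion is the easiest to sketch: scores from seed nodes (for example, known drug targets) are spread over a molecular network by a random walk with restart. The toy adjacency matrix, seed vector, and restart probability below are all made up.

```python
# Toy network propagation (random walk with restart) over a five-gene network.
import numpy as np

genes = ["G1", "G2", "G3", "G4", "G5"]
A = np.array([[0, 1, 1, 0, 0],           # hypothetical gene-gene interactions
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0, keepdims=True)     # column-normalized transition matrix

seeds = np.array([1.0, 0, 0, 0, 0])      # e.g. one known drug target as the seed
p, restart = seeds.copy(), 0.3
for _ in range(100):                     # iterate until (near) convergence
    p_next = (1 - restart) * W @ p + restart * seeds
    if np.abs(p_next - p).sum() < 1e-9:
        break
    p = p_next

for gene, score in sorted(zip(genes, p), key=lambda t: -t[1]):
    print(f"{gene}: {score:.3f}")        # higher score = closer to the seed in the network
```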

Citations: 0
Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-24 | DOI: 10.1186/s13040-025-00440-1
Yinyao Ma, Hanlin Lv, Yanhua Ma, Xiao Wang, Longting Lv, Xuxia Liang, Lei Wang

Background: Constructing a predictive model is challenging for imbalanced medical datasets (such as preeclampsia data), particularly when employing ensemble machine learning algorithms.

Objective: This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.

Methods: Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.

Results: Our analysis confirmed the significant impact of systematic, sequential optimization of variables on the predictive performance of our models. The most effective configuration used the Inverse Weighted Gaussian Mixture Model for resampling, combined with the Gradient Boosting Decision Trees algorithm and an optimized minority-to-majority ratio of 0.09, achieving a geometric mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance.

Conclusions: This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.
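The general shape of such a pipeline (resample the minority class toward a target minority-to-majority ratio, fit a boosted ensemble, and score with imbalance-aware metrics) can be sketched as follows. Plain random oversampling stands in for the Inverse Weighted Gaussian Mixture Model, and the data and ratio handling are illustrative only.

```python
# Hedged sketch of an imbalance-aware pipeline: resample -> boost -> G-mean/MCC/AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

def oversample_to_ratio(X, y, ratio, rng):
    """Randomly duplicate minority samples until minority/majority is roughly `ratio`."""
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    target = int(ratio * len(majority))
    if target > len(minority):
        extra = rng.choice(minority, size=target - len(minority), replace=True)
        X, y = np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
    return X, y

X_res, y_res = oversample_to_ratio(X_tr, y_tr, ratio=0.09, rng=np.random.default_rng(0))
clf = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)

y_pred = clf.predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
g_mean = np.sqrt(tp / (tp + fn) * tn / (tn + fp))      # sqrt(sensitivity * specificity)
print(f"G-mean={g_mean:.3f}  MCC={matthews_corrcoef(y_te, y_pred):.3f}  "
      f"AUC={roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")
```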

Citations: 0
High-dimensional mediation analysis reveals the mediating role of physical activity patterns in genetic pathways leading to AD-like brain atrophy.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-24 | DOI: 10.1186/s13040-025-00432-1
Hanxiang Xu, Shizhuo Mu, Jingxuan Bao, Christos Davatzikos, Haochang Shou, Li Shen

Background: Alzheimer's disease (AD) is a complex disorder that affects multiple biological systems including cognition, behavior and physical health. Unfortunately, the pathogenic mechanisms behind AD are not yet clear and the treatment options are still limited. Despite the increasing number of studies examining the pairwise relationships between genetic factors, physical activity (PA), and AD, few have successfully integrated all three domains of data, which may help reveal mechanisms and impact of these genomic and phenomic factors on AD. We use high-dimensional mediation analysis as an integrative framework to study the relationships among genetic factors, PA and AD-like brain atrophy quantified by spatial patterns of brain atrophy.

Results: We integrate data from genetics, PA and neuroimaging measures collected from 13,425 UK Biobank samples to unveil the complex relationship among genetic risk factors, behavior and brain signatures in the contexts of aging and AD. Specifically, we used a composite imaging marker, the Spatial Pattern of Abnormality for Recognition of Early AD (SPARE-AD), which characterizes AD-like brain atrophy, as an outcome variable representing AD risk. Through GWAS, we identified single nucleotide polymorphisms (SNPs) significantly associated with SPARE-AD as exposure variables. We employed conventional summary statistics and functional principal component analysis to extract patterns of PA as mediators. After constructing these variables, we utilized a high-dimensional mediation analysis method, Bayesian Mediation Analysis (BAMA), to estimate potential mediating pathways between SNPs, multivariate PA signatures and SPARE-AD. BAMA incorporates a Bayesian continuous-shrinkage prior in order to select the active mediators from a large pool of candidates. We identified a total of 22 mediation pathways, indicating how genetic variants can influence SPARE-AD by altering physical activity. By comparing the results with those obtained using univariate mediation analysis, we demonstrate the advantages of high-dimensional mediation analysis methods over univariate mediation analysis.

Conclusion: Through integrative analysis of multi-omics data, we identified several mediation pathways of physical activity between genetic factors and SPARE-AD. These findings contribute to a better understanding of the pathogenic mechanisms of AD. Moreover, our research demonstrates the potential of the high-dimensional mediation analysis method in revealing the mechanisms of disease.
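In the simplest single-mediator case that BAMA generalizes, the mediated effect is estimated as the product of the exposure-to-mediator and mediator-to-outcome coefficients. The toy sketch below uses simulated data and ordinary least squares, not the UK Biobank data or the Bayesian shrinkage machinery.

```python
# Toy single-mediator, product-of-coefficients mediation sketch (simulated data, not BAMA).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
snp = rng.binomial(2, 0.3, n).astype(float)            # exposure: genotype dosage 0/1/2
pa = 0.5 * snp + rng.normal(size=n)                    # mediator: a physical-activity summary
spare_ad = 0.4 * pa + 0.1 * snp + rng.normal(size=n)   # outcome: AD-like atrophy score

alpha = LinearRegression().fit(snp.reshape(-1, 1), pa).coef_[0]           # SNP -> mediator
outcome_model = LinearRegression().fit(np.column_stack([snp, pa]), spare_ad)
beta, direct = outcome_model.coef_[1], outcome_model.coef_[0]             # mediator -> outcome, direct path

print(f"indirect (mediated) effect = alpha * beta = {alpha * beta:.3f}")
print(f"direct effect of the SNP   = {direct:.3f}")
```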

Citations: 0
Automatic detection and extraction of key resources from tables in biomedical papers.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-20 | DOI: 10.1186/s13040-025-00438-9
Ibrahim Burak Ozyurt, Anita Bandrowski

Background: Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. STAR*Methods key resource tables have increased the "findability" of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these 'resource table candidates' automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.

Methods: We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, "Table Transformer" models for table detection, and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables, pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data generated with a novel approach to alleviate row over-segmentation, significantly improving key resource extraction performance.

Results: The extraction of key resource tables in PDF files by the popular GROBID tool resulted in a Grid Table Similarity (GriTS) score of 0.12. All of our pipelines have outperformed GROBID by a large margin. Our best pipeline with table-specific language model-based row merger achieved a GriTS score of 0.90.

Conclusions: Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on BioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, annotated training and evaluation data are publicly available.
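The row over-segmentation problem that the fine-tuned table language model addresses can be pictured with a far cruder heuristic: treat a reconstructed row whose leading cell is empty as a continuation of the previous row. This is only an illustration of the failure mode, not the authors' method, and the table cells are invented.

```python
# Naive row-merge heuristic for over-segmented table rows (illustration only; cells are made up).
def merge_over_segmented_rows(rows):
    """Rows are lists of cell strings; a row whose first cell is empty is treated
    as a continuation of the previous row and folded into it cell by cell."""
    merged = []
    for row in rows:
        if merged and not row[0].strip():              # looks like a continuation line
            prev = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    prev[i] = (prev[i] + " " + cell).strip()
        else:
            merged.append(list(row))
    return merged

raw = [
    ["REAGENT or RESOURCE", "SOURCE", "IDENTIFIER"],
    ["Anti-X antibody", "VendorCo", "Cat# 12345;"],
    ["", "", "RRID: AB_0000000"],                      # identifier spilled onto a second line
]
for row in merge_over_segmented_rows(raw):
    print(row)
```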

Citations: 0
Leveraging mixed-effects regression trees for the analysis of high-dimensional longitudinal data to identify the low and high-risk subgroups: simulation study with application to genetic study.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-19 | DOI: 10.1186/s13040-025-00437-w
Mina Jahangiri, Anoshirvan Kazemnejad, Keith S Goldfeld, Maryam S Daneshpour, Mehdi Momen, Shayan Mostafaei, Davood Khalili, Mahdi Akbarzadeh

Background: The linear mixed-effects model (LME) is a conventional parametric method mainly used for analyzing longitudinal and clustered data in genetic studies. Previous studies have shown that this model can be sensitive to parametric assumptions and offers lower predictive performance than non-parametric methods such as random effects-expectation maximization (RE-EM) and unbiased RE-EM regression tree algorithms. These longitudinal regression trees utilize classification and regression trees (CART) and conditional inference trees (Ctree) to estimate the fixed-effects components of the mixed-effects model. While CART is a well-known tree algorithm, it suffers from greediness. To mitigate this issue, we used the Evtree algorithm to estimate the fixed-effects part of the LME for handling longitudinal and clustered data in genome association studies.

Methods: In this study, we propose a new non-parametric longitudinal algorithm called "Ev-RE-EM" for modeling a continuous response variable, using the Evtree algorithm to estimate the fixed-effects part of the LME. We compared its predictive performance with that of other tree algorithms, such as RE-EM and unbiased RE-EM, with and without modeling the autocorrelation structure of errors within subjects, to analyze the longitudinal data in the genetic study. The autocorrelation structures include a first-order autoregressive process, a compound symmetric structure with a constant correlation, and a general correlation matrix. The real data were obtained from the longitudinal Tehran cardiometabolic genetic study (TCGS). The data modeling used body mass index (BMI) as the phenotype and included predictor variables such as age, sex, and 25,640 single nucleotide polymorphisms (SNPs).

Results: The results demonstrated that the predictive performance of Ev-RE-EM and unbiased RE-EM was nearly similar. Additionally, the Ev-RE-EM algorithm generated smaller trees than the unbiased RE-EM algorithm, enhancing tree interpretability.

Conclusion: The results showed that the unbiased RE-EM and Ev-RE-EM algorithms outperformed the RE-EM algorithm. Since algorithm performance varies across datasets, researchers should test different algorithms on the dataset of interest and select the best-performing one. Accurately predicting and diagnosing an individual's genetic profile is crucial in medical studies. The model with the highest accuracy should be used to enhance understanding of the genetics of complex traits, improve disease prevention and diagnosis, and aid in treating complex human diseases.
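A stripped-down version of the RE-EM idea that Ev-RE-EM builds on: alternate between fitting a tree to the fixed-effects part and re-estimating subject-level random intercepts from the residuals. A CART regressor stands in for Evtree, only random intercepts are modeled, shrinkage is omitted, and the data are simulated.

```python
# Stripped-down RE-EM-style loop: tree for fixed effects + per-subject random intercepts.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_subj, n_obs = 50, 8
subj = np.repeat(np.arange(n_subj), n_obs)             # repeated measures per subject
X = rng.normal(size=(n_subj * n_obs, 3))               # e.g. age, sex, a SNP dosage
b = rng.normal(scale=1.0, size=n_subj)                 # true random intercepts
y = np.sin(X[:, 0]) + 0.5 * X[:, 2] + b[subj] + rng.normal(scale=0.3, size=len(subj))

b_hat = np.zeros(n_subj)
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
for _ in range(10):                                    # EM-like alternation
    tree.fit(X, y - b_hat[subj])                       # fixed-effects part, given intercepts
    resid = y - tree.predict(X)
    b_hat = np.array([resid[subj == i].mean() for i in range(n_subj)])  # shrinkage omitted

print("corr(true, estimated intercepts):", np.corrcoef(b, b_hat)[0, 1].round(2))
```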

Citations: 0
Unsupervised clustering based coronary artery segmentation.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-07 | DOI: 10.1186/s13040-025-00435-y
Belén Serrano-Antón, Manuel Insúa Villa, Santiago Pendón-Minguillón, Santiago Paramés-Estévez, Alberto Otero-Cacho, Diego López-Otero, Brais Díaz-Fernández, María Bastos-Fernández, José R González-Juanatey, Alberto P Muñuzuri

Background: The acquisition of 3D geometries of coronary arteries from computed tomography coronary angiography (CTCA) is crucial for clinicians, enabling visualization of lesions and supporting decision-making processes. Manual segmentation of coronary arteries is time-consuming and prone to errors. There is growing interest in automatic segmentation algorithms, particularly those based on neural networks, which require large datasets and significant computational resources for training. This paper proposes an automatic segmentation methodology based on clustering algorithms and a graph structure, which integrates data from both the clustering process and the original images.

Results: The study compares two approaches: a 2.5D version using axial, sagittal, and coronal slices (3Axis), and a perpendicular version (Perp), which uses the cross-section of each vessel. The methodology was tested on two patient groups: a test set of 10 patients and an additional set of 22 patients with clinically diagnosed lesions. The 3Axis method achieved a Dice score of 0.88 in the test set and 0.83 in the lesion set, while the Perp method obtained Dice scores of 0.81 in the test set and 0.82 in the lesion set, decreasing to 0.79 and 0.80 in the lesion region, respectively. These results are competitive with current state-of-the-art methods.

Conclusions: This clustering-based segmentation approach offers a robust framework that can be easily integrated into clinical workflows, improving both accuracy and efficiency in coronary artery analysis. Additionally, the ability to visualize clusters and graphs from any cross-section enhances the method's explainability, providing clinicians with deeper insights into vascular structures. The study demonstrates the potential of clustering algorithms for improving segmentation performance in coronary artery imaging.
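Two ingredients of the evaluation are easy to show on synthetic data: the Dice coefficient, and the basic idea of clustering voxel intensities into vessel versus background. This is not the paper's graph-based 3Axis/Perp pipeline; the image and cluster-selection rule are invented.

```python
# Dice coefficient plus a toy intensity-based k-means "segmentation" on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

def dice(a, b):
    a, b = a.astype(bool), b.astype(bool)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool)
truth[20:40, 28:36] = True                             # a fake vessel cross-section
image = rng.normal(0.2, 0.05, truth.shape)             # dark background
image[truth] = rng.normal(0.8, 0.05, truth.sum())      # vessel voxels are brighter

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(image.reshape(-1, 1))
labels = labels.reshape(truth.shape)
# call the cluster with the higher mean intensity "vessel"
vessel = labels == np.argmax([image[labels == c].mean() for c in (0, 1)])
print("Dice:", round(dice(vessel, truth), 3))
```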

Citations: 0
EnSCAN: ENsemble Scoring for prioritizing CAusative variaNts across multiplatform GWASs for late-onset Alzheimer's disease.
IF 4.0 | CAS Tier 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-03-04 | DOI: 10.1186/s13040-025-00436-x
Onur Erdogan, Cem Iyigun, Yeşim Aydın Son

Late-onset Alzheimer's disease (LOAD) is a progressive and complex neurodegenerative disorder of the aging population. LOAD is characterized by cognitive decline, such as deterioration of memory, loss of intellectual abilities, and decline in other cognitive domains, not resulting from traumatic brain injuries. Alzheimer's Disease (AD) presents a complex genetic etiology that is still unclear, which limits its early or differential diagnosis. Genome-Wide Association Studies (GWAS) enable the exploration of individual variants' statistical associations at candidate loci, but univariate analysis overlooks interactions between variants. Machine learning (ML) algorithms can capture hidden, novel, and significant patterns while considering nonlinear interactions between variants to understand the genetic predisposition for complex genetic disorders. When working across different genotyping platforms, majority voting cannot be applied because the attributes differ. Hence, a new post-ML ensemble approach was developed to select significant SNVs via multiple genotyping platforms. We proposed the EnSCAN framework, which uses a new algorithm to ensemble selected variants, even from different platforms, to prioritize candidate causative loci; this in turn helps improve ML results by combining the prior information captured from each dataset. The proposed ensemble algorithm utilizes the chromosomal locations of SNVs by mapping them to cytogenetic bands, along with the proximities between pairs and multimodel Random Forest (RF) validations, to prioritize SNVs and candidate causative genes for LOAD. The scoring method is scalable and can be applied to any multiplatform genotyping study. We present how the proposed EnSCAN scoring algorithm prioritizes candidate causative variants related to LOAD across three GWAS datasets.
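The core ensembling step, combining evidence for the same variant from several platforms or models into one priority score, can be caricatured with a simple rank-aggregation sketch. The SNV IDs, per-platform scores, and averaging rule below are hypothetical and far simpler than EnSCAN's cytogenetic-band-aware scoring.

```python
# Toy cross-platform rank aggregation for SNV prioritization (not the EnSCAN algorithm).
import pandas as pd

# Hypothetical importance scores for the same SNVs from three genotyping platforms/models.
platform_scores = {
    "platformA": {"rs429358": 0.91, "rs7412": 0.80, "rs111111": 0.20},
    "platformB": {"rs429358": 0.85, "rs222222": 0.60, "rs7412": 0.55},
    "platformC": {"rs7412": 0.70, "rs429358": 0.65, "rs333333": 0.30},
}

df = pd.DataFrame(platform_scores)                     # rows: SNVs, columns: platforms
ranks = df.rank(ascending=False)                       # rank within each platform (1 = best)
score = ranks.mean(axis=1, skipna=True)                # average rank over platforms that report the SNV
support = df.notna().sum(axis=1)                       # how many platforms report the SNV

result = pd.DataFrame({"mean_rank": score, "n_platforms": support})
print(result.sort_values(["n_platforms", "mean_rank"], ascending=[False, True]))
```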

Citations: 0