Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning
Pub Date: 2025-04-11 | DOI: 10.1186/s13040-025-00439-8 | Biodata Mining 18(1):29 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11987386/pdf/
Amr Eledkawy, Taher Hamza, Sara El-Metwally
Background: Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.
Results: The proposed system employs a multi-stage binary classification framework where each stage is customized for a specific cancer type. A majority vote feature selection process is employed by combining six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. Following the feature selection process, classifiers-including eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis-are customized for each cancer type individually or in an ensemble soft voting setup to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility of the results, the trained models and the dataset used in this study are made publicly available via the GitHub repository ( https://github.com/SaraEl-Metwally/Towards-Precision-Oncology ).
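As a concrete illustration of the majority-vote idea, here is a minimal scikit-learn sketch combining five of the six named selectors (Information Value has no standard scikit-learn implementation); the top-k cutoffs, the vote threshold, and the synthetic data are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=40, n_informative=8, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features
k = 10  # keep the top-k features per selector (illustrative)

votes = np.zeros(X.shape[1], dtype=int)

# 1) Chi-square
votes[SelectKBest(chi2, k=k).fit(X, y).get_support()] += 1
# 2) Random Forest feature importance
rf = RandomForestClassifier(random_state=0).fit(X, y)
votes[np.argsort(rf.feature_importances_)[-k:]] += 1
# 3) Extra Trees feature importance
et = ExtraTreesClassifier(random_state=0).fit(X, y)
votes[np.argsort(et.feature_importances_)[-k:]] += 1
# 4) Recursive Feature Elimination
votes[RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y).support_] += 1
# 5) L1 regularization: keep features with non-zero coefficients
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
votes[np.abs(l1.coef_).ravel() > 0] += 1

selected = np.where(votes >= 3)[0]  # majority of the five selectors used here
print(f"{len(selected)} features selected:", selected)
```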
Conclusion: The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.
{"title":"Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning.","authors":"Amr Eledkawy, Taher Hamza, Sara El-Metwally","doi":"10.1186/s13040-025-00439-8","DOIUrl":"https://doi.org/10.1186/s13040-025-00439-8","url":null,"abstract":"<p><strong>Background: </strong>Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.</p><p><strong>Results: </strong>The proposed system employs a multi-stage binary classification framework where each stage is customized for a specific cancer type. A majority vote feature selection process is employed by combining six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. Following the feature selection process, classifiers-including eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis-are customized for each cancer type individually or in an ensemble soft voting setup to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility of the results, the trained models and the dataset used in this study are made publicly available via the GitHub repository ( https://github.com/SaraEl-Metwally/Towards-Precision-Oncology ).</p><p><strong>Conclusion: </strong>The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"29"},"PeriodicalIF":4.0,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11987386/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144023569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction
Pub Date: 2025-04-04 | DOI: 10.1186/s13040-025-00443-y | Biodata Mining 18(1):28 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11969866/pdf/
Di Zhao, Wenxuan Mu, Xiangxing Jia, Shuang Liu, Yonghe Chu, Jiana Meng, Hongfei Lin
Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, replicating the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can potentially distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales. To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and enhance feature representation based on PubMedBERT. We evaluated our method on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE); it outperformed current state-of-the-art models in most few-shot scenarios, including mainstream large language models such as ChatGPT. The results confirm the effectiveness of the proposed method in data augmentation and model generalization.
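The paper's multi-scale component is built on dynamic convolution; as a simpler stand-in, the PyTorch sketch below uses fixed-kernel parallel convolutions of several widths over encoder hidden states (a PubMedBERT output would have this shape). The branch width and kernel sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes capture
    token context at several scales; branch outputs are concatenated."""
    def __init__(self, hidden=768, branch=128, kernels=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(hidden, branch, k, padding=k // 2) for k in kernels]
        )

    def forward(self, x):             # x: (batch, seq_len, hidden)
        x = x.transpose(1, 2)         # Conv1d expects (batch, channels, seq_len)
        feats = [torch.relu(b(x)) for b in self.branches]
        return torch.cat(feats, dim=1).transpose(1, 2)  # (batch, seq_len, branch * n_kernels)

h = torch.randn(2, 16, 768)          # stand-in for PubMedBERT hidden states
print(MultiScaleConv()(h).shape)     # torch.Size([2, 16, 384])
```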
{"title":"Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction.","authors":"Di Zhao, Wenxuan Mu, Xiangxing Jia, Shuang Liu, Yonghe Chu, Jiana Meng, Hongfei Lin","doi":"10.1186/s13040-025-00443-y","DOIUrl":"10.1186/s13040-025-00443-y","url":null,"abstract":"<p><p>Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, replicating the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can potentially distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales. To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and enhance feature representation based on PubMedBERT. We evaluated the experiments on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE), and the results exceeded the current state-of-the-art models in most few-shot scenarios, including mainstream large language models like ChatGPT. The results confirm the effectiveness of the proposed method in data augmentation and model generalization.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"28"},"PeriodicalIF":4.0,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11969866/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143781479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multivariate longitudinal clustering reveals neuropsychological factors as dementia predictors in an Alzheimer's disease progression study
Pub Date: 2025-03-28 | DOI: 10.1186/s13040-025-00441-0 | Biodata Mining 18(1):26 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951806/pdf/
Patrizia Ribino, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini
Dementia due to Alzheimer's disease (AD) is a multifaceted neurodegenerative disorder characterized by various cognitive and behavioral decline factors. In this work, we propose an extension of traditional k-means clustering for multivariate time series data that clusters joint trajectories of different features describing progression over time. The algorithm enables the joint analysis of various longitudinal features to explore co-occurring trajectory factors among markers indicative of cognitive decline in individuals participating in an AD progression study. By examining how multiple variables co-vary and evolve together, we identify distinct subgroups within the cohort based on their longitudinal trajectories. Our clustering method enhances the understanding of individual development across multiple dimensions and provides deeper medical insights into the trajectories of cognitive decline. In addition, the proposed algorithm selects the features that are most significant in separating clusters by considering their trajectories over time. This process, together with preliminary pre-processing of the OASIS-3 dataset, reveals an important role for some neuropsychological factors. In particular, the proposed method identified a significant profile compatible with a syndrome known as Mild Behavioral Impairment (MBI), displaying behavioral manifestations that may precede the cognitive symptoms typically observed in AD patients. The findings underscore the importance of considering multiple longitudinal features in clinical modeling, ultimately supporting more effective and individualized patient management strategies.
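To make the idea of clustering joint trajectories concrete, here is a toy sketch that applies plain k-means to stacked multivariate trajectories; the paper's algorithm extends k-means for this setting natively, whereas this sketch simply flattens each subject's visits-by-features matrix. All data and dimensions are synthetic stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_subjects, n_visits, n_features = 120, 5, 3   # e.g. cognitive scores over 5 visits

# Two synthetic subgroups: stable vs declining trajectories
stable = rng.normal(0.0, 0.3, (60, n_visits, n_features))
decline = rng.normal(0.0, 0.3, (60, n_visits, n_features)) - np.linspace(0, 2, n_visits)[None, :, None]
trajectories = np.vstack([stable, decline])

# Joint trajectories: concatenate every feature's time series per subject
X = trajectories.reshape(n_subjects, n_visits * n_features)
X = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # sizes of the recovered subgroups
```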
{"title":"Multivariate longitudinal clustering reveals neuropsychological factors as dementia predictors in an Alzheimer's disease progression study.","authors":"Patrizia Ribino, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini","doi":"10.1186/s13040-025-00441-0","DOIUrl":"https://doi.org/10.1186/s13040-025-00441-0","url":null,"abstract":"<p><p>Dementia due to Alzheimer's disease (AD) is a multifaceted neurodegenerative disorder characterized by various cognitive and behavioral decline factors. In this work, we propose an extension of the traditional k-means clustering for multivariate time series data to cluster joint trajectories of different features describing progression over time. The algorithm we propose here enables the joint analysis of various longitudinal features to explore co-occurring trajectory factors among markers indicative of cognitive decline in individuals participating in an AD progression study. By examining how multiple variables co-vary and evolve together, we identify distinct subgroups within the cohort based on their longitudinal trajectories. Our clustering method enhances the understanding of individual development across multiple dimensions and provides deeper medical insights into the trajectories of cognitive decline. In addition, the proposed algorithm is also able to make a selection of the most significant features in separating clusters by considering trajectories over time. This process, together with a preliminary pre-processing on the OASIS-3 dataset, reveals an important role of some neuropsychological factors. In particular, the proposed method has identified a significant profile compatible with a syndrome known as Mild Behavioral Impairment (MBI), displaying behavioral manifestations of individuals that may precede the cognitive symptoms typically observed in AD patients. The findings underscore the importance of considering multiple longitudinal features in clinical modeling, ultimately supporting more effective and individualized patient management strategies.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"26"},"PeriodicalIF":4.0,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951806/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network-based multi-omics integrative analysis methods in drug discovery: a systematic review
Pub Date: 2025-03-28 | DOI: 10.1186/s13040-025-00442-z | Biodata Mining 18(1):27 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11954193/pdf/
Wei Jiang, Weicai Ye, Xiaoming Tan, Yun-Juan Bao
The integration of multi-omics data from diverse high-throughput technologies has revolutionized drug discovery. While various network-based methods have been developed to integrate multi-omics data, systematic evaluation and comparison of these methods remain challenging. This review aims to analyze network-based approaches for multi-omics integration and evaluate their applications in drug discovery. We conducted a comprehensive review of the literature (2015-2024) on network-based multi-omics integration methods in drug discovery, and categorized the methods into four primary types: network propagation/diffusion, similarity-based approaches, graph neural networks, and network inference models. We also discussed the applications of these methods in three drug discovery scenarios: drug target identification, drug response prediction, and drug repurposing, and finally evaluated the performance of the methods by highlighting their advantages and limitations in specific applications. While network-based multi-omics integration has shown promise in drug discovery, challenges remain in computational scalability, data integration, and biological interpretation. Future developments should focus on incorporating temporal and spatial dynamics, improving model interpretability, and establishing standardized evaluation frameworks.
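Of the four method families, network propagation is the easiest to show compactly. Below is a generic random-walk-with-restart sketch in NumPy, a common propagation scheme on molecular networks; the restart probability and the toy network are illustrative, not drawn from any reviewed method.

```python
import numpy as np

def random_walk_with_restart(A, seeds, restart=0.5, tol=1e-8):
    """Propagate seed scores (e.g. known drug targets) over an
    interaction network until the score vector converges."""
    W = A / A.sum(axis=0, keepdims=True)        # column-normalize adjacency
    p0 = np.zeros(A.shape[0])
    p0[seeds] = 1.0 / len(seeds)                # uniform mass on seed nodes
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy 5-node network; node 0 is the seed
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
print(random_walk_with_restart(A, seeds=[0]).round(3))
```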
{"title":"Network-based multi-omics integrative analysis methods in drug discovery: a systematic review.","authors":"Wei Jiang, Weicai Ye, Xiaoming Tan, Yun-Juan Bao","doi":"10.1186/s13040-025-00442-z","DOIUrl":"https://doi.org/10.1186/s13040-025-00442-z","url":null,"abstract":"<p><p>The integration of multi-omics data from diverse high-throughput technologies has revolutionized drug discovery. While various network-based methods have been developed to integrate multi-omics data, systematic evaluation and comparison of these methods remain challenging. This review aims to analyze network-based approaches for multi-omics integration and evaluate their applications in drug discovery. We conducted a comprehensive review of literature (2015-2024) on network-based multi-omics integration methods in drug discovery, and categorized methods into four primary types: network propagation/diffusion, similarity-based approaches, graph neural networks, and network inference models. We also discussed the applications of the methods in three scenario of drug discovery, including drug target identification, drug response prediction, and drug repurposing, and finally evaluated the performance of the methods by highlighting their advantages and limitations in specific applications. While network-based multi-omics integration has shown promise in drug discovery, challenges remain in computational scalability, data integration, and biological interpretation. Future developments should focus on incorporating temporal and spatial dynamics, improving model interpretability, and establishing standardized evaluation frameworks.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"27"},"PeriodicalIF":4.0,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11954193/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data
Pub Date: 2025-03-24 | DOI: 10.1186/s13040-025-00440-1 | Biodata Mining 18(1):25 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11934807/pdf/
Yinyao Ma, Hanlin Lv, Yanhua Ma, Xiao Wang, Longting Lv, Xuxia Liang, Lei Wang
Background: Constructing a predictive model is challenging for imbalanced medical datasets (such as those for preeclampsia), particularly when employing ensemble machine learning algorithms.
Objective: This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.
Methods: Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics including the geometric mean (G-mean), Matthews correlation coefficient (MCC), average precision (AP), and area under the ROC curve (AUC) to determine the most effective configurations. Advanced statistical analyses, including OLS regression, ANOVA, and Kruskal-Wallis tests, were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.
Results: Our analysis confirmed the significant impact of systematic sequential optimization of variables on the predictive performance of our models. The most effective configuration utilized the Inverse Weighted Gaussian Mixture Model for resampling, combined with the Gradient Boosting Decision Trees (GBDT) algorithm and an optimized minority-to-majority ratio of 0.09, achieving a geometric mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance.
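A hedged sketch of the pipeline's shape: the authors' Inverse Weighted Gaussian Mixture Model resampler is not publicly packaged, so SMOTE from imbalanced-learn stands in below, combined with gradient-boosted trees and the reported 0.09 minority-to-majority ratio; the data and resulting score are synthetic.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.metrics import geometric_mean_score
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Heavily imbalanced toy data (~2% positives), echoing a preeclampsia-like prevalence
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample the minority class up to a 0.09 minority-to-majority ratio, then fit GBDT.
# imblearn's Pipeline applies the sampler during fit only, keeping the test set untouched.
pipe = Pipeline([
    ("resample", SMOTE(sampling_strategy=0.09, random_state=0)),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
print("G-mean:", round(geometric_mean_score(y_te, pipe.predict(X_te)), 4))
```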
Conclusions: This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.
{"title":"Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data.","authors":"Yinyao Ma, Hanlin Lv, Yanhua Ma, Xiao Wang, Longting Lv, Xuxia Liang, Lei Wang","doi":"10.1186/s13040-025-00440-1","DOIUrl":"10.1186/s13040-025-00440-1","url":null,"abstract":"<p><strong>Background: </strong>Constructing a predictive model is challenging in imbalanced medical dataset (such as preeclampsia), particularly when employing ensemble machine learning algorithms.</p><p><strong>Objective: </strong>This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.</p><p><strong>Methods: </strong>Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.</p><p><strong>Results: </strong>Our analysis confirmed the significant impact of systematic sequential optimization of variables on the predictive performance of our models. The most effective configuration utilized the Inverse Weighted Gaussian Mixture Model for resampling, combined with Gradient Boosting Decision Trees algorithm, and an optimized minority-to-majority ratio of 0.09, achieving a Geometric Mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance.</p><p><strong>Conclusions: </strong>This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"25"},"PeriodicalIF":4.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11934807/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143701866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-dimensional mediation analysis reveals the mediating role of physical activity patterns in genetic pathways leading to AD-like brain atrophy
Pub Date: 2025-03-24 | DOI: 10.1186/s13040-025-00432-1 | Biodata Mining 18(1):24 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11931790/pdf/
Hanxiang Xu, Shizhuo Mu, Jingxuan Bao, Christos Davatzikos, Haochang Shou, Li Shen
Background: Alzheimer's disease (AD) is a complex disorder that affects multiple biological systems including cognition, behavior and physical health. Unfortunately, the pathogenic mechanisms behind AD are not yet clear and the treatment options are still limited. Despite the increasing number of studies examining the pairwise relationships between genetic factors, physical activity (PA), and AD, few have successfully integrated all three domains of data, which may help reveal mechanisms and impact of these genomic and phenomic factors on AD. We use high-dimensional mediation analysis as an integrative framework to study the relationships among genetic factors, PA and AD-like brain atrophy quantified by spatial patterns of brain atrophy.
Results: We integrate data from genetics, PA and neuroimaging measures collected from 13,425 UK Biobank samples to unveil the complex relationship among genetic risk factors, behavior and brain signatures in the contexts of aging and AD. Specifically, we used a composite imaging marker, the Spatial Pattern of Abnormality for Recognition of Early AD (SPARE-AD), which characterizes AD-like brain atrophy, as an outcome variable to represent AD risk. Through GWAS, we identified single nucleotide polymorphisms (SNPs) significantly associated with SPARE-AD as exposure variables. We employed conventional summary statistics and functional principal component analysis to extract patterns of PA as mediators. After constructing these variables, we utilized a high-dimensional mediation analysis method, Bayesian Mediation Analysis (BAMA), to estimate potential mediating pathways between SNPs, multivariate PA signatures and SPARE-AD. BAMA incorporates Bayesian continuous shrinkage priors to select the active mediators from a large pool of candidates. We identified a total of 22 mediation pathways, indicating how genetic variants can influence SPARE-AD by altering physical activity. By comparing the results with those obtained using univariate mediation analysis, we demonstrate the advantages of high-dimensional mediation analysis over the univariate approach.
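For readers unfamiliar with mediation analysis, the NumPy sketch below shows the classic single-mediator product-of-coefficients decomposition (indirect effect = a·b) that univariate mediation relies on; BAMA generalizes this to many candidate mediators with continuous shrinkage priors. The simulated effect sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
snp = rng.binomial(2, 0.3, n).astype(float)      # genotype coded 0/1/2
pa = 0.4 * snp + rng.normal(size=n)              # mediator: physical activity pattern
spare_ad = 0.5 * pa + 0.1 * snp + rng.normal(size=n)  # outcome: AD-like atrophy score

def ols(y, X):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

a = ols(pa, [snp])[1]               # path a: SNP -> mediator
b = ols(spare_ad, [snp, pa])[2]     # path b: mediator -> outcome, adjusting for SNP
direct = ols(spare_ad, [snp, pa])[1]
print(f"indirect (a*b) = {a * b:.3f}, direct = {direct:.3f}")
```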
Conclusion: Through integrative analysis of multi-omics data, we identified several mediation pathways of physical activity between genetic factors and SPARE-AD. These findings contribute to a better understanding of the pathogenic mechanisms of AD. Moreover, our research demonstrates the potential of the high-dimensional mediation analysis method in revealing the mechanisms of disease.
{"title":"High-dimensional mediation analysis reveals the mediating role of physical activity patterns in genetic pathways leading to AD-like brain atrophy.","authors":"Hanxiang Xu, Shizhuo Mu, Jingxuan Bao, Christos Davatzikos, Haochang Shou, Li Shen","doi":"10.1186/s13040-025-00432-1","DOIUrl":"10.1186/s13040-025-00432-1","url":null,"abstract":"<p><strong>Background: </strong>Alzheimer's disease (AD) is a complex disorder that affects multiple biological systems including cognition, behavior and physical health. Unfortunately, the pathogenic mechanisms behind AD are not yet clear and the treatment options are still limited. Despite the increasing number of studies examining the pairwise relationships between genetic factors, physical activity (PA), and AD, few have successfully integrated all three domains of data, which may help reveal mechanisms and impact of these genomic and phenomic factors on AD. We use high-dimensional mediation analysis as an integrative framework to study the relationships among genetic factors, PA and AD-like brain atrophy quantified by spatial patterns of brain atrophy.</p><p><strong>Results: </strong>We integrate data from genetics, PA and neuroimaging measures collected from 13,425 UK Biobank samples to unveil the complex relationship among genetic risk factors, behavior and brain signatures in the contexts of aging and AD. Specifically, we used a composite imaging marker, Spatial Pattern of Abnormality for Recognition of Early AD (SPARE-AD) that characterizes AD-like brain atrophy, as an outcome variable to represent AD risk. Through GWAS, we identified single nucleotide polymorphisms (SNPs) that are significantly associated with SPARE-AD as exposure variables. We employed conventional summary statistics and functional principal component analysis to extract patterns of PA as mediators. After constructing these variables, we utilized a high-dimensional mediation analysis method, Bayesian Mediation Analysis (BAMA), to estimate potential mediating pathways between SNPs, multivariate PA signatures and SPARE-AD. BAMA incorporates Bayesian continuous shrinkage prior to select the active mediators from a large pool of candidates. We identified a total of 22 mediation pathways, indicating how genetic variants can influence SPARE-AD by altering physical activity. By comparing the results with those obtained using univariate mediation analysis, we demonstrate the advantages of high-dimensional mediation analysis methods over univariate mediation analysis.</p><p><strong>Conclusion: </strong>Through integrative analysis of multi-omics data, we identified several mediation pathways of physical activity between genetic factors and SPARE-AD. These findings contribute to a better understanding of the pathogenic mechanisms of AD. 
Moreover, our research demonstrates the potential of the high-dimensional mediation analysis method in revealing the mechanisms of disease.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"24"},"PeriodicalIF":4.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11931790/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143701870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic detection and extraction of key resources from tables in biomedical papers
Pub Date: 2025-03-20 | DOI: 10.1186/s13040-025-00438-9 | Biodata Mining 18(1):23 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924859/pdf/
Ibrahim Burak Ozyurt, Anita Bandrowski
Background: Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. STAR*Methods key resource tables have increased the "findability" of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these 'resource table candidates' automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.
Methods: We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, "Table Transformer" models for table detection, and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables, pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data generated with a novel approach to alleviate row over-segmentation, significantly improving key resource extraction performance.
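For context, the publicly released Table Transformer checkpoints that such pipelines build on can be run via Hugging Face transformers; the sketch below shows table detection on one rendered page. The image path is hypothetical, and this is not the authors' full pipeline, only the detection step.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Pre-trained table *detection* checkpoint; structure recognition uses a
# separate checkpoint ("microsoft/table-transformer-structure-recognition").
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

page = Image.open("page.png").convert("RGB")   # a rendered PDF page (hypothetical path)
inputs = processor(images=page, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold, in pixel coordinates
target_sizes = torch.tensor([page.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
for score, box in zip(results["scores"], results["boxes"]):
    print(f"table @ {box.tolist()} (score {score:.2f})")
```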
Results: The extraction of key resource tables in PDF files by the popular GROBID tool resulted in a Grid Table Similarity (GriTS) score of 0.12. All of our pipelines have outperformed GROBID by a large margin. Our best pipeline with table-specific language model-based row merger achieved a GriTS score of 0.90.
Conclusions: Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on BioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, annotated training and evaluation data are publicly available.
{"title":"Automatic detection and extraction of key resources from tables in biomedical papers.","authors":"Ibrahim Burak Ozyurt, Anita Bandrowski","doi":"10.1186/s13040-025-00438-9","DOIUrl":"10.1186/s13040-025-00438-9","url":null,"abstract":"<p><strong>Background: </strong>Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. STAR*Methods key resource tables have increased the \"findability\" of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these 'resource table candidates' automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.</p><p><strong>Methods: </strong>We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, \"Table Transformer\" models for table detection, and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data generated with a novel approach to alleviate row over-segmentation significantly improving key resource extraction performance.</p><p><strong>Results: </strong>The extraction of key resource tables in PDF files by the popular GROBID tool resulted in a Grid Table Similarity (GriTS) score of 0.12. All of our pipelines have outperformed GROBID by a large margin. Our best pipeline with table-specific language model-based row merger achieved a GriTS score of 0.90.</p><p><strong>Conclusions: </strong>Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on BioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, annotated training and evaluation data are publicly available.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"23"},"PeriodicalIF":4.0,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924859/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143671632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging mixed-effects regression trees for the analysis of high-dimensional longitudinal data to identify the low and high-risk subgroups: simulation study with application to genetic study
Pub Date: 2025-03-19 | DOI: 10.1186/s13040-025-00437-w | Biodata Mining 18(1):22 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924713/pdf/
Mina Jahangiri, Anoshirvan Kazemnejad, Keith S Goldfeld, Maryam S Daneshpour, Mehdi Momen, Shayan Mostafaei, Davood Khalili, Mahdi Akbarzadeh
Background: The linear mixed-effects model (LME) is a conventional parametric method mainly used for analyzing longitudinal and clustered data in genetic studies. Previous studies have shown that this model can be sensitive to parametric assumptions and offers lower predictive performance than non-parametric methods such as the random effects-expectation maximization (RE-EM) and unbiased RE-EM regression tree algorithms. These longitudinal regression trees utilize classification and regression trees (CART) and conditional inference trees (Ctree) to estimate the fixed-effects components of the mixed-effects model. While CART is a well-known tree algorithm, it suffers from greediness. To mitigate this issue, we used the Evtree algorithm to estimate the fixed-effects part of the LME for handling longitudinal and clustered data in genome association studies.
Methods: In this study, we propose a new non-parametric longitudinal algorithm called "Ev-RE-EM" for modeling a continuous response variable, using the Evtree algorithm to estimate the fixed-effects part of the LME. We compared its predictive performance with that of other tree algorithms, such as RE-EM and unbiased RE-EM, with and without modeling the autocorrelation between within-subject errors, for analyzing the longitudinal data in the genetic study. The autocorrelation structures include a first-order autoregressive process, a compound symmetric structure with a constant correlation, and a general correlation matrix. The real data were obtained from the longitudinal Tehran cardiometabolic genetic study (TCGS). The data modeling used body mass index (BMI) as the phenotype and included predictor variables such as age, sex, and 25,640 single nucleotide polymorphisms (SNPs).
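To clarify the alternating estimation these tree algorithms share, here is a stripped-down RE-EM-style loop in Python. It swaps in scikit-learn's CART (Evtree is an R package) and estimates only random intercepts without a full LME, so it is a didactic sketch of the iteration, not the Ev-RE-EM algorithm itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def re_em_tree(X, y, groups, n_iter=20):
    """Minimal RE-EM-style loop: alternate between (1) fitting a regression
    tree on responses with current random intercepts removed and
    (2) re-estimating per-subject random intercepts from the tree residuals."""
    b = {g: 0.0 for g in np.unique(groups)}
    tree = None
    for _ in range(n_iter):
        adj = y - np.array([b[g] for g in groups])        # remove random effects
        tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, adj)
        resid = y - tree.predict(X)
        for g in b:                                       # simple, shrinkage-free update
            b[g] = resid[groups == g].mean()
    return tree, b

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 5)                      # 30 subjects, 5 visits each
X = rng.normal(size=(150, 4))
subj_eff = rng.normal(0, 1, 30)[groups]                   # true random intercepts
y = 2.0 * (X[:, 0] > 0) + subj_eff + rng.normal(0, 0.3, 150)
tree, b = re_em_tree(X, y, groups)
print("estimated intercepts (first 5 subjects):", [round(b[g], 2) for g in range(5)])
```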
Results: The results demonstrated that the predictive performances of Ev-RE-EM and unbiased RE-EM were nearly identical. Additionally, the Ev-RE-EM algorithm generated smaller trees than the unbiased RE-EM algorithm, enhancing tree interpretability.
Conclusion: The results showed that the unbiased RE-EM and Ev-RE-EM algorithms outperformed the RE-EM algorithm. Since algorithm performance varies across datasets, researchers should test different algorithms on the dataset of interest and select the best-performing one. Accurately predicting and diagnosing an individual's genetic profile is crucial in medical studies. The model with the highest accuracy should be used to enhance understanding of the genetics of complex traits, improve disease prevention and diagnosis, and aid in treating complex human diseases.
{"title":"Leveraging mixed-effects regression trees for the analysis of high-dimensional longitudinal data to identify the low and high-risk subgroups: simulation study with application to genetic study.","authors":"Mina Jahangiri, Anoshirvan Kazemnejad, Keith S Goldfeld, Maryam S Daneshpour, Mehdi Momen, Shayan Mostafaei, Davood Khalili, Mahdi Akbarzadeh","doi":"10.1186/s13040-025-00437-w","DOIUrl":"10.1186/s13040-025-00437-w","url":null,"abstract":"<p><strong>Background: </strong>The linear mixed-effects model (LME) is a conventional parametric method mainly used for analyzing longitudinal and clustered data in genetic studies. Previous studies have shown that this model can be sensitive to parametric assumptions and provides less predictive performance than non-parametric methods such as random effects-expectation maximization (RE-EM) and unbiased RE-EM regression tree algorithms. These longitudinal regression trees utilize classification and regression trees (CART) and conditional inference trees (Ctree) to estimate the fixed-effects components of the mixed-effects model. While CART is a well-known tree algorithm, it suffers from greediness. To mitigate this issue, we used the Evtree algorithm to estimate the fixed-effects part of the LME for handling longitudinal and clustered data in genome association studies.</p><p><strong>Methods: </strong>In this study, we propose a new non-parametric longitudinal-based algorithm called \"Ev-RE-EM\" for modeling a continuous response variable using the Evtree algorithm to estimate the fixed-effects part of the LME. We compared its predictive performance with other tree algorithms, such as RE-EM and unbiased RE-EM, with and without considering the structure for autocorrelation between errors within subjects to analyze the longitudinal data in the genetic study. The autocorrelation structures include a first-order autoregressive process, a compound symmetric structure with a constant correlation, and a general correlation matrix. The real data was obtained from the longitudinal Tehran cardiometabolic genetic study (TCGS). The data modeling used body mass index (BMI) as the phenotype and included predictor variables such as age, sex, and 25,640 single nucleotide polymorphisms (SNPs).</p><p><strong>Results: </strong>The results demonstrated that the predictive performance of Ev-RE-EM and unbiased RE-EM was nearly similar. Additionally, the Ev-RE-EM algorithm generated smaller trees than the unbiased RE-EM algorithm, enhancing tree interpretability.</p><p><strong>Conclusion: </strong>The results showed that the unbiased RE-EM and Ev-RE-EM algorithms outperformed the RE-EM algorithm. Since algorithm performance varies across datasets, researchers should test different algorithms on the dataset of interest and select the best-performing one. Accurately predicting and diagnosing an individual's genetic profile is crucial in medical studies. 
The model with the highest accuracy should be used to enhance understanding of the genetics of complex traits, improve disease prevention and diagnosis, and aid in treating complex human diseases.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"22"},"PeriodicalIF":4.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924713/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143665028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised clustering based coronary artery segmentation
Pub Date: 2025-03-07 | DOI: 10.1186/s13040-025-00435-y | Biodata Mining 18(1):21 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11887207/pdf/
Belén Serrano-Antón, Manuel Insúa Villa, Santiago Pendón-Minguillón, Santiago Paramés-Estévez, Alberto Otero-Cacho, Diego López-Otero, Brais Díaz-Fernández, María Bastos-Fernández, José R González-Juanatey, Alberto P Muñuzuri
Background: The acquisition of 3D geometries of coronary arteries from computed tomography coronary angiography (CTCA) is crucial for clinicians, enabling visualization of lesions and supporting decision-making processes. Manual segmentation of coronary arteries is time-consuming and prone to errors. There is growing interest in automatic segmentation algorithms, particularly those based on neural networks, which require large datasets and significant computational resources for training. This paper proposes an automatic segmentation methodology based on clustering algorithms and a graph structure, which integrates data from both the clustering process and the original images.
Results: The study compares two approaches: a 2.5D version using axial, sagittal, and coronal slices (3Axis), and a perpendicular version (Perp), which uses the cross-section of each vessel. The methodology was tested on two patient groups: a test set of 10 patients and an additional set of 22 patients with clinically diagnosed lesions. The 3Axis method achieved a Dice score of 0.88 in the test set and 0.83 in the lesion set, while the Perp method obtained Dice scores of 0.81 in the test set and 0.82 in the lesion set, decreasing to 0.79 and 0.80 in the lesion region, respectively. These results are competitive with current state-of-the-art methods.
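Since the results are reported as Dice scores, a small reference implementation may help; this is the standard overlap metric for binary segmentation masks, shown here on synthetic toy masks.

```python
import numpy as np

def dice(pred, truth):
    """Dice similarity: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

a = np.zeros((8, 8), int); a[2:6, 2:6] = 1     # predicted vessel mask
b = np.zeros((8, 8), int); b[3:7, 2:6] = 1     # ground-truth mask
print(round(dice(a, b), 3))                    # 0.75
```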
Conclusions: This clustering-based segmentation approach offers a robust framework that can be easily integrated into clinical workflows, improving both accuracy and efficiency in coronary artery analysis. Additionally, the ability to visualize clusters and graphs from any cross-section enhances the method's explainability, providing clinicians with deeper insights into vascular structures. The study demonstrates the potential of clustering algorithms for improving segmentation performance in coronary artery imaging.
{"title":"Unsupervised clustering based coronary artery segmentation.","authors":"Belén Serrano-Antón, Manuel Insúa Villa, Santiago Pendón-Minguillón, Santiago Paramés-Estévez, Alberto Otero-Cacho, Diego López-Otero, Brais Díaz-Fernández, María Bastos-Fernández, José R González-Juanatey, Alberto P Muñuzuri","doi":"10.1186/s13040-025-00435-y","DOIUrl":"10.1186/s13040-025-00435-y","url":null,"abstract":"<p><strong>Background: </strong>The acquisition of 3D geometries of coronary arteries from computed tomography coronary angiography (CTCA) is crucial for clinicians, enabling visualization of lesions and supporting decision-making processes. Manual segmentation of coronary arteries is time-consuming and prone to errors. There is growing interest in automatic segmentation algorithms, particularly those based on neural networks, which require large datasets and significant computational resources for training. This paper proposes an automatic segmentation methodology based on clustering algorithms and a graph structure, which integrates data from both the clustering process and the original images.</p><p><strong>Results: </strong>The study compares two approaches: a 2.5D version using axial, sagittal, and coronal slices (3Axis), and a perpendicular version (Perp), which uses the cross-section of each vessel. The methodology was tested on two patient groups: a test set of 10 patients and an additional set of 22 patients with clinically diagnosed lesions. The 3Axis method achieved a Dice score of 0.88 in the test set and 0.83 in the lesion set, while the Perp method obtained Dice scores of 0.81 in the test set and 0.82 in the lesion set, decreasing to 0.79 and 0.80 in the lesion region, respectively. These results are competitive with current state-of-the-art methods.</p><p><strong>Conclusions: </strong>This clustering-based segmentation approach offers a robust framework that can be easily integrated into clinical workflows, improving both accuracy and efficiency in coronary artery analysis. Additionally, the ability to visualize clusters and graphs from any cross-section enhances the method's explainability, providing clinicians with deeper insights into vascular structures. The study demonstrates the potential of clustering algorithms for improving segmentation performance in coronary artery imaging.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"21"},"PeriodicalIF":4.0,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11887207/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143587591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EnSCAN: ENsemble Scoring for prioritizing CAusative variaNts across multiplatform GWASs for late-onset alzheimer's disease
Pub Date: 2025-03-04 | DOI: 10.1186/s13040-025-00436-x | Biodata Mining 18(1):20 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11881353/pdf/
Onur Erdogan, Cem Iyigun, Yeşim Aydın Son
Late-onset Alzheimer's disease (LOAD) is a progressive and complex neurodegenerative disorder of the aging population. LOAD is characterized by cognitive decline, such as deterioration of memory, loss of intellectual abilities, and impairment of other cognitive domains resulting from traumatic brain injuries. Alzheimer's Disease (AD) presents a complex genetic etiology that is still unclear, which limits its early or differential diagnosis. Genome-Wide Association Studies (GWAS) enable the exploration of individual variants' statistical interactions at candidate loci, but univariate analysis overlooks interactions between variants. Machine learning (ML) algorithms can capture hidden, novel, and significant patterns while considering nonlinear interactions between variants to understand the genetic predisposition for complex genetic disorders. When working across different genotyping platforms, majority voting cannot be applied because the attributes differ. Hence, a new post-ML ensemble approach was developed to select significant SNVs across multiple genotyping platforms. We propose the EnSCAN framework, which uses a new algorithm to ensemble selected variants even from different platforms to prioritize candidate causative loci, thereby improving ML results by combining the prior information captured from each dataset. The proposed ensemble algorithm utilizes the chromosomal locations of SNVs by mapping them to cytogenetic bands, along with the proximities between pairs and multimodel Random Forest (RF) validations, to prioritize SNVs and candidate causative genes for LOAD. The scoring method is scalable and can be applied to any multiplatform genotyping study. We present how the proposed EnSCAN scoring algorithm prioritizes candidate causative variants related to LOAD among three GWAS datasets.
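The published scoring algorithm is not reproduced here, but the toy sketch below illustrates two of its ingredients: pooling per-platform SNV scores by variant id, and rewarding co-location within a cytogenetic band. All ids, scores, and the bonus weighting are invented for illustration.

```python
from collections import defaultdict

# Toy cross-platform aggregation (illustrative only, not the published EnSCAN scoring).
platform_scores = {
    "chip_A": {"rs429358": 0.9, "rs7412": 0.6, "rs111": 0.2},
    "chip_B": {"rs429358": 0.8, "rs222": 0.5},
}
band_of = {"rs429358": "19q13.32", "rs7412": "19q13.32", "rs111": "2p16.1", "rs222": "19q13.32"}

# Pool scores for the same SNV across platforms
score = defaultdict(float)
for scores in platform_scores.values():
    for snv, s in scores.items():
        score[snv] += s

# SNVs sharing a cytogenetic band reinforce one another (invented bonus weight)
band_support = defaultdict(int)
for snv in score:
    band_support[band_of[snv]] += 1
for snv in score:
    score[snv] *= 1 + 0.1 * (band_support[band_of[snv]] - 1)

print(sorted(score.items(), key=lambda kv: -kv[1]))
```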
{"title":"EnSCAN: ENsemble Scoring for prioritizing CAusative variaNts across multiplatform GWASs for late-onset alzheimer's disease.","authors":"Onur Erdogan, Cem Iyigun, Yeşim Aydın Son","doi":"10.1186/s13040-025-00436-x","DOIUrl":"10.1186/s13040-025-00436-x","url":null,"abstract":"<p><p>Late-onset Alzheimer's disease (LOAD) is a progressive and complex neurodegenerative disorder of the aging population. LOAD is characterized by cognitive decline, such as deterioration of memory, loss of intellectual abilities, and other cognitive domains resulting from due to traumatic brain injuries. Alzheimer's Disease (AD) presents a complex genetic etiology that is still unclear, which limits its early or differential diagnosis. The Genome-Wide Association Studies (GWAS) enable the exploration of individual variants' statistical interactions at candidate loci, but univariate analysis overlooks interactions between variants. Machine learning (ML) algorithms can capture hidden, novel, and significant patterns while considering nonlinear interactions between variants to understand the genetic predisposition for complex genetic disorders. When working on different platforms, majority voting cannot be applied because the attributes differ. Hence, a new post-ML ensemble approach was developed to select significant SNVs via multiple genotyping platforms. We proposed the EnSCAN framework using a new algorithm to ensemble selected variants even from different platforms to prioritize candidate causative loci, which consequently helps improve ML results by combining the prior information captured from each dataset. The proposed ensemble algorithm utilizes the chromosomal locations of SNVs by mapping to cytogenetic bands, along with the proximities between pairs and multimodel Random Forest (RF) validations to prioritize SNVs and candidate causative genes for LOAD. The scoring method is scalable and can be applied to any multiplatform genotyping study. We present how the proposed EnSCAN scoring algorithm prioritizes candidate causative variants related to LOAD among three GWAS datasets.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"20"},"PeriodicalIF":4.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11881353/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}