Pub Date: 2025-11-18 | DOI: 10.1186/s13040-025-00503-3
Abrar Yaqoob, Mushtaq Ahmad Mir, R Vijaya Lakshmi, Tejaswini Pradhan, G V V Jagannadha Rao, Ghanshyam G Tejani, Mohd Asif Shah
High-dimensional gene expression datasets pose a major challenge in cancer classification due to redundancy, noise, and the risk of overfitting. To address these issues, this study proposes a hybrid framework that integrates the Dung Beetle Optimizer (DBO) for feature selection with Support Vector Machines (SVM) for classification. DBO, a recently developed nature-inspired algorithm, effectively identifies informative and non-redundant subsets of genes by simulating dung beetles' foraging, rolling, obstacle avoidance, stealing, and breeding behaviors. The selected features are then classified using SVM with Radial Basis Function (RBF) kernels, which provide robust decision boundaries even in high-dimensional spaces. Extensive experiments were conducted on publicly available cancer-related gene expression datasets, covering binary, ternary, and quaternary classification tasks. Results show that the proposed DBO-SVM framework achieves 97.4-98.0% accuracy on binary datasets and 84-88% accuracy on multiclass datasets, with balanced Precision, Recall, and F1-scores. These findings highlight the method's ability to enhance classification performance while reducing computational cost and improving biological interpretability. The proposed hybrid model demonstrates strong potential as an efficient and reliable tool for precision medicine and biomedical data analysis.
Title: Optimizing accuracy and dimensionality: a swarm intelligence strategy for robust cancer genomics classification. Biodata Mining, vol. 18, no. 1, p. 79. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625619/pdf/
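The wrapper-style selection described in this abstract scores a candidate gene subset by how well a classifier performs on it. The sketch below illustrates only that scoring step and is not the authors' implementation: the weighted fitness and its `alpha` parameter, the nearest-centroid stand-in classifier (used here in place of the RBF-kernel SVM), and the toy data are all assumptions, and the optimizer that proposes masks (DBO in the paper) is omitted.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier,
    restricted to the features where mask[j] == 1."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    Xs = [[row[j] for j in cols] for row in X]
    correct = 0
    for i in range(len(Xs)):
        best_label, best_dist = None, float("inf")
        for label in sorted(set(y)):
            members = [Xs[k] for k in range(len(Xs)) if k != i and y[k] == label]
            if not members:
                continue
            c = centroid(members)
            dist = sum((a - b) ** 2 for a, b in zip(Xs[i], c))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(Xs)

def fitness(X, y, mask, alpha=0.9):
    """Lower is better: weighted sum of error rate and fraction of features kept.
    The weighting scheme is an assumed, commonly used form, not taken from the paper."""
    error = 1.0 - loo_accuracy(X, y, mask)
    kept = sum(mask) / len(mask)
    return alpha * error + (1 - alpha) * kept

# Toy data (made up): feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.1, 5.0, 2.0], [0.2, 1.0, 9.0], [0.0, 7.0, 4.0],
     [0.9, 6.0, 3.0], [1.0, 2.0, 8.0], [0.8, 4.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]

informative_only = fitness(X, y, [1, 0, 0])
all_features = fitness(X, y, [1, 1, 1])
print(informative_only, all_features)
```

On this toy data the single informative feature scores better (lower) than the full feature set, because the size penalty alone already favors the smaller subset once it classifies perfectly.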
Pub Date: 2025-11-18 | DOI: 10.1186/s13040-025-00496-z
Yanhong Xiong, Xiaoyan Zhou, Qi Wang
Background: Understanding cellular mechanical properties is crucial for investigating cell fate determination, embryonic development, and disease progression. Traditional methods of measuring cellular mechanical properties, such as atomic force microscopy, are time-consuming, labor-intensive, and low-throughput. Computational models which can capture the relationship between mechanosensitive gene expression, a readily accessible and cost-effective data source, and cellular mechanical properties offer promising alternatives.
Results: In this study, we identified mechanosensitive genes from 104 cell lines, using RNA-seq data and corresponding elastic modulus from the MechanoBase database. Several statistical learning models were tested and gradient boosting regression emerged as the most effective, outperforming other models in accuracy. We termed this model MechanoGEPred. The model demonstrated its ability to predict elastic modulus variations across tissue samples, single cells, and tissue spatial domains, capturing complex relationships between gene expression and mechanical properties.
Conclusions: By enabling predictions at multiple biological levels, MechanoGEPred offers a useful framework for inferring cellular elastic modulus directly from gene expression data. The model reveals biologically meaningful patterns and context-dependent differences, suggesting potential applications in biomechanics and cancer research, and providing a proof of concept for studying mechanical heterogeneity and its role in health and disease.
Title: Computational prediction of cellular elastic modulus from mechanosensitive gene expression at multiple biological levels. Biodata Mining, vol. 18, no. 1, p. 80. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625330/pdf/
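Gradient boosting regression, the model family reported as most effective in this abstract, fits a sequence of weak learners to the residuals of the current prediction. The following is a minimal sketch under stated assumptions, not MechanoGEPred itself: it uses squared loss with one-split decision stumps, and the expression values and elastic moduli are invented toy numbers.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]  # (feature index, threshold, left mean, right mean)

def predict_stump(stump, row):
    j, thr, lm, rm = stump
    return lm if row[j] <= thr else rm

def fit_gbr(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate

def predict_gbr(model, row):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

# Toy data (made up): two "gene expression" features, one "elastic modulus" target.
X = [[1.0, 0.2], [2.0, 0.1], [3.0, 0.4], [4.0, 0.3], [5.0, 0.6], [6.0, 0.5]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

model = fit_gbr(X, y)
baseline = sum(y) / len(y)
mse_baseline = sum((t - baseline) ** 2 for t in y) / len(y)
mse_model = sum((t - predict_gbr(model, row)) ** 2 for t, row in zip(y, X)) / len(y)
print(mse_baseline, mse_model)
```

The comparison here is on training data only; a real evaluation would need held-out cell lines, which the toy example does not attempt.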
Pub Date: 2025-11-13 | DOI: 10.1186/s13040-025-00502-4
Jason H Moore, Nicholas P Tatonetti
Title: From prompt engineering to agent engineering: expanding the AI toolbox with autonomous agentic AI collaborators for biomedical discovery. Biodata Mining, vol. 18, no. 1, p. 78. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12613637/pdf/
Pub Date: 2025-11-05 | DOI: 10.1186/s13040-025-00492-3
Dat Thanh Pham, Khai Quang Tran, Viet Anh Nguyen
Background and objective: Estimating individualized causal effects plays a vital role in data-driven decision-making, especially in high-risk domains such as public health. However, current causal inference models often lack flexibility and generalizability due to the tight coupling between representation learning and effect estimation. This study aims to develop a modular and adaptive framework to enhance the analysis of individualized causal effects in complex health data.
Methods: We propose CAUSALRLSTACK, a modular framework designed to separate representation learning from causal effect estimation. In practice, the model uses a memory-augmented Transformer (TITAN) to capture complex, individualized representations. It is further paired with a doubly robust estimator (DRLearner) to improve treatment effect estimation. A reinforcement learning agent adjusts how much each component contributes by assigning instance-specific weights. This adaptive weighting process improves the model's ability to generalize across different populations. Input features are derived from causal graphs, automatically chosen between an expert-defined graph and one discovered from data. To evaluate performance, we applied the framework to two publicly available HIV datasets that reflect community-level testing behavior and post-intervention clinical outcomes.
Results: CAUSALRLSTACK outperforms six state-of-the-art causal inference models across both datasets, achieving the highest accuracy (0.861 and 0.855), F1-Score (0.845 and 0.839), and AUC-ROC (0.897 and 0.892). It also achieves the lowest predictive uncertainty (0.093 and 0.092), indicating robust performance in estimating treatment effects.
Conclusions: The proposed framework offers a flexible and effective solution for individualized causal inference. Its modular architecture and reinforcement learning-based weighting strategy enable adaptive, data-driven estimation across diverse populations. Strong experimental results demonstrate the potential of the framework to advance individualized causal inference in health data and provide a practical basis for designing personalized intervention strategies in HIV and broader public health domains.
Title: CAUSALRLSTACK: adaptive balancing of deep representation and causal effect estimation with application to HIV-related health data. Biodata Mining, vol. 18, no. 1, p. 77. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12587697/pdf/
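The "doubly robust" property mentioned in this abstract can be illustrated with the standard augmented inverse-probability-weighted (AIPW) score, which combines an outcome model with a propensity model and stays consistent if either one is correct. This is a generic sketch, not the paper's DRLearner configuration, which the abstract does not describe; the function name `aipw_ate` and all numbers are illustrative assumptions.

```python
def aipw_ate(y, t, propensity, mu0, mu1):
    """Doubly robust (AIPW) estimate of the average treatment effect.

    y          observed outcomes
    t          treatment indicators (1 = treated, 0 = control)
    propensity estimated P(t = 1 | x) for each unit
    mu0, mu1   outcome-model predictions under control / treatment
    Returns (ATE estimate, per-unit doubly robust scores).
    """
    scores = []
    for yi, ti, ei, m0, m1 in zip(y, t, propensity, mu0, mu1):
        correction_treated = ti * (yi - m1) / ei
        correction_control = (1 - ti) * (yi - m0) / (1 - ei)
        scores.append(m1 - m0 + correction_treated - correction_control)
    return sum(scores) / len(scores), scores

# Toy data (made up): randomized treatment with propensity 0.5 for every unit.
y = [3.0, 1.0, 4.0, 2.0]
t = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]

# Case 1: outcome models match the data, so the corrections vanish.
ate_good_outcome, _ = aipw_ate(y, t, e, mu0=[1.0, 1.0, 2.0, 2.0], mu1=[3.0, 3.0, 4.0, 4.0])

# Case 2: outcome models are uninformative (all zeros); the correct propensity
# carries the estimate, which reduces to inverse-probability weighting.
ate_bad_outcome, _ = aipw_ate(y, t, e, mu0=[0.0] * 4, mu1=[0.0] * 4)

print(ate_good_outcome, ate_bad_outcome)
```

Both cases give the same estimate on this toy data, which is the behavior the term "doubly robust" refers to. The per-unit scores are what a downstream learner would regress on covariates to obtain individualized effects.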
Pub Date: 2025-10-31 | DOI: 10.1186/s13040-025-00490-5
Tongfei Shen, Yifei Sheng, Wan Nie, Shuo Yang, Kaiqi Li, Ziwei Ma, Zhao Ling, Bowen Tan, Xikang Feng, Miaozhe Huo
Background: Systemic Lupus Erythematosus (SLE) is a complex autoimmune disorder involving dysregulation of multiple immune components, including T cells. Aberrant T-cell activity contributes significantly to the immune pathology of SLE, for instance, by facilitating autoantibody production. The Complementarity Determining Region 3 (CDR3) of the TCRβ chain is pivotal for T-cell specificity, thereby positioning it as a promising target for enhancing diagnostic accuracy and gaining deeper mechanistic insights into SLE. To address these diagnostic limitations in SLE, our team developed DeepTAPE, a deep learning-based diagnostic framework that utilizes CDR3 sequences to achieve robust classification performance for SLE.
Results: Building upon the foundation established by DeepTAPE, we devised a novel diagnostic approach that effectively integrates a TCR classifier to quantify SLE disease activity. Furthermore, this methodology employs advanced deep learning models for the bio-mining of disease-associated motifs that serve as potential biomarkers. As a result, this approach generates an autoimmune risk score (ARS) indicative of SLE probability. Notably, this ARS metric exhibited a strong correlation with disease activity, functioning as a quantitative clinical marker that complements traditional indices such as the SLE Disease Activity Index (SLEDAI). In addition, through a comprehensive analysis of immune repertoire data, we identified SLE-specific amino acid motifs within the CDR3 sequences, including critical 3-mer and gapped-mer oligopeptides. These motifs demonstrated high efficacy in SLE classification, achieving an area under the curve (AUC) of 0.908, thereby significantly outperforming other candidate biomarkers. Moreover, our model revealed potential SLE-associated antigens and genes, such as CD109 and INS, which provide new insights into the immunological mechanisms underlying the disease.
Conclusion: This study highlights the potential of DeepTAPE as a supportive tool for biomarker discovery and assessing SLE disease activity, which complements traditional diagnostic approaches. By deepening our understanding of the immunological characteristics and mechanisms associated with SLE, this work lays a foundation for advancing targeted therapies and personalized medicine in autoimmune diseases. Consequently, our findings may pave the way for improved patient outcomes and more effective treatment strategies in the management of SLE.
Title: Deep learning-driven TCRβ repertoire analysis enhances diagnosis and enables mining of immunological biomarkers in systemic lupus erythematosus. Biodata Mining, vol. 18, no. 1, p. 76. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12577242/pdf/
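The 3-mer and gapped-mer motifs mentioned in this abstract can be pictured as short windows slid along each CDR3 amino acid sequence. The sketch below is an illustration under assumptions, not the DeepTAPE pipeline: the gapped-mer definition (first and last residue of a window, wildcards in between) is one of several possible conventions, and the sequences are invented.

```python
from collections import Counter

def kmers(seq, k=3):
    """All contiguous length-k substrings of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_mers(seq, width=3):
    """Windows of `width` residues with interior positions replaced by '.'.
    This definition is an assumption made for illustration."""
    return [seq[i] + "." * (width - 2) + seq[i + width - 1]
            for i in range(len(seq) - width + 1)]

def motif_presence(seqs, extractor):
    """Number of sequences in which each motif occurs at least once."""
    counts = Counter()
    for s in seqs:
        counts.update(set(extractor(s)))
    return counts

# Toy CDR3 sequences (made up), split into two hypothetical groups.
case_seqs = ["CASSLGQYF", "CASSLGAYF"]
control_seqs = ["CASSPDRYF", "CASRPDTYF"]

case_3mers = motif_presence(case_seqs, kmers)
control_3mers = motif_presence(control_seqs, kmers)
case_gapped = motif_presence(case_seqs, gapped_mers)
control_gapped = motif_presence(control_seqs, gapped_mers)

# Motifs seen in every case sequence and in no control sequence.
case_only = sorted(m for m, c in case_3mers.items()
                   if c == len(case_seqs) and control_3mers[m] == 0)
print(case_only)
```

Counting presence per sequence rather than raw occurrences keeps long sequences from dominating; a real analysis would follow this with a statistical test or a classifier over the motif counts.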
Pub Date: 2025-10-30 | DOI: 10.1186/s13040-025-00485-2
Abbas Rehman, Gu Naijie, Stephen Ojo, Thomas I Nathaniel, Nagwan Abdel Samee, Muhammad Umer, Mona M Jamjoom
Diabetic retinopathy (DR) is a primary cause of blindness globally, and its treatment and management depend on accurate and timely identification. Current approaches for DR detection and segmentation often fall short in accuracy and robustness, highlighting the need for advanced computational methods. In this study, we propose a deep learning model, the Fundus Images Segmentation Model (FISM), designed to precisely detect microaneurysms and retinal exudates, dangerous indicators of DR. Using the Diabetic Retinopathy Dataset (DDR), our model draws on both the segmentation and grading subsets, comprising over 13,000 fundus images annotated with comprehensive lesion-level and DR severity information, enabling robust training for both detection and classification tasks. The preprocessing pipeline comprises band separation, generative adversarial network (GAN)-based data augmentation, and extensive normalization techniques. The FISM architecture is derived from the Segment Anything Model (SAM) and integrates transformer layers and patch embedding techniques. The model begins with patch embedding followed by transformer blocks to capture both local and global relationships within retinal images. The architecture employs transfer learning, domain-specific fine-tuning, customized loss functions, and attention mechanisms to optimize feature extraction and segmentation accuracy. The image encoder and mask decoder modules work in tandem to transform input retinal images into precise segmentation masks, highlighting regions affected by DR. Beyond deep learning, the framework also integrates reinforcement learning to direct the exploration of regions of interest so that the model can highlight areas relevant to diagnosis. This form of adaptive attention improves both detection precision and computational cost.
Results show that FISM outperforms state-of-the-art methods, achieving 96.32% accuracy, 95.14% precision, 95.25% recall and a 96.33% F1-score. The model demonstrates an AUC of 96.32%, specificity of 94.13%, segmentation Dice coefficient of 94.21% and IoU of 96.01%. These metrics indicate superior performance in both detection and segmentation tasks for early diabetic retinopathy diagnosis.
Title: FISM: harnessing deep learning and reinforcement learning for precision detection of microaneurysms and retinal exudates for early diabetic retinopathy diagnosis. Biodata Mining, vol. 18, no. 1, p. 75. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12576988/pdf/
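The two segmentation metrics reported in this abstract, the Dice coefficient and intersection over union (IoU), are both overlap ratios between a predicted mask and a reference mask. The sketch below shows how they are computed for one pair of binary masks; the masks are toy values, and treating two empty masks as perfect agreement is an assumed convention.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two binary masks given as nested 0/1 lists."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    pred_area, truth_area = sum(p), sum(t)
    union = pred_area + truth_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (a convention)
    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou

# Toy masks (made up): the prediction misses one lesion pixel.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]

dice, iou = dice_and_iou(pred, truth)
print(dice, iou)
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so either one determines the other; values averaged over many images need not satisfy this identity exactly.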
Pub Date: 2025-10-29 | DOI: 10.1186/s13040-025-00489-y
Kenneth D Doig, Rashindrie Perera, Yamuna Kankanige, Andrew Fellowes, Jason Li, Richard Lupat, Ella R Thompson, Piers Blombery, Stephen B Fox
Background: Targeted next generation sequencing (NGS) of somatic DNA is now routinely used for diagnostic and predictive reporting in the oncology clinic. The expert genomic analysis required for NGS assays remains a bottleneck to scaling the volume of patients being assessed. This study harnesses data from targeted clinical sequencing to build machine learning models that predict whether patient variants should be reported.
Methods: Three somatic assays were used to build machine learning prediction models using the estimators Logistic Regression, Random Forest, XGBoost and Neural Networks. Using manual expert curation to select reportable variants as ground truth, we built models to classify clinically reportable variants. Assays were performed between 2020 and 2023 yielding 1,350,018 variants and used to report on 10,116 patients. All variants, together with 211 annotations and sequencing features, were used by the models to predict the likelihood of variants being reported.
Results: The tree-based ensemble models performed consistently well, achieving between 0.904 and 0.996 on the precision-recall area under the curve (PRC AUC) metric when predicting whether a variant should be reported. To assist model explainability, individual model predictions were presented to users within a tertiary analysis platform as a waterfall plot showing individual feature contributions and their values for the variant. Over 30% of the model performance was due to features sourced from statistics derived in-house from the sequencing assay, precluding easy generalization of the models to other assays or other laboratories.
Conclusions: Longitudinally acquired NGS assay data provide a strong basis for machine learning models for decision support to select variants for clinical oncology reports. The models provide a framework for consistent reporting practices and reducing inter-reviewer variability. To improve model transparency, individual variant predictions can be presented as part of reviewer workflows.
Title: Using artificial intelligence (AI) to model clinical variant reporting for next generation sequencing (NGS) oncology assays. Biodata Mining, vol. 18, no. 1, p. 74. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12570631/pdf/
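The precision-recall AUC used to evaluate these models suits the setting described, where reportable variants are a small fraction of all variants. The sketch below computes average precision, one common step-wise estimator of that area; whether the study used this estimator or another interpolation is not stated in the abstract, and the labels and scores are toy values.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Ranks items by descending score and averages the precision observed
    at the rank of each positive item. Tied scores are not treated
    specially here; they keep their input order.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_pos = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            true_pos += 1
            precision_sum += true_pos / rank
    return precision_sum / total_pos

# Toy example (made up): 1 = variant was reported, 0 = not reported.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

ap = average_precision(labels, scores)
print(round(ap, 4))
```

Here the positives sit at ranks 1, 3 and 6, giving precisions of 1, 2/3 and 1/2, whose mean is the reported value. Unlike ROC AUC, this score drops sharply when many negatives outrank the positives, which is why it is informative for rare-positive problems.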
Pub Date: 2025-10-21 DOI: 10.1186/s13040-025-00488-z
Ana Niño-López, Álvaro Martínez-Rubio, Rocío Picón-González, Ana Castillo Robleda, Manuel Ramírez Orellana, Salvador Chulián, María Rosa
B Acute Lymphoblastic Leukemia (B-ALL) accounts for approximately 80% of pediatric leukemia cases. Despite treatment advances, 15-20% of children experience relapse, highlighting the need for improved patient monitoring and for novel strategies leading to successful therapies. Flow cytometry is an essential technique for measuring residual disease and guiding treatment. However, traditional manual gating limits its efficiency. In recent years, computational tools have been integrated to enhance these clinical processes, but many mathematical techniques remain underexploited. In particular, Uniform Manifold Approximation and Projection (UMAP), together with machine learning, provides a promising approach for analyzing large datasets. Mathematical tools and artificial intelligence offer new perspectives on these health problems, beyond the usual approach in biomedicine. We used 234 samples from 75 B-ALL patients to develop an artificial intelligence-based algorithm that can improve patient classification and therapy decisions in different patient cohorts. This represents an advance over routine manual analysis of disease progression: key subpopulations are identified automatically and patients' bone marrow regeneration patterns are distinguished, improving the prediction and prognosis of the disease.
{"title":"Automatic computational classification of bone marrow cells for B cell pediatric leukemia using UMAP.","authors":"Ana Niño-López, Álvaro Martínez-Rubio, Rocío Picón-González, Ana Castillo Robleda, Manuel Ramírez Orellana, Salvador Chulián, María Rosa","doi":"10.1186/s13040-025-00488-z","DOIUrl":"10.1186/s13040-025-00488-z","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"73"},"PeriodicalIF":6.1,"publicationDate":"2025-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12538793/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145349506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Gastric cancer (GC) remains one of the most prevalent cancers worldwide, and its prognosis relies heavily on early detection. Traditional GC diagnostic methods are invasive and risky, prompting interest in non-invasive alternatives that could improve outcomes.
Method: In this study, we introduce a non-invasive approach, World Hyper-heuristic Fuzzy Deep Learning (WHFDL), for gastric cancer prediction using metabolomics. Metabolomics profiles of plasma samples from 702 individuals were obtained and used for classification. For efficient feature selection, we employed the World Hyper-heuristic, a metaheuristic, to extract the most relevant features from the dataset. The selected features were then classified with a Fuzzy Deep Neural Network.
Results: The performance of WHFDL was assessed and compared against a comprehensive set of classical and state-of-the-art feature selection and classification algorithms. Our results highlighted six key metabolites as biomarkers associated with gastric cancer: 1-Methyladenosine, C18-Carnitine, Guanidineacetic acid, Hypoxanthine, Nicotinamide mononucleotide, and Succinate. WHFDL outperformed all other classifiers, achieving an F1-score, recall and precision of 94%, 93% and 94%, respectively, along with an accuracy of 94% and an Area Under the Curve of 0.9384. Interpretability was analyzed using SHAP, LIME, IG calibration analysis, and adversarial testing, demonstrating the model's transparency. The source code is available at https://github.com/arman-daliri/WHFDL.
{"title":"WHFDL: an explainable method based on World Hyper-heuristic and Fuzzy Deep Learning approaches for gastric cancer detection using metabolomics data.","authors":"Nora Mahdavi, Arman Daliri, Mahdieh Zabihimayvan, Yalda Yaghooti, Mohammad Mahdi Mir, Parastoo Ghazanfari, Avin Zarrabi, Pedram Khalaj, Reza Sadeghi","doi":"10.1186/s13040-025-00486-1","DOIUrl":"10.1186/s13040-025-00486-1","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"72"},"PeriodicalIF":6.1,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12514820/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145276395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
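The precision, recall and F1-score reported for WHFDL are standard functions of confusion-matrix counts. The sketch below shows how they relate, assuming a binary task. The function name and the counts are hypothetical and are not taken from the paper or its repository.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    # Guard against empty denominators by returning 0.0 for the undefined case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical counts chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=93, fp=6, fn=7)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Published figures are often averaged over classes or cross-validation folds, so they need not satisfy the harmonic-mean identity exactly.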
Pub Date: 2025-10-09 DOI: 10.1186/s13040-025-00448-7
Tai-Hoon Kim, Asma Aldrees, Dina Abdulaziz AlHammadi, Muhammad Umer, Taoufik Saidani, Shtwai Alsubai, Imran Ashraf
This work presents MediNet, an approach to drug safety review classification that combines three word embedding approaches (FastText, ELMo, and GloVe) with an ensemble of EfficientNetB4 and MobileNet models. The blend of word embeddings captures both context-independent and context-dependent representations, enabling the model to capture complex linguistic nuances within drug reviews. The ensemble architecture leverages EfficientNetB4's scalability and MobileNet's efficiency, making MediNet both powerful and resource-efficient. MediNet is evaluated on a comprehensive dataset of drug safety reviews, achieving 95.69% accuracy, 96.46% precision, 98.30% recall, and a 97.22% F1 score. Its generalizability is evaluated using cross-validation, demonstrating the statistical significance of the results. MediNet is also compared against six other well-known transfer learning models, which it consistently outperforms across all metrics. These results suggest that MediNet is a highly effective solution for classifying drug safety reviews, significantly improving accuracy and reliability compared to existing models. The proposed approach offers a promising direction for future research in natural language processing and its application to healthcare.
{"title":"MediNet: ensemble transfer learning approach for classification of medical drugs-related text reviews using significant combined-embeddings.","authors":"Tai-Hoon Kim, Asma Aldrees, Dina Abdulaziz AlHammadi, Muhammad Umer, Taoufik Saidani, Shtwai Alsubai, Imran Ashraf","doi":"10.1186/s13040-025-00448-7","DOIUrl":"10.1186/s13040-025-00448-7","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"71"},"PeriodicalIF":6.1,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512702/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145259399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
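The MediNet abstract says generalizability was assessed with cross-validation but gives neither the number of folds nor the splitting scheme. The sketch below shows plain k-fold index splitting under assumed values (10 samples, 3 folds, no shuffling or stratification). The function name is hypothetical and this is not the authors' protocol.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for contiguous k-fold splits."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")

    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Assumed toy setting: 10 reviews split into 3 folds.
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(10, 3), start=1):
    print(fold_number, len(train_idx), test_idx)
```

In practice the data would normally be shuffled, and often stratified by class, before splitting. Each review lands in exactly one test fold, so per-fold metrics can be averaged into a single estimate.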