Pub Date: 2025-11-18; DOI: 10.1186/s12859-025-06302-1
Jiajia Liu, Surendra S Negi, Chengyuan Yang, Xiaobo Zhou, Catherine H Schein, Werner Braun, Pora Kim
AllergenAI: a deep learning model predicting allergenicity based on protein sequence. BMC Bioinformatics 26(1):279. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625376/pdf/
Pub Date: 2025-11-18; DOI: 10.1186/s12859-025-06309-8
Nan Sun, Yu Wang, Xiang Shi, Dengcheng Yang, Rongling Wu, Stephen S-T Yau
Accurate cell type classification is critical for downstream analysis in single-cell RNA sequencing (scRNA-seq). Most existing methods rely on a single type of feature representation, such as statistical, information-theoretic, matrix factorization, or deep learning-based features. However, each captures different aspects of the data, and no single feature type can fully represent the complex differences between cell types. Moreover, naïvely concatenating multiple features may introduce redundancy or noise, reducing model performance. To address these challenges, we propose scMFF, a multiple feature fusion framework that integrates four features and explores six fusion strategies in combination with various classifiers for single-cell type classification. Comprehensive evaluations on 42 disease-related datasets and an external COVID-19 dataset demonstrate that scMFF outperforms single-feature approaches in terms of performance and stability, providing a reliable and effective solution for scRNA-seq data analysis.
scMFF: a machine learning framework with multiple feature fusion strategies for cell type identification. BMC Bioinformatics 26(1):277. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625116/pdf/
Pub Date: 2025-11-18; DOI: 10.1186/s12859-025-06310-1
Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi, Rakesh Hadne Sreenath, Henning Hermjakob, Tung V N Nguyen, Thomas Rückle, Mohammed H Mosa, Henrik Cordes, Tommaso Andreani, Thomas Klabunde, Rahuman S Malik Sheriff, Douglas McCloskey
Background: Quantitative kinetic models of biological regulatory processes play an important role in understanding disease mechanisms. However, their simulation and analysis require specialized domain expertise.
Results: In this study, we present Talk2Biomodels (T2B), an open-source, user-friendly, large language model-based agentic AI platform designed to facilitate access to computational models of biological systems and promote the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles in systems biology. T2B allows users to interact with and analyse mathematical models of biological systems through conversations in natural language, thereby lowering the barrier to entry for model interpretation and hypothesis-driven exploration. The platform natively supports models encoded in the Systems Biology Markup Language, a widely adopted standard in the computational biology community. T2B is integrated with the BioModels database (https://www.ebi.ac.uk/biomodels/), enabling retrieval, simulation, and analysis of curated systems biology models. We illustrate the platform's capabilities through use cases in precision medicine, infectious disease epidemiology, and the study of emergent network-level properties in cellular systems, demonstrating how both computational experts and domain scientists without formal modelling training can derive actionable insights from complex biological models. Talk2Biomodels is available at https://github.com/VirtualPatientEngine/AIAgents4Pharma. Detailed documentation and use cases are available at https://virtualpatientengine.github.io/AIAgents4Pharma/talk2biomodels/intro/.
Conclusions: In summary, T2B lowers the barrier for non-experts to engage with and extract insights from computational models of biological systems, while providing experts with a streamlined interface for analysing models, and contributes overall to the FAIRification of models.
Talk2Biomodels: AI agent-based open-source LLM initiative for kinetic biological models. BMC Bioinformatics 26(1):276. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625589/pdf/
Pub Date: 2025-11-18; DOI: 10.1186/s12859-025-06287-x
Ze Chen, Julie Blommaert, Yi Mei, Linley Jesson, Maren Wellenreuther, Mengjie Zhang
Background: Chrysophrys auratus (family: Sparidae), commonly known as Australasian snapper, is a warm-water species being developed as a candidate for aquaculture in New Zealand. Genomic selection of elite snapper offers significant potential to accelerate genetic gains in aquaculture; however, the complexity of genetic architecture, coupled with challenges such as missing data and high dimensionality, poses significant hurdles. Machine learning techniques have emerged as powerful tools in genomic selection programmes due to their flexibility and ability to model complex, polygenic and non-linear relationships between genotypes and traits. This study aims to develop a comprehensive machine learning framework to evaluate imputation methods and genomic prediction models, and identify single-nucleotide polymorphisms associated with growth traits in snapper, ultimately contributing to the advancement of selective breeding programmes.
Results: We evaluated multiple approaches for each component of the machine learning framework. We developed and evaluated the Domain Knowledge-based K-nearest neighbour (DK-KNN) imputation method, which achieved an imputation accuracy of 98.33% in simulation testing, outperforming two alternative imputation methods. Among the feature selection and classification combinations evaluated for growth prediction, Chi-squared feature selection paired with Distance-Weighted Discrimination (Chi2-DWD) achieved 60% prediction accuracy, comparable to genomic best linear unbiased prediction (60.3%) but without requiring the genomic relationship matrix. Notably, the two-stage approach using Domain Knowledge-based Pre-filtering (DK Pre-filtering) did not substantially change prediction accuracy, and it proved valuable in reducing the dimensionality of the feature space.
Conclusions: Integration of domain knowledge into machine learning frameworks effectively addresses missing values and high-dimensional challenges in snapper genomic data. The evaluated framework demonstrates that Chi2-DWD represents a promising combination for genomic prediction tasks. The DK Pre-filtering workflow successfully removes redundant features without affecting model performance. Selected features showed biological significance and were confirmed to be associated with growth traits based on biological analysis, providing valuable insights for selective breeding programmes.
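As a rough illustration of the Chi-squared feature selection step mentioned in this abstract, the sketch below scores each SNP column with a Pearson chi-squared statistic computed from its genotype-by-class contingency table and keeps the top-ranked columns. It is a generic sketch, not the authors' pipeline: the function names `chi2_statistic` and `top_k_snps`, the 0/1/2 genotype coding, and the binary growth label are all assumptions made for illustration, and the paper may use a different chi-squared variant.

```python
from collections import Counter


def chi2_statistic(genotypes, labels):
    """Pearson chi-squared statistic for one SNP column against a class label."""
    n = len(genotypes)
    if n == 0 or n != len(labels):
        raise ValueError("genotypes and labels must be non-empty and of equal length")
    observed = Counter(zip(genotypes, labels))  # counts per (genotype, class) cell
    row_totals = Counter(genotypes)
    col_totals = Counter(labels)
    stat = 0.0
    for g, r in row_totals.items():
        for c, k in col_totals.items():
            expected = r * k / n  # expected count under independence
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return stat


def top_k_snps(snp_matrix, labels, k):
    """Return the indices of the k SNP columns with the highest chi-squared score."""
    n_snps = len(snp_matrix[0])
    scores = [
        chi2_statistic([row[j] for row in snp_matrix], labels)
        for j in range(n_snps)
    ]
    return sorted(range(n_snps), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (hypothetical): rows are individuals, columns are SNPs coded 0/1/2.
snps = [[0, 0], [0, 1], [2, 0], [2, 1]]
growth_class = [0, 0, 1, 1]
selected = top_k_snps(snps, growth_class, k=1)
```

In this toy example the first SNP separates the two classes perfectly and the second is independent of the label, so only the first column is selected. The selected columns would then be passed to a downstream classifier.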
Machine learning for genomic prediction of growth traits in aquaculture: a case study of the Australasian snapper (Chrysophrys auratus). BMC Bioinformatics 26(1):278. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625465/pdf/
Pub Date: 2025-11-18; DOI: 10.1186/s12859-025-06297-9
Elijah R Bring Horvath, Jaclyn M Winter
Background: The rapid increase in publicly available microbial and metagenomic data has created a growing demand for tools that can efficiently perform custom large-scale comparative searches and functional annotation. While BLAST+ remains the standard for sequence similarity searches, population-level studies often require custom scripting and manual curation of results, which can present barriers for many researchers.
Results: We developed SeqForge, a scalable, modular command-line toolkit that streamlines alignment-based searches and motif mining across large genomic datasets. SeqForge automates BLAST+ database creation and querying, integrates amino acid motif discovery, enables sequence and contig extraction, and curates results into structured, easily parsed formats. The platform supports diverse input formats, parallelized execution for high-performance computing environments, and built-in visualization tools. Benchmarking demonstrates that SeqForge achieves near-linear runtime scaling for computationally intensive modules while maintaining modest memory usage.
Conclusions: SeqForge lowers the computational barrier for large-scale meta/genomic exploration, enabling researchers to perform population-scale BLAST searches, motif detection, and sequence curation without custom scripting. The toolkit is freely available and platform-independent, making it suitable for both personal workstations and high-performance computing environments.
SeqForge: a scalable platform for alignment-based searches, motif detection, and sequence curation across meta/genomic datasets. BMC Bioinformatics 26(1):280. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625553/pdf/
Background: Predicting drug-target interactions (DTIs) plays a pivotal role in accelerating drug repositioning by prioritizing candidate drugs and reducing experimental costs. Despite advancements in deep learning, several challenges still require further exploration, including sparsity and inadequate representation of feature relationships.
Results: We propose GCNMM, a novel graph convolutional network based on meta-paths and mutual information, to predict latent DTIs in drug-target heterogeneous networks. Our approach begins by constructing a fused DTI network based on meta-paths and a graph attention network. We compute multiple similarity networks using Jaccard coefficients and integrate them into the fused drug and target similarity networks through entropy-based fusion. These networks are then jointly processed by a graph convolutional auto-encoder to generate low-dimensional feature representations. To preserve the topological structure of the original network in the embedding space and strengthen the relationship between the input and latent representations, we incorporate spatial topological consistency and mutual information maximization as dual optimization objectives.
Conclusions: The experimental results illustrate that GCNMM exhibits superior performance to existing baseline models in DTI prediction. Furthermore, case studies validate the practical effectiveness of GCNMM, highlighting its potential in DTI prediction and drug repositioning.
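The Jaccard-based similarity networks mentioned in the Results can be illustrated with a small sketch. It assumes each drug is described by the set of targets it is known to interact with; the profile dictionary, the drug and target identifiers, and the helper names `jaccard` and `similarity_matrix` are hypothetical and not taken from the paper, which may compute the coefficient over other association types as well.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty profiles share nothing
    return len(a & b) / len(union)


def similarity_matrix(profiles):
    """Pairwise Jaccard similarities between named interaction profiles."""
    names = sorted(profiles)
    matrix = [[jaccard(profiles[x], profiles[y]) for y in names] for x in names]
    return names, matrix


# Hypothetical drug -> known-target profiles (placeholder identifiers).
drug_targets = {
    "drug_A": {"target_1", "target_2"},
    "drug_B": {"target_2", "target_3"},
    "drug_C": set(),
}
names, sim = similarity_matrix(drug_targets)
```

Here `drug_A` and `drug_B` share one of three distinct targets, giving a similarity of 1/3. The same function applies to target-side profiles, and the resulting matrices are the kind of input a fusion step would combine.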
Graph convolution network based on meta-paths and mutual information for drug-target interaction prediction. Shujuan Cao, Binying Cai, Zhejian Qiu, Tiantian Chang, Qiqige Wuyun, Fang-Xiang Wu. BMC Bioinformatics 26(1):275. Pub Date: 2025-11-07; DOI: 10.1186/s12859-025-06295-x. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12595897/pdf/
Pub Date: 2025-11-06; DOI: 10.1186/s12859-025-06099-z
Shuo Shuo Liu, Shikun Wang, Yuxuan Chen, Anil K Rustgi, Ming Yuan, Jianhua Hu
Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial context and the abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology, such as its relatively low resolution and insufficient sequencing depth, make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, to adaptively leverage cell-labeled information from external sources in inferring the cell-level heterogeneity of a target spatial transcriptomics dataset.
Results: Applications in several real studies as well as a number of simulation settings show that our approach significantly improves on existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, among all the studied methods, only TransST is able to separate the adipose tissues from the connective tissues.
Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.
TransST: transfer learning embedded spatial factor modeling of spatial transcriptomics data. BMC Bioinformatics 26(1):274. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12593783/pdf/
Background: Identifying potential associations among food, gut microbiota and disease is fundamental for elucidating interaction mechanisms and advancing personalized healthy dietary strategies. While computational methods have been extensively applied to predict microbiota-disease associations, methods for predicting food-microbiota relationships remain limited, particularly regarding higher-order food-microbiota-disease interactions.
Results: In this work, we construct a food-microbe-disease (FMD) database encompassing 190 food items, 219 gut microbiota species, and 163 disease entities, resulting in 17,065 FMD associations. We then propose a lightweight single-view contrastive learning hypergraph neural network (LSCHNN) for FMD association prediction on the sparse FMD dataset. LSCHNN formulates ternary FMD interactions as a hypergraph, in which foods, microbes, and diseases are represented by nodes and FMD triplets are represented by hyperedges, and leverages the biological features of foods, microbes, and diseases as node attributes. Subsequently, a hypergraph neural network is designed to learn the embeddings of foods, microbes, and diseases from the hypergraph and predict potential ternary FMD associations. Additionally, we incorporate a single-view contrastive learning mechanism that enhances the model's ability to extract discriminative features and improves generalization on sparse data. Comprehensive comparison experiments demonstrate that LSCHNN outperforms other state-of-the-art methods in terms of precision in predicting ternary FMD associations and in discovering more potential FMD associations. Case studies on two microbes further confirm the effectiveness of LSCHNN in identifying potential FMD associations.
Conclusions: A novel computational model, LSCHNN, is proposed, marking the first integration of hypergraph neural networks with lightweight single-view contrastive learning for ternary FMD association prediction, providing a groundbreaking framework for precision nutrition and personalized dietary interventions.
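The hypergraph formulation described in the Results, with foods, microbes, and diseases as nodes and each FMD triplet as one hyperedge, can be made concrete with a node-by-hyperedge incidence matrix. The sketch below is an illustration under assumptions: the helper name `build_incidence` and the placeholder entity names are hypothetical, and the triplets are invented stand-ins rather than entries from the FMD database.

```python
KINDS = ("food", "microbe", "disease")


def build_incidence(triplets):
    """Build a node-by-hyperedge incidence matrix from (food, microbe, disease) triplets.

    Each triplet becomes one hyperedge joining exactly three typed nodes.
    """
    # Nodes are (kind, name) pairs so a food and a disease may share a name safely.
    nodes = sorted({(kind, name) for t in triplets for kind, name in zip(KINDS, t)})
    index = {node: i for i, node in enumerate(nodes)}
    incidence = [[0] * len(triplets) for _ in nodes]
    for e, triplet in enumerate(triplets):
        for kind, name in zip(KINDS, triplet):
            incidence[index[(kind, name)]][e] = 1
    return nodes, incidence


# Placeholder triplets (not real associations from the database).
fmd_triplets = [
    ("food_1", "microbe_1", "disease_1"),
    ("food_1", "microbe_2", "disease_2"),
]
nodes, H = build_incidence(fmd_triplets)
```

Every column of `H` contains exactly three ones, one per entity type, and the row for `food_1` is set in both columns because that food takes part in both hyperedges. A hypergraph neural network would typically propagate node attributes through a matrix of this shape.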
A lightweight single-view contrastive learning hypergraph neural network for food-microbe-disease association prediction. Jianqiang Hu, Mingyi Hu, Yangxiang Wu, Songyao Mu, Dahao Huang, Baolong Wang, Yuchen Gao, Shixin Gu, Jinlin Zhu. BMC Bioinformatics 26(1):273. Pub Date: 2025-11-04; DOI: 10.1186/s12859-025-06283-1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12584493/pdf/
Pub Date : 2025-10-31 DOI: 10.1186/s12859-025-06275-1
Namgil Lee, Hojin Yoo, Juhyoung Kim, Heejung Yang
Background: In bottom-up proteomics using data-independent acquisition mass spectrometry (DIA-MS), quantitative measurements are obtained following multiple steps of protein fragmentation and ionization, which introduces cumulative errors and impairs the effectiveness of classical statistical methods. This study proposes an alternative statistical approach for testing group mean differences at the peptide level in quantitative bottom-up proteomics.
Results: We present a novel probabilistic graphical model that accounts for the non-normality of empirical distributions and the correlations between fragment ion quantities. Based on the model, we propose a new statistical method that improves upon the classical feature-based approach by incorporating distribution-free shrinkage estimation of covariance matrices and bootstrap-based estimation of degrees of freedom. Simulation experiments demonstrate that the proposed method outperforms the four most widely used classical methods in terms of specificity, sensitivity, and accuracy, particularly when the data distribution closely resembles real MS data and when sample sizes are small. Numerical analysis of real quantitative tandem mass spectrometry data reveals that the proposed method effectively identifies candidate peptides exhibiting changes in mean quantity following treatment with the kinase inhibitor Staurosporine.
Conclusions: The proposed statistical method offers an effective alternative to classical approaches for differential analysis of peptides in quantitative bottom-up proteomics using DIA-MS. The R software package MDstatsDIAMS is available at https://github.com/namgillee/MDstatsDIAMS .
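To make the idea of covariance shrinkage concrete, here is a minimal sketch, assuming the simplest linear form in which off-diagonal covariances between fragment ions are scaled toward zero by a fixed intensity. This is not the estimator implemented in MDstatsDIAMS: the function names and the user-supplied `intensity` are assumptions for illustration, whereas a data-driven method would estimate the intensity from the samples.

```python
from typing import List

Matrix = List[List[float]]


def sample_covariance(rows: Matrix) -> Matrix:
    """Unbiased sample covariance; rows are replicates, columns are fragment ions."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two replicates")
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


def shrink_to_diagonal(cov: Matrix, intensity: float) -> Matrix:
    """Linear shrinkage toward the diagonal target.

    Off-diagonal entries are multiplied by (1 - intensity); variances on the
    diagonal are left unchanged. intensity = 0 returns the sample covariance,
    intensity = 1 returns a diagonal matrix.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    p = len(cov)
    return [
        [cov[j][k] if j == k else (1.0 - intensity) * cov[j][k] for k in range(p)]
        for j in range(p)
    ]


# Toy quantities for two fragment ions across three replicates (made-up numbers).
toy_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
toy_cov = sample_covariance(toy_rows)
toy_shrunk = shrink_to_diagonal(toy_cov, 0.5)
```

Shrinking toward a diagonal target trades a small bias for lower variance in the estimated correlations, which is the usual motivation when the number of replicates is small relative to the number of fragment ions.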
{"title":"A shrinkage-based statistical method for testing group mean differences in quantitative bottom-up proteomics.","authors":"Namgil Lee, Hojin Yoo, Juhyoung Kim, Heejung Yang","doi":"10.1186/s12859-025-06275-1","DOIUrl":"10.1186/s12859-025-06275-1","url":null,"abstract":"<p><strong>Background: </strong>In bottom-up proteomics using data-independent acquisition mass spectrometry (DIA-MS), quantitative measurements are obtained following multiple steps of protein fragmentation and ionization, which introduces cumulative errors and impairs the effectiveness of classical statistical methods. This study proposes an alternative statistical approach for testing group mean differences at the peptide level in quantitative bottom-up proteomics.</p><p><strong>Results: </strong>We present a novel probabilistic graphical model, that accounts for the non-normality of empirical distributions and the correlations between fragment ion quantities. Based on the model, we propose a new statistical method that improves upon the classical feature-based approach by incorporating distribution-free shrinkage estimation of covariance matrices and bootstrap-based estimation of degrees-of-freedom. Simulated experiments demonstrate that the proposed method outperforms the four most widely used classical methods in terms of specificity, sensitivity, and accuracy, particularly when the data distribution closely resembles real MS data, and under conditions of small sample sizes. Numerical analysis of real quantitative tandem mass spectrometry data reveals that the proposed method effectively identifies candidate peptides exhibiting changes in mean quantity following treatment with the kinase inhibitor Staurosporine.</p><p><strong>Conclusions: </strong>The proposed statistical method offers an effective alternative to classical approaches for differential analysis of peptides in quantitative bottom-up proteomics using DIA-MS. 
The R software package MDstatsDIAMS is available at https://github.com/namgillee/MDstatsDIAMS .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"269"},"PeriodicalIF":3.3,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12577184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145421027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-31 DOI: 10.1186/s12859-025-06290-2
Xinqing Jiang, Xiaojun Chen, Lifeng Xu, Feng Zhang, Jiawei Chen, Wenqian Zhang
Small nucleolar RNAs (snoRNAs), a class of non-coding RNAs broadly distributed in eukaryotes, are emerging as pivotal regulators in the field of epigenomics. In addition to guiding 2'-O-methylation and pseudouridylation modifications at specific rRNA sites to maintain ribosomal stability and support protein synthesis, snoRNAs have been increasingly implicated in epigenetic regulation, influencing gene expression, chromatin architecture, and RNA modification patterns. Accurate identification of potential snoRNA-disease associations (SDAs) is therefore essential for understanding epigenomic dysregulation in complex diseases and facilitating early intervention and drug repurposing. Although artificial intelligence (AI) methods have advanced SDA prediction, they are still hindered by issues such as sample imbalance and high false-negative rates. To address these challenges, we propose ESAE-SDA, a novel model integrating sparse autoencoders with an ensemble learning framework. ESAE-SDA first constructs a comprehensive snoRNA-disease representation using multi-source similarity metrics. It then applies k-means clustering to select high-confidence negative samples and employs a deep sparse autoencoder with sparsity constraints to learn compact, discriminative embeddings. Finally, multiple GNN-based learners are independently trained on dynamically resampled data, and ensemble inference is performed via weighted fusion, substantially enhancing robustness and generalization. Experiments on a public SDA dataset demonstrate that ESAE-SDA consistently outperforms state-of-the-art methods. Notably, a case study on ophthalmic diseases highlights the model's ability to uncover epigenetically relevant snoRNAs with potential regulatory and therapeutic significance, underscoring its value in epigenomics-driven disease research and target discovery.
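The final ensemble step, weighted fusion of scores from independently trained learners, can be sketched as a normalized weighted average. This is an illustration under stated assumptions rather than the ESAE-SDA code: the function name `weighted_fusion` is hypothetical, and how the weights are chosen (for example, from validation performance) is not specified in the abstract.

```python
from typing import List, Sequence


def weighted_fusion(
    scores_per_learner: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Fuse per-learner association scores into one score per candidate pair.

    scores_per_learner[i][j] is learner i's score for candidate pair j.
    The result for pair j is sum_i(w_i * score_ij) / sum_i(w_i).
    """
    if not weights or len(scores_per_learner) != len(weights):
        raise ValueError("need exactly one weight per learner")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_pairs = len(scores_per_learner[0])
    if any(len(s) != n_pairs for s in scores_per_learner):
        raise ValueError("all learners must score the same candidate pairs")
    return [
        sum(w * s[j] for w, s in zip(weights, scores_per_learner)) / total
        for j in range(n_pairs)
    ]


# Two hypothetical learners scoring two candidate snoRNA-disease pairs.
toy_scores = [[0.2, 0.8], [0.6, 0.4]]
toy_weights = [1.0, 3.0]  # assumed weights, e.g. from validation performance
toy_fused = weighted_fusion(toy_scores, toy_weights)
```

Normalizing by the weight sum keeps the fused score on the same scale as the individual learners' scores, so a single decision threshold can still be applied.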
{"title":"ESAE-SDA: ensemble sparse autoencoder framework for epigenomics-informed snoRNA-disease associations prediction.","authors":"Xinqing Jiang, Xiaojun Chen, Lifeng Xu, Feng Zhang, Jiawei Chen, Wenqian Zhang","doi":"10.1186/s12859-025-06290-2","DOIUrl":"10.1186/s12859-025-06290-2","url":null,"abstract":"<p><p>Small nucleolar RNAs (snoRNAs), a class of non-coding RNAs broadly distributed in eukaryotes, are emerging as pivotal regulators in the field of epigenomics. In addition to guiding 2'-O-methylation and pseudouridylation modifications at specific rRNA sites to maintain ribosomal stability and support protein synthesis, snoRNAs have been increasingly implicated in epigenetic regulation, influencing gene expression, chromatin architecture, and RNA modification patterns. Accurate identification of potential snoRNA-disease associations (SDAs) is therefore essential for understanding epigenomic dysregulation in complex diseases and facilitating early intervention and drug repurposing. Although artificial intelligence (AI) methods have advanced SDA prediction, they are still hindered by issues such as sample imbalance and high false-negative rates. To address these challenges, we propose ESAE-SDA, a novel model integrating sparse autoencoders with an ensemble learning framework. ESAE-SDA first constructs a comprehensive snoRNA-disease representation using multi-source similarity metrics. It then applies k-means clustering to select high-confidence negative samples and employs a deep sparse autoencoder with sparsity constraints to learn compact, discriminative embeddings. Finally, multiple GNN-based learners are independently trained on dynamically resampled data, and ensemble inference is performed via weighted fusion, substantially enhancing robustness and generalization. Experiments on a public SDA dataset demonstrate that ESAE-SDA consistently outperforms state-of-the-art methods. 
Notably, a case study on ophthalmic diseases highlights the model's ability to uncover epigenetically relevant snoRNAs with potential regulatory and therapeutic significance, underscoring its value in epigenomics-driven disease research and target discovery.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"270"},"PeriodicalIF":3.3,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12577141/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145421061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}