Summary: Proteomics has developed many approaches to inform the subcellular organization of proteins, each with differing coverage and sensitivity to distinct scales. Here, we develop a self-supervised deep learning framework, ProteinProjector, that flexibly integrates all available data for a protein from any number of modalities, resulting in a unified map of protein position. As an initial proof of concept, we integrate four proteome-wide characterizations of HEK293 human embryonic kidney cells, including protein affinity purification, proximity ligation, and size-exclusion chromatography mass spectrometry (AP-MS, PL-MS, SEC-MS), as well as protein fluorescent imaging. Map coverage and accuracy grow substantially as new data modes are added, with maximal recovery of known complexes observed when using all four proteomic datasets. We find that ProteinProjector outperforms individual modalities and other integration methods in recovery of orthogonal functional and physical associations not used during training. ProteinProjector provides a foundation for integration of diverse modalities that characterize subcellular structure.
Availability and implementation: ProteinProjector is available as part of the Cell Mapping Toolkit at https://github.com/idekerlab/cellmaps_coembedding.
Title: Unifying proteomic technologies with ProteinProjector. Authors: Leah V Schaffer, Mayank Jain, Rami Nasser, Roded Sharan, Trey Ideker. Journal: Bioinformatics Advances 5(1), vbaf266. Pub Date: 2025-10-22. DOI: 10.1093/bioadv/vbaf266. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12680973/pdf/
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf259
Patrik Waldmann
Motivation: Artificial selection improves desired traits, but reduces genetic diversity within populations. Modern breeding programs aim to balance genetic gain with the maintenance of genetic variation to ensure long-term sustainability. Optimum contribution selection (OCS) is a widely adopted strategy that maximizes genetic gain while limiting the rate of inbreeding, traditionally relying on pedigree data. However, genomic relationship matrices offer a more accurate measure of genetic relatedness. A subsequent step to OCS involves mate allocation (MA) to optimize breeding plans, which often presents significant computational challenges for large datasets.
Results: We developed a two-stage genomic OCS and mate allocation (GOCSMA) method implemented in JuMP/Julia. The OCS problem is formulated as a linear program with quadratic constraints and solved efficiently using the conic operator splitting method (COSMO). The subsequent MA problem, expressed as a mixed integer program, is solved with the SCIP framework's branch-cut-and-price algorithm. Applying GOCSMA to the simulated QTLMAS2010 dataset, we observed efficient convergence for OCS, which balanced genetic gain with coancestry constraints better than traditional top selection. The MA stage consistently achieved very low runtimes (<0.01 seconds), with integer mating constraints providing lower coancestry and higher genetic gain than binary constraints, indicating a better mating scheme. Hence, GOCSMA provides an efficient deterministic mathematical optimization framework for integrated genomic OCS and MA. Using advanced solvers within the flexible JuMP environment, our method offers a robust solution to balance genetic gain and diversity in large-scale breeding programs.
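For orientation, an OCS problem with a linear objective and a quadratic constraint is commonly written in the textbook form below. This is a sketch under assumptions, not necessarily the exact GOCSMA formulation: the symbols $\mathbf{c}$ (vector of genetic contributions), $\hat{\mathbf{g}}$ (genomic estimated breeding values), $\mathbf{G}$ (genomic relationship matrix), and $\theta$ (ceiling on group coancestry) are introduced here for illustration, and any sex-specific contribution constraints are omitted.

```latex
\begin{aligned}
\max_{\mathbf{c}} \quad & \mathbf{c}^{\top}\hat{\mathbf{g}} \\
\text{s.t.} \quad & \tfrac{1}{2}\,\mathbf{c}^{\top}\mathbf{G}\,\mathbf{c} \le \theta, \\
& \mathbf{1}^{\top}\mathbf{c} = 1, \qquad \mathbf{c} \ge \mathbf{0}.
\end{aligned}
```

The objective is linear in $\mathbf{c}$ and only the coancestry constraint is quadratic. Provided $\mathbf{G}$ is positive semidefinite, that constraint is convex and can be expressed as a second-order cone, which is the kind of problem a conic solver accepts.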
Availability and implementation: Source code and documentation are available at https://github.com/patwa67/GOCSMA.
Title: Genomic optimum contribution selection and mate allocation using JuMP. Authors: Patrik Waldmann. Journal: Bioinformatics Advances 5(1), vbaf259. Pub Date: 2025-10-22. DOI: 10.1093/bioadv/vbaf259. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12619993/pdf/
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf264
Bernard Isekah Osang'ir, Surya Gupta, Ziv Shkedy, Jürgen Claesen
Motivation: Insights from integrative multi-omics analyses have fueled demand for innovative computational methods and tools in multi-omics research. However, the scarcity of multi-omics datasets with user-defined signal structures hinders the evaluation of these newly developed tools. SUMO (SimUlating Multi-Omics), an open-source R package, was developed to address this gap by enabling the generation of high-quality factor analysis-based datasets with full control over the dataset's structure such as latent structures, noise, and complexity. Users can configure datasets with distinct and/or shared non-overlapping latent factors, enabling flexible and precise control over the signal structures. Consequently, SUMO allows reproducible testing and validation of methods, fostering methodological innovation.
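The idea of a factor-analysis-based simulation with a user-defined signal block can be illustrated with a minimal sketch. The Python code below is not SUMO's API (SUMO is an R package); the function names, block positions, and distribution parameters are all assumptions chosen for illustration. Each simulated omic is a single latent factor (sample scores times feature loadings) plus Gaussian noise, and both omics share the same sample scores.

```python
import random


def simulate_omic(scores, n_features, signal_features, noise_sd, rng):
    """Return a samples x features matrix: one latent factor plus Gaussian noise."""
    # Hypothetical choice: loadings are nonzero only on the signal-carrying features.
    loadings = [
        rng.gauss(2.0, 0.1) if j in signal_features else 0.0
        for j in range(n_features)
    ]
    return [
        [score * weight + rng.gauss(0.0, noise_sd) for weight in loadings]
        for score in scores
    ]


def simulate_two_omics(n_samples=20, n_signal_samples=10, noise_sd=0.2, seed=1):
    """Simulate two omics layers that share one latent factor (illustrative only)."""
    rng = random.Random(seed)
    # Shared factor scores: nonzero only for the first n_signal_samples samples.
    scores = [
        rng.gauss(3.0, 0.5) if i < n_signal_samples else 0.0
        for i in range(n_samples)
    ]
    # The two layers differ in feature count and in which features carry signal.
    omic1 = simulate_omic(scores, 30, set(range(0, 8)), noise_sd, rng)
    omic2 = simulate_omic(scores, 15, set(range(5, 10)), noise_sd, rng)
    return omic1, omic2


omic1, omic2 = simulate_two_omics()
```

Because the signal block (which samples and which features) is fixed by the caller, a method under test can be scored on whether it recovers exactly that block, and the seed makes the dataset reproducible.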
Availability and implementation: The SUMO R package is freely available and accessible on the Comprehensive R Archive Network https://doi.org/10.32614/CRAN.package.SUMO and on GitHub https://github.com/lucp12891/SUMO.git under CC-BY 4.0 license.
Title: SUMO: an R package for simulating multi-omics data for methods development and testing. Authors: Bernard Isekah Osang'ir, Surya Gupta, Ziv Shkedy, Jürgen Claesen. Journal: Bioinformatics Advances 5(1), vbaf264. Pub Date: 2025-10-22. DOI: 10.1093/bioadv/vbaf264. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12630132/pdf/
Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf260
Bruno Thiago de Lima Nichio, Roxana Beatriz Ribeiro Chaves, Jeroniza Nunes Marchaukoski, Fabio de Oliveira Pedrosa, Roberto Tadeu Raittz
Motivation: Biological nitrogen fixation is a vital process for global ecosystems and agriculture; however, the diversity and complexity of nif genes present significant challenges for the accurate identification of Nif proteins. Existing computational tools are often limited to a narrow subset of nif genes, leaving many important protein classes unexplored. NifFinder was developed to address this gap, combining SWeeP vector representation with neural network models to predict up to 24 different Nif proteins. By expanding the predictive scope and improving accuracy, NifFinder provides a more comprehensive and reliable framework to study nitrogen fixation, supporting both evolutionary insights and applications in agricultural sustainability.
Results: We present NifFinder, a computational framework that integrates SWeeP vector encoding with neural network classifiers to predict up to 24 different Nif protein classes across Archaea and Bacteria. NifFinder achieved an average accuracy of 84.31%, with a sensitivity of 86.49%, a precision of 81.97%, an F1-score of 82.33%, and a class correlation coefficient of 0.94. Benchmarking against curated Nif resources showed strong agreement and robust classification even under class imbalance. By expanding beyond traditional subsets of nif genes, NifFinder enables more reliable genome-wide identification of Nif proteins.
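The abstract does not say how the summary metrics are averaged across classes. Assuming they are macro-averages (an assumption, not something the source states), the sketch below shows how per-class precision, sensitivity, and F1 follow from a confusion matrix and are then averaged. The 3-class matrix is hypothetical toy data, not NifFinder output.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, sensitivity, and F1 from a square confusion matrix.

    confusion[i][j] counts items of true class i that were predicted as class j.
    """
    n = len(confusion)
    precisions, sensitivities, f1_scores = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                         # true class k, predicted otherwise
        fp = sum(confusion[i][k] for i in range(n)) - tp    # predicted k, true class differs
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        sensitivities.append(tp / (tp + fn) if tp + fn else 0.0)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(precisions) / n, sum(sensitivities) / n, sum(f1_scores) / n


def accuracy(confusion):
    """Fraction of all items that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


# Hypothetical toy confusion matrix (rows: true class, columns: predicted class).
toy = [
    [8, 2, 0],
    [1, 5, 4],
    [0, 0, 10],
]
precision, sensitivity, f1 = macro_metrics(toy)
harmonic = 2 * precision * sensitivity / (precision + sensitivity)
```

Under macro-averaging, `f1` is the mean of the per-class F1 values, so it generally differs from `harmonic`, the harmonic mean of the averaged precision and sensitivity. This is one possible reason a reported F1-score need not lie between the reported precision and sensitivity.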
Availability and implementation: The NifFinder installation instructions and source code can be accessed at https://sourceforge.net/projects/NifFinder.
Title: NifFinder: improved Nif protein prediction using SWeeP vectors and neural networks. Authors: Bruno Thiago de Lima Nichio, Roxana Beatriz Ribeiro Chaves, Jeroniza Nunes Marchaukoski, Fabio de Oliveira Pedrosa, Roberto Tadeu Raittz. Journal: Bioinformatics Advances 5(1), vbaf260. Pub Date: 2025-10-16. DOI: 10.1093/bioadv/vbaf260. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664700/pdf/
Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf261
Christian López, Roberto Cárdenas, Longendri Aguilera-Mendoza, Guillermin Agüero-Chapin, Félix Martínez-Rios, César R García-Jacas, Noel Pérez-Pérez, Yovani Marrero-Ponce
Motivation: The rapid growth of bioactive peptide sequences presents challenges for organization and analysis. Existing repositories often specialize in functions, taxonomic origins, or structural classes, but most remain isolated, use heterogeneous metadata, and lack uniform descriptors or structural models. Few integrative web services exist, offering only partial coverage or depth. As a result, reproducible and comprehensive exploration of the bioactive peptide landscape remains limited, underscoring the need for a unified, source-tracked, extensible platform.
Results: We present StarPepWeb, a freely accessible web application that democratizes access to StarPepDB, one of the largest curated repositories of bioactive peptides. The platform integrates 45 120 non-redundant sequences from 40 public databases into a source-tracked graph enriched with metadata, physicochemical features, and predicted 3D structures from ESMFold. Each peptide is represented with ESM-2 embeddings and iFeature descriptors, while the interface supports metadata-aware filtering, alignment-based similarity searches with single and multiple queries, and interactive visualization. A microservice-oriented architecture ensures scalability, maintainability, and reproducible versioned downloads, including Neo4j exports. StarPepWeb thus overcomes deployment and expertise barriers of the standalone database, providing an extensible, cloud-hosted framework for integrative bioactive peptide analysis.
Availability and implementation: StarPepWeb is freely available at https://starpepweb.org. Source code and documentation are hosted at https://github.com/starpep-web.
Title: StarPepWeb: an integrative, graph-based resource for bioactive peptides. Authors: Christian López, Roberto Cárdenas, Longendri Aguilera-Mendoza, Guillermin Agüero-Chapin, Félix Martínez-Rios, César R García-Jacas, Noel Pérez-Pérez, Yovani Marrero-Ponce. Journal: Bioinformatics Advances 5(1), vbaf261. Pub Date: 2025-10-16. DOI: 10.1093/bioadv/vbaf261. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701796/pdf/
Pub Date: 2025-10-15; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf215
R Travis Moreland, Christine E Schnitzler, Suiyuan Zhang, Sumeeta Singh, Tyra G Wolfsberg, Andreas D Baxevanis
Motivation: Colonial hydroids of the genus Hydractinia exhibit several unique biological properties, including a remarkable regenerative capacity and the ability to distinguish self from non-self, characteristics that make them valuable models for studying human disease and aging. The availability of well-annotated multi-omic data, as well as tools to visualize these data, is essential for advancing the use of these model organisms to enhance our understanding of the relationship between genomic and morphological complexity, the evolution of multicellularity, and the emergence of novel cell types.
Results: We present the Hydractinia Genome Project Portal, a comprehensive resource providing genomic, transcriptomic, and proteomic datasets for two widely studied Hydractinia species. The portal provides extensive sequence, structure, and functional annotation resources that are not available elsewhere, including genome browsers, a single-cell gene expression atlas, a protein structure viewer, and a custom BLAST implementation. We demonstrate the portal's utility for biological discovery and have used a subset of Hydractinia-specific stem cell gene markers to explore known gaps in annotation transfer methods, illustrating how structure-based deep learning methods such as DeepFRI can significantly improve the functional annotation of heretofore unannotated i-cell markers.
Availability and implementation: The Hydractinia Genome Project Portal is freely available at https://research.nhgri.nih.gov/hydractinia.
Title: The Hydractinia Genome Project Portal: multi-omic annotation and visualization of Hydractinia genomic datasets. Authors: R Travis Moreland, Christine E Schnitzler, Suiyuan Zhang, Sumeeta Singh, Tyra G Wolfsberg, Andreas D Baxevanis. Journal: Bioinformatics Advances 5(1), vbaf215. Pub Date: 2025-10-15. DOI: 10.1093/bioadv/vbaf215. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12624445/pdf/
Motivation: Metagenomic data increasingly reflect the coexistence of species from Archaea, Bacteria, Eukaryotes, and Viruses in complex environments. Taxonomic classification across the four superkingdoms is essential for understanding microbial communities, exploring genomic evolutionary relationships, and identifying novel species. This task is inherently imbalanced, uneven, and hierarchical. Genomic sequences provide crucial information for taxonomic classification, but many existing methods that rely on sequence similarity to reference genomes leave sequences misclassified because reference databases are incomplete or absent. Large language models offer a novel approach to extract intrinsic characteristics from sequences.
Results: We present ICCTax, a classifier integrating the large language model HyenaDNA with complementary-view-based hierarchical metric learning and a hierarchical-level compactness loss to taxonomically classify genomic sequences. ICCTax accurately classifies sequences to 155 genera and 43 phyla across the four superkingdoms, including unseen taxa. Across three datasets built with different strategies, ICCTax outperforms baseline methods, particularly on out-of-distribution data. On Simulated Marine Metagenomic Communities datasets from three oceanic sites, DairyDB-16S rRNA, Tara Oceans, and wastewater metagenomic datasets, it demonstrates strong performance, showcasing real-world applicability. ICCTax can further support identification of novel species and functional genes across diverse environments, enhancing understanding of microbial ecology.
Availability and implementation: Code is available at https://github.com/Ying-Lab/ICCTax.
Title: ICCTax: a hierarchical taxonomic classifier for metagenomic sequences on a large language model. Authors: Yichun Gao, Jiaxing Bai, Feng Zhou, Yushuang He, Ying Wang, Xiaobing Huang. Journal: Bioinformatics Advances 5(1), vbaf257. Pub Date: 2025-10-15. DOI: 10.1093/bioadv/vbaf257. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12619997/pdf/
Motivation: Insect guts may harbor phytase-producing bacteria applicable in poultry nutrition, but only Serratia sp. TN49 and its histidine acid phytase (AEQ29498.1) have been studied for this purpose. Therefore, AEQ29498.1 was used as a query to conduct a homology search for insect-associated bacterial phytases, followed by prediction of their structure and function. This in silico analysis of phytase may lead to the isolation of native phytase-producing bacteria from insect guts, potentially facilitating the production of desirable phytases for use in feed additives.
Results: Twenty-six phytases from bacteria associated with the guts of black soldier fly larvae, fruit flies, and honey bees were identified. The mature chains of these phytases, except for the 4-phytase of Bartonella apis PEB0150, were predicted to carry a positive charge under the acidic conditions of the poultry upper gastrointestinal tract. They are stable (instability indices <40) and belong to the histidine acid phosphatase family, which has been proven to be an effective poultry feed additive. The predicted three-dimensional structure of the mature histidine-type phosphatase of Tatumella sp. JGM130 showed the best model quality and was found to be a homo-tetrameric protein. Molecular docking confirmed phytate binding at the catalytic motif of the histidine acid phosphatase family, RHGVRPP/AP/Q and HD.
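The abstract does not state which tool produced the charge predictions. A common way to estimate the sign of a protein's net charge at a given pH is to apply the Henderson-Hasselbalch relation to each ionizable group and sum the partial charges. The sketch below assumes approximate pKa values (published scales differ), uses pH 3.0 as a stand-in for acidic gut conditions, and runs on a short toy peptide rather than on any of the 26 phytases.

```python
# Approximate pKa values; published scales differ, so treat these as assumptions.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Estimate the net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    # N-terminus: fraction protonated, contributing up to +1.
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # C-terminus: fraction deprotonated, contributing up to -1.
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


toy_peptide = "RHGVRPPDE"  # hypothetical toy input, not a real phytase sequence
acidic = net_charge(toy_peptide, 3.0)   # assumed proxy for acidic gut conditions
neutral = net_charge(toy_peptide, 7.0)
```

At low pH the acidic groups are mostly protonated and lose their negative charge while the basic groups stay protonated, so `acidic` comes out more positive than `neutral`. The same reasoning underlies a prediction that a mature chain is positively charged under acidic conditions.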
Title: In silico analysis of insect-associated bacterial phytases reveals optimal biochemical properties and function in poultry gut condition. Authors: Olyad Erba Urgessa, Ketema Tafess Tulu, Mesfin Tafesse Gemeda, Hunduma Dinka. Journal: Bioinformatics Advances 5(1), vbaf256. Publication date: 2025-10-15. DOI: 10.1093/bioadv/vbaf256. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596144/pdf/
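The charge prediction reported in the Results above can be illustrated with a minimal Henderson-Hasselbalch sketch. This is not the authors' pipeline: the function name `net_charge`, the pKa table, and the example sequence and pH are all assumptions chosen for illustration, and the abstract does not state which tool or pH value was used.

```python
# Assumed pKa values (one common textbook-style set; other scales shift
# the result slightly). "N_term" and "C_term" are the chain termini.
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_term": 2.0}


def net_charge(sequence: str, ph: float) -> float:
    """Estimate the net charge of a mature protein chain at a given pH."""
    seq = sequence.upper()
    charge = 0.0
    # Basic groups contribute +1 when protonated.
    for group, pka in POSITIVE_PKA.items():
        count = 1 if group == "N_term" else seq.count(group)
        charge += count / (1.0 + 10.0 ** (ph - pka))
    # Acidic groups contribute -1 when deprotonated.
    for group, pka in NEGATIVE_PKA.items():
        count = 1 if group == "C_term" else seq.count(group)
        charge -= count / (1.0 + 10.0 ** (pka - ph))
    return charge


if __name__ == "__main__":
    # Hypothetical peptide, not a real phytase sequence.
    example = "MKRHGVRPPTDE"
    # pH 3.0 is an assumed stand-in for acidic upper-gut conditions.
    print(f"net charge at pH 3.0: {net_charge(example, 3.0):+.2f}")
```

A positive value at low pH means the basic residues outnumber the acidic groups that are still deprotonated at that pH, which is the property the Results describe for all but one of the mature chains.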
Pub Date: 2025-10-14; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf231
Anna Špačková, Nina Kadášová, Ivana Hutařová Vařeková, Karel Berka
Motivation: Cytochrome P450 proteins play a crucial role in human metabolism, ranging from hormone production to drug metabolism. Although several common variants have known effects on the performance of individual cytochrome P450 proteins, experimental pathogenicity information is usually limited to only a few mutations. Current pathogenicity prediction software extends this scope by virtually mutating every amino acid to every possible substitution. In this work, we perform a comprehensive exploration that reveals pathogenicity patterns in the human cytochrome P450 family. Pathogenicity was analyzed across proteins using the SIFT, AlphaMissense, and PrimateAI-3D algorithms.
Results: Our findings indicate a progressive increase in pathogenicity along protein tunnels (identified via MOLE) toward the cofactor binding site, underscoring the essential role of cofactor interactions in enzymatic function. Notably, the integrity of the tunnels and the cofactor environment emerges as a critical factor, with even single amino acid alterations potentially disrupting molecular guidance to active sites. These insights highlight the fundamental role of structural pathways in preserving cytochrome P450 functionality, with implications for understanding disease-associated variants and drug metabolism.
Availability and implementation: Data and source code can be found at https://github.com/annaspac/P450_pathogenicity_codes.
Title: Pathogenicity patterns in cytochrome P450 family. Authors: Anna Špačková, Nina Kadášová, Ivana Hutařová Vařeková, Karel Berka. Journal: Bioinformatics Advances 5(1), vbaf231. Publication date: 2025-10-14. DOI: 10.1093/bioadv/vbaf231. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12534787/pdf/
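The virtual mutation step described in the Motivation above, substituting every residue with every other amino acid, can be sketched as a simple enumeration. The function names and the example sequence are hypothetical. The scoring itself (SIFT, AlphaMissense, PrimateAI-3D) is not reproduced here; the sketch only lists the variants that such tools would be asked to score.

```python
from typing import Iterator, Tuple

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_missense_substitutions(sequence: str) -> Iterator[Tuple[int, str, str]]:
    """Yield every single-residue substitution as (position, ref, alt).

    Positions are 1-based, matching the usual protein variant notation.
    """
    for position, ref in enumerate(sequence.upper(), start=1):
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield position, ref, alt


def format_variant(position: int, ref: str, alt: str) -> str:
    """Format a substitution in the compact form 'M1A'."""
    return f"{ref}{position}{alt}"


if __name__ == "__main__":
    # Hypothetical three-residue fragment, not a real P450 sequence.
    variants = list(all_missense_substitutions("MKT"))
    print(len(variants), "variants, first:", format_variant(*variants[0]))
```

Each standard residue yields 19 substitutions, so a protein of length n produces 19n candidate variants. This enumerates all amino acid changes, not only those reachable by a single nucleotide substitution.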
Pub Date: 2025-10-14; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf253
Ruixuan Wang, Lam Tran, Benjamin Brennan, Lars G Fritsche, Kevin He, J Chad Brenner, Hui Jiang
Motivation: Cancer genomic research provides an opportunity to identify cancer risk-associated genes, but often suffers from undesirably low statistical power due to limited sample sizes. Integrated analysis across different cancers has the potential to enhance statistical power for identifying pan-cancer risk genes. However, substantial heterogeneity across cancers makes this challenging.
Results: Recently, a novel asymmetric integration method was developed that can deal with data heterogeneity and exclude unhelpful datasets from the analysis. We adapted and applied this method to integrate genotype datasets with matched case and control individuals from the Michigan Genomics Initiative, using each cancer in turn as the primary dataset of interest and the other cancers as auxiliary datasets. Conditional logistic regression models were coupled with the asymmetric integration framework to handle the matched case-control study design, and permutation tests were performed to control the false discovery rate (FDR). At the same FDR level, the integrated analysis found more potential genetic variants and genes associated with the risks of various cancers, showcasing the promise of the proposed approach for integrated analysis of cancer datasets.
Availability and implementation: Our method is available as source code at https://github.com/rxxwang/integrate_cancer.
Title: Asymmetric integration of various cancer datasets for identifying risk-associated variants and genes. Authors: Ruixuan Wang, Lam Tran, Benjamin Brennan, Lars G Fritsche, Kevin He, J Chad Brenner, Hui Jiang. Journal: Bioinformatics Advances 5(1), vbaf253. Publication date: 2025-10-14. DOI: 10.1093/bioadv/vbaf253. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12576323/pdf/
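The Results above mention permutation tests for FDR control without describing the procedure. As a generic illustration only, and not the authors' implementation, a common permutation-based FDR estimate divides the average number of permuted (null) statistics passing a threshold by the number of observed statistics passing it. The function name, arguments, and toy numbers below are assumptions.

```python
from typing import Sequence


def permutation_fdr(
    observed: Sequence[float],
    null_by_permutation: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Estimate the FDR among observed statistics >= threshold.

    observed: one test statistic per variant from the real labels.
    null_by_permutation: one list per permutation, holding the statistics
        recomputed after shuffling the case/control labels.
    """
    if not null_by_permutation:
        raise ValueError("at least one permutation is required")
    discoveries = sum(1 for s in observed if s >= threshold)
    if discoveries == 0:
        return 0.0
    # Average count of null statistics that pass the same threshold.
    false_counts = [
        sum(1 for s in perm if s >= threshold) for perm in null_by_permutation
    ]
    expected_false = sum(false_counts) / len(false_counts)
    return min(1.0, expected_false / discoveries)


if __name__ == "__main__":
    # Toy numbers for illustration; not data from the study.
    observed = [5.0, 4.0, 1.0, 0.5]
    nulls = [[1.0, 0.2, 4.5, 0.1], [0.3, 0.9, 1.1, 0.4]]
    print(permutation_fdr(observed, nulls, threshold=4.0))
```

For a matched case-control design, the label shuffling that produces `null_by_permutation` would typically be done within each matched set so that the matching structure is preserved. That step is left outside this sketch.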