Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf261
Christian López, Roberto Cárdenas, Longendri Aguilera-Mendoza, Guillermin Agüero-Chapin, Félix Martínez-Rios, César R García-Jacas, Noel Pérez-Pérez, Yovani Marrero-Ponce
Motivation: The rapid growth of bioactive peptide sequences presents challenges for organization and analysis. Existing repositories often specialize in functions, taxonomic origins, or structural classes, but most remain isolated, use heterogeneous metadata, and lack uniform descriptors or structural models. The few integrative web services that exist offer only partial coverage or depth. As a result, reproducible and comprehensive exploration of the bioactive peptide landscape remains limited, underscoring the need for a unified, source-tracked, extensible platform.
Results: We present StarPepWeb, a freely accessible web application that democratizes access to StarPepDB, one of the largest curated repositories of bioactive peptides. The platform integrates 45 120 non-redundant sequences from 40 public databases into a source-tracked graph enriched with metadata, physicochemical features, and predicted 3D structures from ESMFold. Each peptide is represented with ESM-2 embeddings and iFeature descriptors, while the interface supports metadata-aware filtering, alignment-based similarity searches with single and multiple queries, and interactive visualization. A microservice-oriented architecture ensures scalability, maintainability, and reproducible versioned downloads, including Neo4j exports. StarPepWeb thus overcomes deployment and expertise barriers of the standalone database, providing an extensible, cloud-hosted framework for integrative bioactive peptide analysis.
Availability and implementation: StarPepWeb is freely available at https://starpepweb.org. Source code and documentation are hosted at https://github.com/starpep-web.
Citation: StarPepWeb: an integrative, graph-based resource for bioactive peptides. Bioinformatics advances 5(1), vbaf261. Journal Article. Published 2025-10-16. DOI: 10.1093/bioadv/vbaf261. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701796/pdf/
Pub Date: 2025-10-15; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf215
R Travis Moreland, Christine E Schnitzler, Suiyuan Zhang, Sumeeta Singh, Tyra G Wolfsberg, Andreas D Baxevanis
Motivation: The colonial hydroid Hydractinia exhibits several unique biological properties, including its remarkable regenerative capacity and the ability to distinguish self from non-self, characteristics that make these hydroids valuable models for studying human disease and aging. The availability of well-annotated multi-omic data, as well as tools to visualize these data, is essential for advancing the use of these model organisms to enhance our understanding of the relationship between genomic and morphological complexity, the evolution of multicellularity, and the emergence of novel cell types.
Results: We present the Hydractinia Genome Project Portal, a comprehensive resource providing genomic, transcriptomic, and proteomic datasets for two widely studied Hydractinia species. The portal provides extensive sequence, structure, and functional annotation resources that are not available elsewhere, including genome browsers, a single-cell gene expression atlas, a protein structure viewer, and a custom BLAST implementation. We demonstrate the portal's utility for biological discovery and have used a subset of Hydractinia-specific stem cell gene markers to explore known gaps in annotation transfer methods, illustrating how structure-based deep learning methods such as DeepFRI can significantly improve the functional annotation of heretofore unannotated i-cell markers.
Availability and implementation: The Hydractinia Genome Project Portal is freely available at https://research.nhgri.nih.gov/hydractinia.
Citation: The Hydractinia Genome Project Portal: multi-omic annotation and visualization of Hydractinia genomic datasets. Bioinformatics advances 5(1), vbaf215. Journal Article. Published 2025-10-15. DOI: 10.1093/bioadv/vbaf215. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12624445/pdf/
Motivation: Metagenomic data increasingly reflect the coexistence of species from Archaea, Bacteria, Eukaryotes, and Viruses in complex environments. Taxonomic classification across the four superkingdoms is essential for understanding microbial communities, exploring genomic evolutionary relationships, and identifying novel species. This task is inherently imbalanced, uneven, and hierarchical. Genomic sequences provide crucial information for taxonomy classification, but many existing methods relying on sequence similarity to reference genomes often leave sequences misclassified due to incomplete or absent reference databases. Large language models offer a novel approach to extract intrinsic characteristics from sequences.
Results: We present ICCTax, a classifier integrating the large language model HyenaDNA with complementary-view-based hierarchical metric learning and hierarchical-level compactness loss to identify taxonomic genomic sequences. ICCTax accurately classifies sequences to 155 genera and 43 phyla across the four superkingdoms, including unseen taxa. Across three datasets built with different strategies, ICCTax outperforms baseline methods, particularly on Out-of-Distribution data. On Simulated Marine Metagenomic Communities datasets from three oceanic sites, DairyDB-16S rRNA, Tara Oceans, and wastewater metagenomic datasets, it demonstrates strong performance, showcasing real-world applicability. ICCTax can further support identification of novel species and functional genes across diverse environments, enhancing understanding of microbial ecology.
Availability and implementation: Code is available at https://github.com/Ying-Lab/ICCTax.
Citation: Yichun Gao, Jiaxing Bai, Feng Zhou, Yushuang He, Ying Wang, Xiaobing Huang. ICCTax: a hierarchical taxonomic classifier for metagenomic sequences on a large language model. Bioinformatics advances 5(1), vbaf257. Journal Article. Published 2025-10-15. DOI: 10.1093/bioadv/vbaf257. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12619997/pdf/
Motivation: Insect guts may harbor phytase-producing bacteria applicable in poultry nutrition, but only Serratia sp. TN49 and its histidine acid phytase (AEQ29498.1) have been studied for this purpose. Therefore, AEQ29498.1 was used as a query to conduct a homology search for insect-associated bacterial phytases, followed by prediction of their structure and function. This in silico analysis of phytase may lead to the isolation of native phytase-producing bacteria from insect guts, potentially facilitating the production of desirable phytases for use in feed additives.
Results: Twenty-six phytases from bacteria associated with the guts of black soldier fly larvae, fruit flies, and honey bees were identified. The mature chains of these phytases, except for the 4-phytase of Bartonella apis PEB0150, were predicted to carry a positive charge under the acidic conditions of the poultry upper gastrointestinal tract. They are stable (instability indices <40) and belong to the histidine acid phosphatase family, whose members have proven effective as poultry feed additives. The three-dimensional structure of the mature histidine-type phosphatase of Tatumella sp. JGM130 demonstrated the best quality and was found to be a homo-tetrameric protein. Molecular docking confirmed phytate binding at the catalytic motif of the histidine acid phosphatase family, RHGVRPP/AP/Q and HD.
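The charge prediction mentioned above can be illustrated with a minimal Henderson-Hasselbalch sketch. This is not the tool used in the study: the pKa values are one commonly tabulated approximate set, the function name `net_charge` is hypothetical, and any sequence and pH passed in are the caller's assumptions.

```python
# Approximate pKa values (an assumed, commonly tabulated set; other
# tables differ slightly and would shift the result).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Estimate the net charge of a peptide chain at a given pH.

    Each ionizable group contributes its fractional charge according to
    the Henderson-Hasselbalch equation; residues are treated as
    independent, which ignores any structural context.
    """
    # Free amino terminus (positive when protonated).
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    # Free carboxyl terminus (negative when deprotonated).
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return positive - negative
```

Calling `net_charge(mature_chain, 3.0)` with a mature-chain sequence and an assumed acidic pH returns a positive number when basic groups dominate, which is the property the abstract reports for most of the identified phytases.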
Citation: Olyad Erba Urgessa, Ketema Tafess Tulu, Mesfin Tafesse Gemeda, Hunduma Dinka. In silico analysis of insect-associated bacterial phytases reveals optimal biochemical properties and function in poultry gut condition. Bioinformatics advances 5(1), vbaf256. Journal Article. Published 2025-10-15. DOI: 10.1093/bioadv/vbaf256. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596144/pdf/
Pub Date: 2025-10-14; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf231
Anna Špačková, Nina Kadášová, Ivana Hutařová Vařeková, Karel Berka
Motivation: Cytochrome P450 proteins play a crucial role in human metabolism, ranging from hormone production to drug metabolism. Although several well-known variants have documented effects on the performance of individual cytochrome P450 proteins, experimental pathogenicity information is usually limited to only a few mutations. Current pathogenicity prediction software extends this scope by virtually mutating every amino acid to all possible substitutions. In this work, we conduct a comprehensive exploration that unveils pathogenicity patterns in the human cytochrome P450 family. Pathogenicity analysis was conducted across proteins using the SIFT, AlphaMissense, and PrimateAI-3D algorithms.
Results: Our findings indicate a progressive increase in pathogenicity along protein tunnels (identified via MOLE) toward the cofactor binding site, underscoring the essential role of cofactor interactions in enzymatic function. Notably, the integrity of the tunnels and the cofactor environment emerges as a critical factor, with even single amino acid alterations potentially disrupting molecular guidance to active sites. These insights highlight the fundamental role of structural pathways in preserving cytochrome P450 functionality, with implications for understanding disease-associated variants and drug metabolism.
Availability and implementation: Data and source code can be found at https://github.com/annaspac/P450_pathogenicity_codes.
Citation: Pathogenicity patterns in cytochrome P450 family. Bioinformatics advances 5(1), vbaf231. Journal Article. Published 2025-10-14. DOI: 10.1093/bioadv/vbaf231. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12534787/pdf/
Pub Date: 2025-10-14; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf253
Ruixuan Wang, Lam Tran, Benjamin Brennan, Lars G Fritsche, Kevin He, J Chad Brenner, Hui Jiang
Motivation: Cancer genomic research provides an opportunity to identify cancer risk-associated genes, but often suffers from undesirably low statistical power due to a limited sample size. Integrated analysis with different cancers has the potential to enhance statistical power for identifying pan-cancer risk genes. However, substantial heterogeneity across various cancers makes this challenging.
Results: Recently, a novel asymmetric integration method was developed that can deal with data heterogeneity and exclude unhelpful datasets from the analysis. We adapted and applied this method to integrate genotype datasets with matched case and control individuals from the Michigan Genomics Initiative, using each cancer as the primary dataset of interest and the other cancers as auxiliary datasets, respectively. Conditional logistic regression models were coupled with the asymmetric integrated framework to handle the matched case-control study design and permutation tests were performed to control for false discovery rates (FDRs). At the same FDR level, the integrated analysis found more potential genetic variants and genes that are associated with the risks of various cancers, showcasing the promise of the proposed approach for integrated analysis of cancer datasets.
Availability and implementation: Our method is available as source code at https://github.com/rxxwang/integrate_cancer.
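As a rough illustration of how permutation tests can be used to control the FDR, the sketch below estimates the FDR at a given threshold as the average number of permuted statistics passing the threshold divided by the number of observed statistics passing it. This is a generic estimator, not the authors' procedure; the function name `permutation_fdr` and its inputs are hypothetical.

```python
from typing import Sequence


def permutation_fdr(
    observed: Sequence[float],
    permuted: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Estimate the false discovery rate at a test-statistic threshold.

    observed:  one statistic per variant from the real case/control labels.
    permuted:  one list of statistics per label permutation (null draws).
    threshold: variants with a statistic >= threshold are called significant.
    """
    n_called = sum(1 for stat in observed if stat >= threshold)
    if n_called == 0:
        return 0.0  # nothing is called, so no false discoveries are possible
    # Average number of "discoveries" when the labels carry no signal.
    false_counts = [
        sum(1 for stat in null_stats if stat >= threshold)
        for null_stats in permuted
    ]
    expected_false = sum(false_counts) / len(false_counts)
    # The ratio can exceed 1 with few calls, so cap it.
    return min(1.0, expected_false / n_called)
```

In practice one would scan thresholds and keep the most permissive one whose estimated FDR stays below the target level, so that two analyses can be compared by how many variants each calls at the same FDR.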
Citation: Asymmetric integration of various cancer datasets for identifying risk-associated variants and genes. Bioinformatics advances 5(1), vbaf253. Journal Article. Published 2025-10-14. DOI: 10.1093/bioadv/vbaf253. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12576323/pdf/
Pub Date: 2025-10-11; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf245
Muhammed Talo, Serdar Bozdag
Motivation: Understanding protein functions facilitates the identification of the underlying causes of many diseases and guides the research for discovering new therapeutic targets and medications. With the advancement of high-throughput technologies, obtaining novel protein sequences has become routine. However, determining protein functions experimentally is cost- and labor-prohibitive. Therefore, it is crucial to develop computational methods for automatic protein function prediction.
Results: In this study, we propose a multimodal deep learning architecture called ProtFun to predict protein functions. ProtFun integrates protein large language model embeddings as node features in a protein family network. Employing graph attention networks on this protein family network, ProtFun learns protein embeddings, which are integrated with protein signature representations from InterPro to train a protein function prediction model. We evaluated our architecture using three benchmark datasets. Our results showed that our proposed approach outperformed current state-of-the-art methods for most cases. An ablation study also highlighted the importance of different components of ProtFun.
Availability and implementation: The data and source code of ProtFun are available at https://github.com/bozdaglab/ProtFun under the Creative Commons Attribution Non Commercial 4.0 International Public License.
Citation: ProtFun: a protein function prediction model using graph attention networks with a protein large language model. Bioinformatics advances 5(1), vbaf245. Journal Article. Published 2025-10-11. DOI: 10.1093/bioadv/vbaf245. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571506/pdf/
Pub Date: 2025-10-10; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf252
Bishwa Ghimire, Nicholas Booth, Tapio Lönnberg, Tero Aittokallio
Motivation: High-throughput genomic data analysis consists of the inextricably intertwined inputs and outputs of a vast array of bioinformatic analysis tools. To guarantee streamlined and reproducible analyses, the often complex data analysis pipelines need to be run using workflow management tools. Nextflow is one popular tool commonly used to automate such pipelines. Nextflow records key pipeline data, such as the submission time, start time, completion time, CPU usage, memory usage, and disk usage for each task run. These data are stored in log files, often scattered across a file system. As a result, aggregating the resource usage information needed to optimize Nextflow pipelines and improve reproducibility, as well as parsing and managing such log data, can quickly become cumbersome.
Results: Here, we present a web-based tool, Nextpie, which provides both a database and a reporting tool for Nextflow pipelines. Nextpie stores comprehensive resource usage information in a relational database, thus facilitating and accelerating the performance of a variety of data analyses and interactive visualizations, providing an easily comprehensible overview of a pipeline's resource usage.
Availability and implementation: The Nextpie source code, user documentation, an SQLite database with test data, and a Nextflow example pipeline are available at GitHub (https://github.com/bishwaG/Nextpie).
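The idea of moving per-task records from scattered log files into a relational database can be sketched as follows. This is an illustration only and does not reproduce Nextpie's schema or API: the table layout and the function names `load_trace` and `mean_cpu_by_process` are hypothetical, and the input is assumed to be a tab-separated Nextflow trace file with a header row containing `name`, `status`, and `%cpu` columns.

```python
import csv
import io
import sqlite3


def load_trace(conn: sqlite3.Connection, trace_text: str, run_id: str) -> int:
    """Insert the task rows of one trace file; return the number of rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task ("
        "run_id TEXT, process TEXT, name TEXT, status TEXT, cpu_pct REAL)"
    )
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    rows = []
    for record in reader:
        # Task names look like "PROCESS (tag)"; keep the process part.
        process = record["name"].split(" (")[0]
        # CPU usage is formatted like "95.3%"; "-" marks a missing value.
        cpu_text = record.get("%cpu", "").rstrip("%")
        cpu_pct = float(cpu_text) if cpu_text not in ("", "-") else None
        rows.append((run_id, process, record["name"], record["status"], cpu_pct))
    conn.executemany("INSERT INTO task VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def mean_cpu_by_process(conn: sqlite3.Connection) -> list:
    """Return (process, task count, mean CPU %) tuples, sorted by process."""
    cursor = conn.execute(
        "SELECT process, COUNT(*), AVG(cpu_pct) FROM task "
        "GROUP BY process ORDER BY process"
    )
    return cursor.fetchall()
```

A connection from `sqlite3.connect(":memory:")` is enough to try it out; once several runs are loaded under different `run_id` values, per-process summaries become a single SQL query instead of a pass over many log files.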
Citation: Nextpie: a web-based reporting tool and database for reproducible nextflow pipelines. Bioinformatics advances 5(1), vbaf252. Journal Article. Published 2025-10-10. DOI: 10.1093/bioadv/vbaf252. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12574971/pdf/
Pub Date: 2025-10-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf250
Bradley T Martin, Zachery D Zbinden, Michael E Douglas, Marlis R Douglas, Tyler K Chafin
Motivation: Determining the geographic origin of samples is a common objective in wildlife management, forensics, and conservation. Current methods often assume evolutionary models or require extensive reference datasets that are costly and difficult to develop, and they perform poorly with uneven or biased sampling. Supervised deep learning offers a promising alternative by learning complex patterns without prior model specifications. Combined with novel geo-genetic data augmentation and preprocessing techniques, it can reduce reference panel demands and improve performance across diverse sampling schemes, broadening accurate provenance determination to more study systems.
Results: We present GeoGenIE, an open-source software package powered by PyTorch for geographic provenance prediction from genomic data. GeoGenIE implements a multilayer perceptron architecture within an automated hyperparameter tuning framework, incorporating preprocessing, geo-genetic outlier detection, and data augmentation to improve accuracy in sparsely sampled regions. Benchmarked against a comparable approach on white-tailed deer (Odocoileus virginianus) double-digest restriction-site associated DNA sequencing data, GeoGenIE achieved substantially improved geolocation accuracy with less spatial bias using a smaller SNP panel. Gains were most evident in undersampled regions, underscoring its effectiveness under challenging conditions. Its parallelized execution also produced fast runtimes, supporting its application to large datasets.
Availability and implementation: Open-source at https://github.com/btmartin721/geogenie and https://pypi.org/project/GeoGenIE/.
Title: GeoGenIE: a deep learning approach to predict geographic provenance of biodiversity samples from genomic SNPs. Journal: Bioinformatics Advances 5(1), vbaf250. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596584/pdf/
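One way to quantify geolocation accuracy for a provenance predictor is the great-circle distance between the predicted and true sampling coordinates. The sketch below computes that error with the haversine formula and reports the median over samples. It assumes coordinates in decimal degrees and a spherical Earth of radius 6371 km. It is an illustration only: the abstract does not state which error metric GeoGenIE reports, and the function names here are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(true_coords, predicted_coords):
    """Median haversine error over paired (lat, lon) tuples."""
    errors = [
        haversine_km(t[0], t[1], p[0], p[1])
        for t, p in zip(true_coords, predicted_coords)
    ]
    return median(errors)


# Made-up coordinates: predictions drift eastward along the equator.
true_coords = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
predicted_coords = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
print(median_error_km(true_coords, predicted_coords))
```

The median is used here because a few badly misplaced samples can dominate a mean, which matters when some regions are undersampled.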
Pub Date: 2025-10-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf247
Monica-Andreea Baciu-Drăgan, Niko Beerenwinkel
Summary: In recent years, developments in single-cell next-generation sequencing technology and computational methodology have made it possible to reconstruct, with increasing precision, the evolutionary history of tumors and their cell phylogenies, represented as mutation trees. Many mutation tree inference tools exist, but they support neither detailed visual tree inspection nor tree comparison and analysis at the cohort level, an important task in computational oncology. We developed oncotreeVIS, an interactive graphical user interface for visualizing mutation tree cohorts and tree posterior distributions obtained from mutation tree inference tools. OncotreeVIS can display mutation trees that encode single or joint genetic events, such as point mutations and copy number changes, and highlight matching subclones, conserved trajectories, and drug-gene interactions at the cohort level. OncotreeVIS facilitates the visual inspection of mutation tree clusters and pairwise tree distances. It is available both as a JavaScript library that can be used locally and as a web application that can be accessed online. It includes seven default datasets of public mutation tree cohorts for visualization, and new mutation trees can be provided in a predefined JSON format.
Availability and implementation: https://cbg-ethz.github.io/oncotreeVIS.
Title: OncotreeVIS - an interactive graphical user interface for visualizing mutation tree cohorts. Journal: Bioinformatics Advances 5(1), vbaf247. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596139/pdf/
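To make the notions of a JSON-encoded mutation tree and a pairwise tree distance concrete, here is a minimal sketch. The nested `label`/`children` layout is a hypothetical encoding, not the predefined JSON format that oncotreeVIS expects, and the Jaccard distance over parent-child edges is a simple measure chosen for illustration, not necessarily one the tool uses. The gene labels are made-up examples.

```python
import json


def edge_set(node, parent=None):
    """Collect (parent_label, child_label) pairs from a nested tree dict."""
    edges = set()
    if parent is not None:
        edges.add((parent, node["label"]))
    for child in node.get("children", []):
        edges |= edge_set(child, node["label"])
    return edges


def edge_jaccard_distance(tree_a, tree_b):
    """1 - |shared edges| / |all edges|; 0.0 means identical edge sets."""
    a, b = edge_set(tree_a), edge_set(tree_b)
    union = a | b
    if not union:
        return 0.0  # two single-node trees have no edges to compare
    return 1.0 - len(a & b) / len(union)


# Tree 1 is linear (root -> TP53 -> KRAS); tree 2 branches at the root.
tree_1 = json.loads(
    '{"label": "root", "children": '
    '[{"label": "TP53", "children": [{"label": "KRAS"}]}]}'
)
tree_2 = json.loads(
    '{"label": "root", "children": [{"label": "TP53"}, {"label": "KRAS"}]}'
)

print(edge_jaccard_distance(tree_1, tree_2))
```

The two example trees share one edge out of three distinct edges, so they are closer to each other than to a tree with entirely different mutations, which is the kind of signal a cohort-level clustering view relies on.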