Pub Date: 2025-10-27; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf262
Arvid Harder, Jerry Guintivano, Joëlle A Pasman, Patrick F Sullivan, Yi Lu
Motivation: Genome-wide association studies (GWAS) have transformed human genetics by identifying tens of thousands of trait-associated variants, enabling applications from drug discovery to polygenic risk prediction. These advancements depend critically on open sharing of GWAS summary statistics. However, a lack of standardized formats complicates downstream analyses, requiring extensive dataset-specific "munging" before analysis can proceed.
Results: Here we present tidyGWAS, an R package that streamlines this process by cleanly separating data validation and harmonization from quality control. tidyGWAS uses curated data to repair and harmonize variant identifiers across genome builds, imputes missing columns when possible, and validates summary statistics with minimal filters. Outputs are saved as partitioned parquet files, optimized for high-throughput analysis via the arrow package. Benchmarked against existing tools, tidyGWAS is up to 6.5× faster and substantially more memory efficient. Additionally, we implement a fixed-effects meta-analysis directly on tidyGWAS output, achieving up to 10× speedup over existing software. tidyGWAS simplifies and accelerates statistical genetic workflows, improving reproducibility and scalability for large-scale genetic analyses.
Availability and implementation: The package, reference data, and Docker containers are freely available for broad adoption.
Title: TidyGWAS: a scalable approach for standardized cleaning of genome-wide association study summary statistics.
Journal: Bioinformatics advances, 5(1), vbaf262. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12597892/pdf/
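The fixed-effects meta-analysis mentioned in the tidyGWAS abstract is usually computed per variant with inverse-variance weights. The following is a minimal Python sketch of that standard estimator, not the tidyGWAS implementation (which is written in R and is not described in the abstract). The function name `fixed_effects_meta`, the `MetaResult` type, and the two example studies are hypothetical, and the sketch assumes all effect sizes are already aligned to the same effect allele.

```python
import math
from typing import NamedTuple, Sequence


class MetaResult(NamedTuple):
    beta: float  # pooled effect estimate
    se: float    # standard error of the pooled estimate
    z: float     # Wald z-score, beta / se


def fixed_effects_meta(betas: Sequence[float], ses: Sequence[float]) -> MetaResult:
    """Inverse-variance-weighted fixed-effects meta-analysis for one variant."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return MetaResult(beta, se, beta / se)


# Two hypothetical studies reporting the same variant (illustrative numbers only).
result = fixed_effects_meta(betas=[0.10, 0.20], ses=[0.05, 0.10])
```

Because the more precise study (standard error 0.05) carries four times the weight of the other, the pooled estimate sits closer to 0.10 than to 0.20.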
Pub Date: 2025-10-25; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf268
Camilo Villaman, Irene Cartas-Espinel, Mauricio Saez, Alberto J M Martin
Motivation: CTCF is a conserved protein involved in the establishment and maintenance of topologically associating domains (TADs) and loops. Alzheimer's disease (AD) represents the most common form of dementia, affecting over 50 million elderly individuals. Epigenetic alterations are a hallmark of AD, and epigenetic disruptions are able to affect CTCF binding and looping. Understanding the dynamics of CTCF loops behind AD may lead to new, undiscovered contributions of CTCF to the etiology of AD. To understand the dynamics behind CTCF loops, we developed a CTCF loop predictor using different genomic and epigenomic features, such as CTCF motif information, CTCF protein binding information, and different histone marks.
Results: We obtained F-scores of over 0.9 in GM12878 and K562 cell lines. We reported the importance of each feature in classification, and compared the results with other loop predictors. After testing the predictor, we predicted loops in control and AD data, reported a loop disruption score, and selected the top disrupted loops in AD, all of which were previously linked with AD in the literature. Our study contributes to a better understanding of the role of CTCF binding and CTCF loops in gene regulation, and highlights new clues about CTCF in the etiology and development of AD.
Availability and implementation: The method can be found in https://github.com/networkbiolab/jalpy.
Title: Gaining insights into Alzheimer's disease by predicting chromatin spatial organization.
Journal: Bioinformatics advances, 5(1), vbaf268. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12627407/pdf/
Pub Date: 2025-10-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf265
Annette E Dodge, Andrew Williams, Danielle P M LeBlanc, David M Schuster, Elena Esina, Charles C Valentine, Jesse J Salk, Alex Y Maslov, Chris Bradley, Carole L Yauk, Francesco Marchetti, Matthew J Meier
Motivation: Error-corrected next-generation sequencing (ECS) methods are increasingly used to assess mutagenicity and other genetic toxicology endpoints. The lack of open and standardized bioinformatic workflows and tools poses challenges to data reproducibility, comparability, and consistency in interpretation for its application in genetic toxicity assessment.
Results: We present MutSeqR, an open source R package to analyse ECS mutation data for genetic toxicology studies. MutSeqR offers practical variant filtering, comparative analysis of mutation frequency between experimental conditions, dose-response assessment via benchmark dose calculations, mutation spectrum analysis, and clonality analyses. We demonstrate MutSeqR's application using published datasets on mice treated with benzo[a]pyrene or benzo[b]fluoranthene, analysed using Duplex Sequencing and SMM-seq, respectively. MutSeqR's flexible functions enable reproducible analyses across ECS platforms, facilitating research and regulatory applications in mutagenicity testing.
Availability and implementation: MutSeqR is freely available under an open source license at https://github.com/EHSRB-BSRSE-Bioinformatics/MutSeqR. Implemented in R (version 3.4.0 or greater), it supports all major operating systems. Sequencing data for Project 1 has been deposited in the Sequence Read Archive under accession number PRJNA803048. Variant call files for Project 2 are available on Mendeley Data (doi: 10.17632/65dnysxym8.1).
Title: MutSeqR: an open source R package for standardized analysis of error-corrected next-generation sequencing data in genetic toxicology.
Journal: Bioinformatics advances, 5(1), vbaf265. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12645840/pdf/
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf258
Abrar Rahman Abir, Liqing Zhang
Motivation: Designing RNA molecules that can specifically bind to target proteins is fundamental to numerous biological and therapeutic applications. However, existing approaches to protein-conditioned RNA design primarily focus on structural alignment or sequence recovery, often ignoring essential biophysical factors such as molecular stability and thermodynamic feasibility.
Results: To address this gap, we propose RNA-EFM, a novel deep learning framework that integrates energy-based refinement with flow matching for protein-conditioned RNA sequence and structure co-design. RNA-EFM consists of two complementary components: a flow matching objective that supervises geometric alignment between predicted and native RNA backbone structures, and an energy-based idempotent refinement that iteratively improves RNA structure predictions by minimizing both structural error and physical energy. The energy refinement is guided by biophysical priors including the Lennard-Jones potential and sequence-derived free energy, ensuring that the generated RNAs are not only geometrically plausible but also thermodynamically stable. We demonstrate the effectiveness of RNA-EFM through extensive experiments. RNA-EFM significantly outperforms state-of-the-art baselines in terms of RMSD, lDDT, sequence recovery, and binding energy improvement. These results highlight the importance of incorporating biophysical constraints into RNA design and establish RNA-EFM as a promising framework.
Availability and implementation: The source code for RNA-EFM is available at: https://github.com/abrarrahmanabir/RNA-EFM.
Title: RNA-EFM: energy-based flow matching for protein-conditioned RNA sequence-structure co-design.
Journal: Bioinformatics advances, 5(1), vbaf258. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701795/pdf/
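The RNA-EFM abstract names the Lennard-Jones potential as one of the biophysical priors guiding energy refinement. As a rough illustration of what such a term computes, here is a Python sketch of the textbook 12-6 form, V(r) = 4ε[(σ/r)^12 − (σ/r)^6], summed over all pairs of points. The function names, the reduced units (ε = σ = 1), and the all-pairs summation are assumptions made for illustration; the abstract does not state how RNA-EFM parameterizes this term or combines it with the sequence-derived free energy.

```python
import math


def lennard_jones(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """12-6 Lennard-Jones pair energy at separation r (same length unit as sigma)."""
    if r <= 0:
        raise ValueError("separation must be positive")
    sr6 = (sigma / r) ** 6
    # Repulsive (sigma/r)^12 term minus attractive (sigma/r)^6 term.
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_lj_energy(coords, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Sum the pair energy over all unordered pairs of 3D points."""
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            energy += lennard_jones(r, epsilon, sigma)
    return energy


# Three hypothetical backbone positions in reduced units (illustrative only).
example_energy = total_lj_energy([(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.2, 0.0)])
```

The pair energy is zero at r = σ and reaches its minimum of −ε at r = 2^(1/6)·σ, so minimizing a term like this penalizes steric clashes while mildly rewarding contacts near the preferred distance.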
Summary: Proteomics has developed many approaches to inform the subcellular organization of proteins, each with differing coverage and sensitivity to distinct scales. Here, we develop a self-supervised deep learning framework, ProteinProjector, that flexibly integrates all available data for a protein from any number of modalities, resulting in a unified map of protein position. As an initial proof of concept, we integrate four proteome-wide characterizations of HEK293 human embryonic kidney cells, including protein affinity purification, proximity ligation, and size-exclusion-chromatography mass spectrometry (AP-MS, PL-MS, SEC-MS), as well as protein fluorescent imaging. Map coverage and accuracy grow substantially as new data modes are added, with maximal recovery of known complexes observed when using all four proteomic datasets. We find that ProteinProjector outperforms individual modalities and other integration methods in recovery of orthogonal functional and physical associations not used during training. ProteinProjector provides a foundation for integration of diverse modalities that characterize subcellular structure.
Availability and implementation: ProteinProjector is available as part of the Cell Mapping Toolkit at https://github.com/idekerlab/cellmaps_coembedding.
Title: Unifying proteomic technologies with ProteinProjector.
Authors: Leah V Schaffer, Mayank Jain, Rami Nasser, Roded Sharan, Trey Ideker
Pub Date: 2025-10-22; DOI: 10.1093/bioadv/vbaf266
Journal: Bioinformatics advances, 5(1), vbaf266. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12680973/pdf/
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf259
Patrik Waldmann
Motivation: Artificial selection improves desired traits, but reduces genetic diversity within populations. Modern breeding programs aim to balance genetic gain with the maintenance of genetic variation to ensure long-term sustainability. Optimum contribution selection (OCS) is a widely adopted strategy that maximizes genetic gain while limiting the rate of inbreeding, traditionally relying on pedigree data. However, genomic relationship matrices offer a more accurate measure of genetic relatedness. A subsequent step to OCS involves mate allocation (MA) to optimize breeding plans, which often presents significant computational challenges for large datasets.
Results: We developed a two-stage genomic OCS and mate allocation (GOCSMA) method implemented in JuMP/Julia. The OCS problem is formulated as a linear program with quadratic constraints and solved efficiently using the conic operator splitting method (COSMO). The subsequent MA problem, expressed as a mixed integer program, is solved with the SCIP framework's branch-cut-and-price algorithm. Applying GOCSMA to the simulated QTLMAS2010 dataset, we observed efficient convergence for OCS, which balanced genetic gain with coancestry constraints better than traditional top selection. The MA stage consistently achieved very low runtimes (less than 0.01 seconds), with integer mating constraints providing lower coancestry and higher genetic gain than binary constraints, indicating a more optimal mating scheme. Hence, GOCSMA provides an efficient deterministic mathematical optimization framework for integrated genomic OCS and MA. Using advanced solvers within the flexible JuMP environment, our method offers a robust solution to balance genetic gain and diversity in large-scale breeding programs.
Availability and implementation: Source code and documentation are available at https://github.com/patwa67/GOCSMA.
Title: Genomic optimum contribution selection and mate allocation using JuMP.
Journal: Bioinformatics advances, 5(1), vbaf259. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12619993/pdf/
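The GOCSMA abstract describes the OCS stage as a linear program with quadratic constraints. A common textbook statement of that problem is sketched below; the notation is an assumption made here rather than taken from the paper: c is the vector of candidate contributions, g the vector of (genomic) breeding values, G the genomic relationship matrix, and θ the upper bound on group coancestry. Practical formulations often add sex-specific constraints (for example, male and female contributions each summing to one half), and the abstract does not say which variant the paper uses.

```latex
\begin{aligned}
\max_{c}\quad & c^{\top} g \\
\text{s.t.}\quad & \tfrac{1}{2}\, c^{\top} G\, c \le \theta, \\
& \mathbf{1}^{\top} c = 1, \\
& c_i \ge 0 \quad \text{for all } i .
\end{aligned}
```

The objective is linear in c, while the coancestry bound is a convex quadratic constraint as long as G is positive semidefinite, which is what makes the problem suitable for a conic solver such as COSMO.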
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf264
Bernard Isekah Osang'ir, Surya Gupta, Ziv Shkedy, Jürgen Claesen
Motivation: Insights from integrative multi-omics analyses have fueled demand for innovative computational methods and tools in multi-omics research. However, the scarcity of multi-omics datasets with user-defined signal structures hinders the evaluation of these newly developed tools. SUMO (SimUlating Multi-Omics), an open-source R package, was developed to address this gap by enabling the generation of high-quality factor analysis-based datasets with full control over the dataset's structure such as latent structures, noise, and complexity. Users can configure datasets with distinct and/or shared non-overlapping latent factors, enabling flexible and precise control over the signal structures. Consequently, SUMO allows reproducible testing and validation of methods, fostering methodological innovation.
Availability and implementation: The SUMO R package is freely available and accessible on the Comprehensive R Archive Network https://doi.org/10.32614/CRAN.package.SUMO and on GitHub https://github.com/lucp12891/SUMO.git under CC-BY 4.0 license.
Title: SUMO: an R package for simulating multi-omics data for methods development and testing.
Journal: Bioinformatics advances, 5(1), vbaf264. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12630132/pdf/
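To make the idea of factor analysis-based simulation with non-overlapping latent factors concrete, here is a small Python sketch of a linear factor model, X = Z Wᵀ + E, in which each latent factor loads only on its own block of features. This is not the SUMO API (SUMO is an R package, and its functions are not described in the abstract); the function name `simulate_factor_omic`, its parameters, and the Gaussian choices for scores, loadings, and noise are all assumptions made for illustration.

```python
import random


def simulate_factor_omic(n_samples, n_features, factor_blocks, noise_sd=1.0, seed=0):
    """Simulate one omics matrix X = Z W^T + E from a linear factor model.

    factor_blocks is a list of (start, stop) feature index ranges, one per
    latent factor. The ranges are expected not to overlap, so each factor
    drives its own block of features.
    """
    rng = random.Random(seed)
    k = len(factor_blocks)
    # Z: per-sample latent factor scores (n_samples x k), standard normal.
    scores = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_samples)]
    # W: loadings (n_features x k), non-zero only inside each factor's block.
    loadings = [[0.0] * k for _ in range(n_features)]
    for f, (start, stop) in enumerate(factor_blocks):
        for j in range(start, stop):
            loadings[j][f] = rng.gauss(0.0, 1.0)
    # X = Z W^T + E, with independent Gaussian noise E.
    data = [
        [
            sum(scores[i][f] * loadings[j][f] for f in range(k))
            + rng.gauss(0.0, noise_sd)
            for j in range(n_features)
        ]
        for i in range(n_samples)
    ]
    return data, scores, loadings


# 20 samples, 10 features, two factors acting on features 0-2 and 5-7.
data, scores, loadings = simulate_factor_omic(
    n_samples=20, n_features=10, factor_blocks=[(0, 3), (5, 8)], noise_sd=0.5, seed=42
)
```

Features outside every block (here 3, 4, 8, and 9) contain noise only, which gives a known ground truth against which a method's recovered signal can be checked. Fixing the seed makes the dataset reproducible.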
Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf260
Bruno Thiago de Lima Nichio, Roxana Beatriz Ribeiro Chaves, Jeroniza Nunes Marchaukoski, Fabio de Oliveira Pedrosa, Roberto Tadeu Raittz
Motivation: Biological nitrogen fixation is a vital process for global ecosystems and agriculture; however, the diversity and complexity of nif genes present significant challenges for the accurate identification of Nif proteins. Existing computational tools are often limited to a narrow subset of nif genes, leaving many important protein classes unexplored. NifFinder was developed to address this gap, combining SWeeP vector representation with neural network models to predict up to 24 different Nif proteins. By expanding the predictive scope and improving accuracy, NifFinder provides a more comprehensive and reliable framework to study nitrogen fixation, supporting both evolutionary insights and applications in agricultural sustainability.
Results: We present NifFinder, a computational framework that integrates SWeeP vector encoding with neural network classifiers to predict up to 24 different Nif protein classes across Archaea and Bacteria. NifFinder achieved an average accuracy of 84.31%, with sensitivity (86.49%), precision (81.97%), F1-score (82.33%), and a class correlation coefficient of 0.94. Benchmarking against Nif curated resources showed strong agreement and robust classification even under class imbalance. By expanding beyond traditional subsets of nif genes, NifFinder enables more reliable genome-wide identification of Nif proteins.
Availability and implementation: The NifFinder installation instructions and source code can be accessed at https://sourceforge.net/projects/NifFinder.
Title: NifFinder: improved Nif protein prediction using SWeeP vectors and neural networks.
Journal: Bioinformatics advances, 5(1), vbaf260. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664700/pdf/
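The NifFinder abstract reports accuracy, sensitivity, precision, and F1-score for a multi-class classifier. The sketch below shows one way such figures can be derived from true and predicted labels in Python. It assumes macro-averaging (an unweighted mean over classes); the abstract does not say which averaging scheme was used. The function name `macro_metrics` and the toy labels are hypothetical and are not NifFinder output.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, sensitivity (recall), and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precision = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recall = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precision) / n,
        "sensitivity": sum(recall) / n,
        "f1": sum(f1) / n,
    }


# Toy labels for three Nif protein classes (illustrative only).
y_true = ["NifH", "NifH", "NifD", "NifK"]
y_pred = ["NifH", "NifD", "NifD", "NifK"]
metrics = macro_metrics(y_true, y_pred)
```

Macro-averaging gives every class equal weight regardless of its size, so a rare class that is poorly predicted lowers the score as much as a common one. That makes it a natural choice when class imbalance is a concern.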
Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf261
Christian López, Roberto Cárdenas, Longendri Aguilera-Mendoza, Guillermin Agüero-Chapin, Félix Martínez-Rios, César R García-Jacas, Noel Pérez-Pérez, Yovani Marrero-Ponce
Motivation: The rapid growth of bioactive peptide sequences presents challenges for organization and analysis. Existing repositories often specialize in functions, taxonomic origins, or structural classes, but most remain isolated, use heterogeneous metadata, and lack uniform descriptors or structural models. The few integrative web services that exist offer only partial coverage or depth. As a result, reproducible and comprehensive exploration of the bioactive peptide landscape remains limited, underscoring the need for a unified, source-tracked, extensible platform.
Results: We present StarPepWeb, a freely accessible web application that democratizes access to StarPepDB, one of the largest curated repositories of bioactive peptides. The platform integrates 45 120 non-redundant sequences from 40 public databases into a source-tracked graph enriched with metadata, physicochemical features, and predicted 3D structures from ESMFold. Each peptide is represented with ESM-2 embeddings and iFeature descriptors, while the interface supports metadata-aware filtering, alignment-based similarity searches with single and multiple queries, and interactive visualization. A microservice-oriented architecture ensures scalability, maintainability, and reproducible versioned downloads, including Neo4j exports. StarPepWeb thus overcomes deployment and expertise barriers of the standalone database, providing an extensible, cloud-hosted framework for integrative bioactive peptide analysis.
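As a rough illustration of what "non-redundant" and "source-tracked" can mean when records from several databases are merged, the sketch below collapses identical sequences into one entry while remembering every database that reported them. The `merge_sources` helper, the database names, and the example sequences are all hypothetical, and exact matching on a normalised sequence string is an assumption made for illustration; the abstract does not describe StarPepDB's actual integration procedure.

```python
from collections import defaultdict


def merge_sources(records):
    """Collapse (source, sequence) pairs into unique sequences with provenance.

    Hypothetical helper: redundancy is defined here as an exact match after
    trimming whitespace and upper-casing, which is an illustrative assumption.
    """
    provenance = defaultdict(set)
    for source, sequence in records:
        # Normalise so trivially different spellings of a sequence merge.
        key = sequence.strip().upper()
        if key:  # skip empty entries
            provenance[key].add(source)
    # Sort keys and source lists so the result is deterministic.
    return {seq: sorted(srcs) for seq, srcs in sorted(provenance.items())}


# Made-up records from two made-up databases.
records = [
    ("db_a", "GIGKFLHSAKKF"),
    ("db_b", "gigkflhsakkf "),  # same peptide, different formatting
    ("db_b", "KWKLFKKIEK"),
]
merged = merge_sources(records)
for sequence, sources in merged.items():
    print(sequence, sources)
```

Each key in `merged` is one non-redundant sequence and each value lists the databases it came from, which is the kind of provenance a source-tracked graph would store as edges between a peptide node and its origin databases.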
Availability and implementation: StarPepWeb is freely available at https://starpepweb.org. Source code and documentation are hosted at https://github.com/starpep-web.
Title: StarPepWeb: an integrative, graph-based resource for bioactive peptides. Journal: Bioinformatics Advances, 5(1), vbaf261. Publication date: 2025-10-16. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701796/pdf/
Pub Date: 2025-10-15; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf215
R Travis Moreland, Christine E Schnitzler, Suiyuan Zhang, Sumeeta Singh, Tyra G Wolfsberg, Andreas D Baxevanis
Motivation: The colonial hydroid Hydractinia exhibits several unique biological properties, including a remarkable regenerative capacity and the ability to distinguish self from non-self, which make it a valuable model for studying human disease and aging. The availability of well-annotated multi-omic data, as well as tools to visualize these data, is essential for advancing the use of these model organisms to enhance our understanding of the relationship between genomic and morphological complexity, the evolution of multicellularity, and the emergence of novel cell types.
Results: We present the Hydractinia Genome Project Portal, a comprehensive resource providing genomic, transcriptomic, and proteomic datasets for two widely studied Hydractinia species. The portal provides extensive sequence, structure, and functional annotation resources that are not available elsewhere, including genome browsers, a single-cell gene expression atlas, a protein structure viewer, and a custom BLAST implementation. We demonstrate the portal's utility for biological discovery and have used a subset of Hydractinia-specific stem cell gene markers to explore known gaps in annotation transfer methods, illustrating how structure-based deep learning methods such as DeepFRI can significantly improve the functional annotation of heretofore unannotated i-cell markers.
Availability and implementation: The Hydractinia Genome Project Portal is freely available at https://research.nhgri.nih.gov/hydractinia.
Title: The Hydractinia Genome Project Portal: multi-omic annotation and visualization of Hydractinia genomic datasets. Journal: Bioinformatics Advances, 5(1), vbaf215. Publication date: 2025-10-15. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12624445/pdf/