Pub Date: 2025-11-03; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf273
Sergio Hernández-Galaz, Andrés Hernández-Olivera, Felipe Villanelo, Alvaro Lladser, Alberto J M Martin
Summary: Computational analysis of single-cell RNA sequencing (scRNA-seq) data presents significant barriers for researchers lacking programming expertise, particularly for multi-dataset integration, scalable job management, and reproducible workflows. We developed scExplorer, a web-based platform that addresses these limitations through three key innovations: (i) comprehensive batch correction using four state-of-the-art algorithms (ComBat, Scanorama, BBKNN, and Harmony); (ii) SLURM-based job scheduling with pause/resume functionality for large-scale analyses; and (iii) automated generation of publication-ready reports with exportable configuration files ensuring complete reproducibility. The platform's modular Docker architecture supports both standalone and client-server deployments, enabling analysis of datasets ranging from thousands to hundreds of thousands of cells. An openly documented REST API clarifies how the interface orchestrates analyses and supports transparent operation. scExplorer eliminates the technical barriers that prevent non-computational researchers from performing rigorous scRNA-seq analysis while maintaining the transparency and reproducibility standards required for collaborative research.
Availability and implementation: https://apps.cienciavida.org/scexplorer/.
Citation: scExplorer: a comprehensive web server for single-cell RNA sequencing data analysis. Bioinformatics Advances 5(1), vbaf273. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12627405/pdf/
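The SLURM-based scheduling with pause/resume mentioned in the summary above can be pictured with a small sketch. It only builds command lines for the standard SLURM tools (`sbatch`, `scontrol suspend`, `scontrol resume`) and does not run them. The function names, job name, script path, and resource values are hypothetical, and nothing here is taken from the scExplorer code base.

```python
import shlex

def sbatch_command(job_name, script_path, cpus=4, mem_gb=16):
    """Build (but do not run) an sbatch command for one analysis job.
    Job name, script path and resources are illustrative assumptions."""
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--cpus-per-task={cpus}",
        f"--mem={mem_gb}G",
        script_path,
    ]

def pause_command(job_id):
    """Command that asks SLURM to suspend a running job."""
    return ["scontrol", "suspend", str(job_id)]

def resume_command(job_id):
    """Command that asks SLURM to resume a suspended job."""
    return ["scontrol", "resume", str(job_id)]

# Hypothetical job: a quality-control step wrapped in a shell script.
cmd = sbatch_command("scrna_qc", "run_qc.sh")
print(shlex.join(cmd))
print(shlex.join(pause_command(12345)))
print(shlex.join(resume_command(12345)))
```

On a real cluster, suspending and resuming jobs usually requires operator or administrator privileges, so a web platform would typically issue these commands from a privileged service account.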
Pub Date: 2025-10-31; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf272
Pietro Cinaglia, Mario Cannataro
Motivation: Deciphering the dynamics of gene expression is essential for understanding intricate biological mechanisms. Network science, and Gene Co-expression Networks (GCNs) in particular, can address this effectively. However, a typical GCN is based on a static model, which limits its ability to reflect changes that occur over time. To overcome this issue, we designed an open-source, user-friendly web-service for constructing temporal networks from genotype-tissue expression data: COnstructing Real-world TEmporal networks (CoRTE).
Results: CoRTE constructs a temporal network from a statistical analysis of gene co-expression across successive age ranges, which define an ordered set of time points. We investigated gene co-expression dynamics across age groups in brain tissues associated with Alzheimer's Disease, processing curated aging-related data via the proposed web-service. The web-service generated a temporal network consisting of gene pairs that showed statistically significant co-expression over time. The results demonstrate its capacity to capture time-dependent gene interactions relevant to aging-related disease progression. From an application standpoint, CoRTE may be particularly suitable for exploring aging-related changes, disease development, and other time-dependent biological events.
Availability and implementation: CoRTE is freely available at https://github.com/pietrocinaglia/corte-ws.
Citation: CoRTE: a web-service for constructing temporal networks from genotype-tissue expression data. Bioinformatics Advances 5(1), vbaf272. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12633645/pdf/
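The idea of a temporal co-expression network over ordered age ranges, as described in the abstract above, can be sketched as follows. This is an assumption-laden illustration rather than CoRTE's method: it replaces the statistical test with a simple cutoff on the absolute Pearson correlation, and it keeps a gene pair only if the cutoff is met at every time point. The gene names, age ranges, expression values, and the `min_abs_r` parameter are all made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length lists (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def temporal_network(expr_by_age, min_abs_r=0.8):
    """expr_by_age maps each age range (in temporal order) to {gene: [values per sample]}.
    Returns {(gene_a, gene_b): [r at each time point]} for pairs with |r| >= min_abs_r
    at every time point. The cutoff stands in for a proper significance test."""
    time_points = list(expr_by_age)
    genes = sorted(set.intersection(*(set(expr_by_age[t]) for t in time_points)))
    network = {}
    for a, b in combinations(genes, 2):
        rs = [pearson(expr_by_age[t][a], expr_by_age[t][b]) for t in time_points]
        if all(abs(r) >= min_abs_r for r in rs):
            network[(a, b)] = rs
    return network

# Toy expression data for three hypothetical genes in two age ranges.
expr = {
    "20-29": {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8.5], "G3": [4, 1, 3, 2]},
    "30-39": {"G1": [2, 3, 5, 7], "G2": [1, 2, 4, 5], "G3": [1, 5, 2, 4]},
}
network = temporal_network(expr)
for pair, rs in network.items():
    print(pair, [round(r, 3) for r in rs])
```

Each retained pair carries one correlation value per age range, so the result can be read as a sequence of network snapshots ordered in time.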
Pub Date: 2025-10-31; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf274
Linfeng Wang, Susana Campino, Taane G Clark, Jody E Phelan
Motivation: Tuberculosis, caused by Mycobacterium tuberculosis, remains a global health challenge driven by rising antibiotic resistance. Antimicrobial peptides offer a promising alternative due to membrane-disruptive activity and low resistance potential, yet the scarcity of TB-specific AMP data constrains targeted development. We present a reproducible deep learning protocol that integrates long short-term memory networks with transfer learning to classify and generate TB-active peptides.
Results: Classifiers were pretrained on a large corpus of general AMPs and fine-tuned on curated TB-specific sequences using frozen encoder and full backpropagation strategies. We benchmarked four model variants [unidirectional and bidirectional long short-term memories (LSTMs), with and without attention] on a held-out TB test set; the unidirectional LSTM with a frozen encoder achieved the best performance (accuracy 90%, AUC 0.97). In parallel, LSTM-based generative models were trained to produce de novo TB-active peptides. A generator trained exclusively on TB data produced 94 of 100 peptides predicted as antimicrobial by AMP Scanner, outperforming transfer learning-based generators. Generated peptides were evaluated for antimicrobial activity, toxicity, structure, and AMP-like physicochemical traits, and four candidates shared ≥84% identity with known TB-AMPs.
Availability and implementation: The complete model and data can be found at: https://github.com/linfeng-wang/TB-AMP-design.
Citation: Long short-term memory-based deep learning model for the discovery of antimicrobial peptides targeting Mycobacterium tuberculosis. Bioinformatics Advances 5(1), vbaf274. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12603352/pdf/
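The classifier results above are reported as accuracy and AUC. As a reminder of what these two numbers measure, here is a minimal sketch on made-up labels and scores. It is not the authors' evaluation code, and the 0.5 decision threshold is an assumption.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = TB-active peptide, 0 = inactive (values are illustrative only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print("accuracy:", accuracy(labels, scores))
print("AUC:", roc_auc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are often reported together.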
Pub Date: 2025-10-29; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf263
Raziyeh Masumshah, Changiz Eslahchi
Motivation: Integrating heterogeneous biological data is a central challenge in bioinformatics, especially when modeling complex relationships among entities such as drugs, diseases, and molecular features. Existing methods often rely on static or separate feature extraction processes, which may fail to capture interactions across diverse feature types and reduce predictive accuracy.
Results: To address these limitations, we propose PSO-FeatureFusion, a unified framework that combines particle swarm optimization with neural networks to jointly integrate and optimize features from multiple biological entities. By modeling pairwise feature interactions and learning their optimal contributions, the framework captures individual feature signals and their interdependencies in a task-agnostic and modular manner. We applied PSO-FeatureFusion to two bioinformatics tasks (drug-drug interaction prediction and drug-disease association prediction) using multiple benchmark datasets. In both tasks, the framework achieved strong performance across evaluation metrics, often outperforming or matching state-of-the-art baselines, including deep learning and graph-based models. The method also demonstrated robustness with limited hyperparameter tuning and flexibility across datasets with varying feature structures. PSO-FeatureFusion provides a scalable and practical solution for researchers working with high-dimensional biological data. Its adaptability and interpretability make it well-suited for applications in drug discovery, disease prediction, and other bioinformatics domains.
Availability and implementation: The source code and datasets are available at https://github.com/raziyehmasumshah/PSO-FeatureFusion.
Citation: PSO-FeatureFusion: a general framework for fusing heterogeneous features via particle swarm optimization. Bioinformatics Advances 5(1), vbaf263. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596698/pdf/
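To make the particle swarm optimization component named above more concrete, the following sketch runs a textbook global-best PSO to learn two fusion weights. It is not the PSO-FeatureFusion implementation: there is no neural network, the objective is a toy mean squared error, and the feature values, target weights (0.3 and 0.7), and hyperparameters (`w`, `c1`, `c2`) are assumptions chosen for illustration.

```python
import random

def pso(objective, dim, n_particles=30, iters=100, bounds=(0.0, 1.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Toy fusion task: two feature scores per sample, target = 0.3 * a + 0.7 * b.
features = [(0.2, 0.9), (0.8, 0.1), (0.5, 0.5), (0.9, 0.7)]
targets = [0.3 * a + 0.7 * b for a, b in features]

def mse(weights):
    """Mean squared error of the weighted feature combination."""
    return sum((weights[0] * a + weights[1] * b - t) ** 2
               for (a, b), t in zip(features, targets)) / len(features)

best_weights, best_mse = pso(mse, dim=2)
print("weights:", [round(v, 3) for v in best_weights], "mse:", best_mse)
```

In a fusion setting, the objective would instead be a validation loss of the downstream predictor, and each particle would encode the contribution of one feature pair.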
Pub Date: 2025-10-29; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf270
Stephan Grein, Tabea Elschner, Ronja Kardinal, Johanna Bruder, Akim Strohmeyer, Karthikeyan Gunasekaran, Jennifer Witt, Hildigunnur Hermannsdóttir, Janina Behrens, Mueez U-Din, Jiangyan Yu, Gerhard Heldmaier, Renate Schreiber, Jan Rozman, Markus Heine, Ludger Scheja, Anna Worthmann, Joerg Heeren, Dagmar Wachten, Kerstin Wilhelm-Jüngling, Alexander Pfeifer, Jan Hasenauer, Martin Klingenspor
Motivation: Indirect calorimetry is the standard method for metabolic phenotyping of animal models in pre-clinical research, supported by mature experimental protocols and widely used commercial platforms. However, there is still no flexible, extensible, and user-friendly software suite that enables standardized integration of data and metadata from diverse metabolic phenotyping platforms, followed by unified statistical analysis and visualization.
Results: We present Shiny-Calorie, an open-source interactive application for transparent data and metadata integration, comprehensive statistical data analysis, and visualization of indirect calorimetry datasets. Shiny-Calorie supports the majority of standard data formats across commercial metabolic phenotyping platforms, such as TSE, Sable Systems, COSMED, and CLAMS/Columbus Instruments, and exports processed data in standardized formats. Built using GNU R with a reactive interface, Shiny-Calorie enables intuitive exploration of complex, multi-modal longitudinal datasets comprising categorical, continuous, ordinal, and count variables. The platform incorporates state-of-the-art statistical methods for robust hypothesis testing, thereby facilitating biologically meaningful interpretation of energy metabolism phenotypes, including resting metabolic rate and energy expenditure. Together, these features streamline routine analysis workflows and enhance reproducibility and transparency in metabolic phenotyping studies.
Availability and implementation: Shiny-Calorie is freely available at https://shiny.iaas.uni-bonn.de/Shiny-Calorie/. User documentation and source code are available at https://github.com/ICB-DCM/Shiny-Calorie. A docker image is available from https://hub.docker.com/r/stephanmg/Shiny-Calorie. Instructional screen recordings are available on https://www.youtube.com/@shiny-calorie.
Citation: Shiny-Calorie: a context-aware application for indirect calorimetry data analysis and visualization using R. Bioinformatics Advances 6(1), vbaf270. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12867577/pdf/
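Energy expenditure, one of the phenotypes mentioned above, is commonly derived from gas exchange measurements. The sketch below uses the abbreviated Weir equation and the respiratory exchange ratio as a worked example. The abstract does not state which formula Shiny-Calorie applies, so this is an assumption, and the input values are made up. The example is in Python rather than R only to keep the illustration dependency-free.

```python
def respiratory_exchange_ratio(vo2, vco2):
    """RER = VCO2 / VO2 (both in the same units)."""
    return vco2 / vo2

def weir_energy_expenditure(vo2_l_min, vco2_l_min):
    """Abbreviated Weir equation: kcal/min = 3.941 * VO2 + 1.106 * VCO2,
    with VO2 and VCO2 in litres per minute. Returns kcal per day."""
    kcal_per_min = 3.941 * vo2_l_min + 1.106 * vco2_l_min
    return kcal_per_min * 1440  # minutes per day

# Illustrative measurement: VO2 = 0.25 L/min, VCO2 = 0.20 L/min.
vo2, vco2 = 0.25, 0.20
print("RER:", respiratory_exchange_ratio(vo2, vco2))
print("EE (kcal/day):", round(weir_energy_expenditure(vo2, vco2), 1))
```

Rodent calorimetry systems usually report gas exchange in ml/h, so values would need converting to L/min (or the coefficients rescaled) before applying the formula.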
Pub Date: 2025-10-27; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf269
Yuexi Gu, Yongheng Sun, Louxin Zhang, Jian Zu
Motivation: Drug combinations are crucial in combating drug resistance, reducing toxicity, and improving therapeutic outcomes in disease management. Because a large number of drugs are available, the number of potential combinations increases exponentially, making it impractical to rely solely on biological experiments to identify synergistic combinations. Consequently, machine learning methods are increasingly being used to find synergistic drug combinations. Most existing methods pursue predictive performance through auxiliary data or complex models but neglect the underlying biological mechanisms, which limits their accuracy in predicting synergistic drug combinations.
Results: We present DSA-DeepFM, a deep learning model that integrates a dual-stage attention (DSA) mechanism with Factorization Machines (FMs) to predict synergistic two-drug combinations by addressing complex biological feature interactions. The model incorporates categorical and auxiliary numerical inputs to capture both field-aware and embedding-aware patterns. These patterns are then processed by a deep FM module, which captures low- and high-order feature interactions before making the final predictions. Validation testing demonstrates that DSA-DeepFM significantly outperforms traditional machine learning and state-of-the-art deep learning models. Furthermore, t-SNE visualizations confirm the discriminative power of the model at various stages. Additionally, we use our model to identify eight novel synergistic drug combinations, underscoring its practical utility and potential for future applications.
Availability and implementation: Source code is available at https://github.com/gracygyx/DSA-DeepFM.
Citation: DSA-DeepFM: a dual-stage attention-enhanced DeepFM model for predicting anticancer synergistic drug combinations. Bioinformatics Advances 5(1), vbaf269. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12609172/pdf/
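The Factorization Machine (FM) component referred to above models pairwise feature interactions through inner products of latent vectors. The sketch below shows only the standard second-order FM score, computed both directly and with the usual linear-time reformulation. It omits the dual-stage attention and the deep network, and the weights, latent vectors, and input are invented for illustration.

```python
def fm_naive(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def fm_fast(x, w0, w, V):
    """Same score via the identity
    sum_{i<j} <V_i, V_j> x_i x_j = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2]."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Three features with two-dimensional latent vectors (illustrative values).
x = [1.0, 0.0, 2.0]
w0 = 0.1
w = [0.2, -0.1, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(fm_naive(x, w0, w, V), fm_fast(x, w0, w, V))
```

The reformulated version costs O(kn) instead of O(kn^2), which is what makes FM layers practical for wide, sparse inputs such as one-hot encoded drug and cell-line identifiers.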
Pub Date: 2025-10-27; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf262
Arvid Harder, Jerry Guintivano, Joëlle A Pasman, Patrick F Sullivan, Yi Lu
Motivation: Genome-wide association studies (GWAS) have transformed human genetics by identifying tens of thousands of trait-associated variants, enabling applications from drug discovery to polygenic risk prediction. These advancements depend critically on open sharing of GWAS summary statistics. However, a lack of standardized formats complicates downstream analyses, requiring extensive dataset-specific "munging" before analysis can proceed.
Results: Here we present tidyGWAS, an R package that streamlines this process by cleanly separating data validation and harmonization from quality control. tidyGWAS uses curated data to repair and harmonize variant identifiers across genome builds, imputes missing columns when possible, and validates summary statistics with minimal filters. Outputs are saved as partitioned parquet files, optimized for high-throughput analysis via the arrow package. Benchmarked against existing tools, tidyGWAS is up to 6.5× faster and substantially more memory efficient. Additionally, we implement a fixed-effects meta-analysis directly on tidyGWAS output, achieving up to 10× speedup over existing software. tidyGWAS simplifies and accelerates statistical genetic workflows, improving reproducibility and scalability for large-scale genetic analyses.
Availability and implementation: The package, reference data, and Docker containers are freely available for broad adoption.
Citation: TidyGWAS: a scalable approach for standardized cleaning of genome-wide association study summary statistics. Bioinformatics Advances 5(1), vbaf262. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12597892/pdf/
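The fixed-effects meta-analysis mentioned above is typically an inverse-variance weighted average of per-study effect sizes. The sketch below shows that calculation for a single variant. It is a generic illustration in Python, not the tidyGWAS R implementation, and the effect sizes and standard errors are invented.

```python
import math

def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects estimate for one variant.
    betas: per-study effect sizes; ses: their standard errors.
    Returns (pooled beta, pooled standard error, z-score)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(wt * b for wt, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se

# Two hypothetical studies reporting the same variant on the same effect allele.
beta, se, z = fixed_effects_meta([0.10, 0.20], [0.05, 0.10])
print(round(beta, 4), round(se, 4), round(z, 3))
```

The calculation assumes the studies have already been harmonized to the same effect allele and genome build, which is the kind of preprocessing the abstract attributes to tidyGWAS.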
Pub Date: 2025-10-25; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf268
Camilo Villaman, Irene Cartas-Espinel, Mauricio Saez, Alberto J M Martin
Motivation: CTCF is a conserved protein involved in the establishment and maintenance of topologically associating domains (TADs) and loops. Alzheimer's disease (AD) is the most common form of dementia, affecting over 50 million elderly individuals. Epigenetic alterations are a hallmark of AD, and epigenetic disruptions can affect CTCF binding and looping. Understanding the dynamics of CTCF loops in AD may reveal new, undiscovered contributions of CTCF to the etiology of AD. To study these dynamics, we developed a CTCF loop predictor using different genomic and epigenomic features, such as CTCF motif information, CTCF protein binding information, and different histone marks.
Results: We obtained F-scores of over 0.9 in GM12878 and K562 cell lines. We reported the importance of each feature in classification, and compared the results with other loop predictors. After testing the predictor, we predicted loops in control and AD data, reported a score of loop disruption and selected the top disrupted loops on AD which were all previously linked with AD in bibliography. Our study contributes to a better understanding of the role of CTCF binding and CTCF loops in gene regulation, and highlights new clues about CTCF in the etiology and development of AD.
Availability and implementation: The method is available at https://github.com/networkbiolab/jalpy.
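The F-scores quoted in the Results can be made concrete with a small sketch. It assumes the reported metric is the usual F1 score (the abstract does not say which F-measure was used), and the counts in the example are hypothetical, not results from the paper.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from confusion-matrix counts; beta=1 gives the usual F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # no true positives at all
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical loop-prediction counts, for illustration only:
# 90 loops correctly predicted, 10 false loops, 10 missed loops.
example_f1 = f_score(tp=90, fp=10, fn=10)
```

With these made-up counts, precision and recall are both 0.9, so the F1 score is also 0.9, which is the threshold mentioned in the abstract.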
{"title":"Gaining insights into Alzheimer's disease by predicting chromatin spatial organization.","authors":"Camilo Villaman, Irene Cartas-Espinel, Mauricio Saez, Alberto J M Martin","doi":"10.1093/bioadv/vbaf268","DOIUrl":"10.1093/bioadv/vbaf268","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf268"},"PeriodicalIF":2.8,"publicationDate":"2025-10-25","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12627407/pdf/","platform":"Semanticscholar","paperid":"145565500","PMCID":"OA"}
Pub Date: 2025-10-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf265
Annette E Dodge, Andrew Williams, Danielle P M LeBlanc, David M Schuster, Elena Esina, Charles C Valentine, Jesse J Salk, Alex Y Maslov, Chris Bradley, Carole L Yauk, Francesco Marchetti, Matthew J Meier
Motivation: Error-corrected next-generation sequencing (ECS) methods are increasingly used to assess mutagenicity and other genetic toxicology endpoints. The lack of open and standardized bioinformatic workflows and tools poses challenges to data reproducibility, comparability, and consistency of interpretation when ECS is applied to genetic toxicity assessment.
Results: We present MutSeqR, an open source R package to analyse ECS mutation data for genetic toxicology studies. MutSeqR offers practical variant filtering, comparative analysis of mutation frequency between experimental conditions, dose-response assessment via benchmark dose calculations, mutation spectrum analysis, and clonality analyses. We demonstrate MutSeqR's application using published datasets on mice treated with benzo[a]pyrene or benzo[b]fluoranthene, analysed using Duplex Sequencing and SMM-seq, respectively. MutSeqR's flexible functions enable reproducible analyses across ECS platforms, facilitating research and regulatory applications in mutagenicity testing.
Availability and implementation: MutSeqR is freely available under an open source license at https://github.com/EHSRB-BSRSE-Bioinformatics/MutSeqR. Implemented in R (version 3.4.0 or greater), it supports all major operating systems. Sequencing data for Project 1 has been deposited in the Sequence Read Archive under accession number PRJNA803048. Variant call files for Project 2 are available on Mendeley Data (doi: 10.17632/65dnysxym8.1).
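The comparison of mutation frequency between experimental conditions mentioned in the Results can be illustrated with a minimal sketch. It is written in Python for illustration and is not MutSeqR's API (MutSeqR is an R package). It assumes mutation frequency is defined as mutations divided by total bases sequenced, and all function names and sample counts are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple

Sample = Tuple[int, int]  # (mutation_count, bases_sequenced)


def mutation_frequency(mutation_count: int, bases_sequenced: int) -> float:
    """Mutations per base sequenced for one sample."""
    if bases_sequenced <= 0:
        raise ValueError("bases_sequenced must be positive")
    return mutation_count / bases_sequenced


def fold_change(treated: Iterable[Sample], control: Iterable[Sample]) -> float:
    """Ratio of mean mutation frequency in treated vs. control samples."""
    treated_mf = mean(mutation_frequency(m, b) for m, b in treated)
    control_mf = mean(mutation_frequency(m, b) for m, b in control)
    if control_mf == 0:
        raise ValueError("control mutation frequency is zero")
    return treated_mf / control_mf


# Hypothetical counts, for illustration only (not data from the paper).
treated_samples = [(20, 100_000_000), (30, 100_000_000)]
control_samples = [(5, 100_000_000), (5, 100_000_000)]
example_fold_change = fold_change(treated_samples, control_samples)
```

A real analysis would add a statistical model and dose-response assessment on top of this ratio; the sketch only shows the basic per-sample quantity being compared.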
{"title":"MutSeqR: an open source R package for standardized analysis of error-corrected next-generation sequencing data in genetic toxicology.","authors":"Annette E Dodge, Andrew Williams, Danielle P M LeBlanc, David M Schuster, Elena Esina, Charles C Valentine, Jesse J Salk, Alex Y Maslov, Chris Bradley, Carole L Yauk, Francesco Marchetti, Matthew J Meier","doi":"10.1093/bioadv/vbaf265","DOIUrl":"https://doi.org/10.1093/bioadv/vbaf265","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf265"},"PeriodicalIF":2.8,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12645840/pdf/","platform":"Semanticscholar","paperid":"145643562","PMCID":"OA"}
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf258
Abrar Rahman Abir, Liqing Zhang
Motivation: Designing RNA molecules that can specifically bind to target proteins is fundamental to numerous biological and therapeutic applications. However, existing approaches to protein-conditioned RNA design primarily focus on structural alignment or sequence recovery, often ignoring essential biophysical factors such as molecular stability and thermodynamic feasibility.
Results: To address this gap, we propose RNA-EFM, a novel deep learning framework that integrates energy-based refinement with flow matching for protein-conditioned RNA sequence and structure co-design. RNA-EFM consists of two complementary components: a flow matching objective that supervises geometric alignment between predicted and native RNA backbone structures, and an energy-based idempotent refinement that iteratively improves RNA structure predictions by minimizing both structural error and physical energy. The energy refinement is guided by biophysical priors including the Lennard-Jones potential and sequence-derived free energy, ensuring that the generated RNAs are not only geometrically plausible but also thermodynamically stable. We demonstrate the effectiveness of RNA-EFM through extensive experiments. RNA-EFM significantly outperforms state-of-the-art baselines in terms of RMSD, lDDT, sequence recovery, and binding energy improvement. These results highlight the importance of incorporating biophysical constraints into RNA design and establish RNA-EFM as a promising framework.
Availability and implementation: The source code for RNA-EFM is available at: https://github.com/abrarrahmanabir/RNA-EFM.
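The Lennard-Jones potential named in the Results as one of the biophysical priors has a standard closed form, V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below evaluates it over all atom pairs. It is not taken from the RNA-EFM code base: the function names, the default ε and σ values, and the plain pairwise sum are assumptions for illustration, and the abstract does not specify how the energy term enters the refinement.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def lennard_jones(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Standard 12-6 Lennard-Jones pair energy at separation r."""
    if r <= 0:
        raise ValueError("separation r must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_lj_energy(coords: Sequence[Point],
                    epsilon: float = 1.0,
                    sigma: float = 1.0) -> float:
    """Sum the pair energy over all unordered pairs of points."""
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])  # Euclidean distance
            energy += lennard_jones(r, epsilon, sigma)
    return energy


# Two points at the potential minimum r = 2^(1/6) * sigma (hypothetical geometry).
r_min = 2 ** (1 / 6)
example_energy = total_lj_energy([(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)])
```

The pair energy is zero at r = σ and reaches its minimum of −ε at r = 2^(1/6)σ, so a refinement step that lowers this term pushes atoms away from steric clashes and toward favorable contact distances.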
{"title":"RNA-EFM: energy-based flow matching for protein-conditioned RNA sequence-structure co-design.","authors":"Abrar Rahman Abir, Liqing Zhang","doi":"10.1093/bioadv/vbaf258","DOIUrl":"10.1093/bioadv/vbaf258","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf258"},"PeriodicalIF":2.8,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701795/pdf/","platform":"Semanticscholar","paperid":"145758385","PMCID":"OA"}