Bioinformatics advances: latest articles, page 8

ProkBERT PhaStyle: accurate phage lifestyle prediction with pretrained genomic language models.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-09 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf188

Judit Juhász, Noémi Ligeti-Nagy, Babett Bodnár, János Juhász, Sándor Pongor, Balázs Ligeti

Motivation: Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome or virome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches often rely on database comparisons that require significant effort and expertise to update. We propose using genomic language models (LMs) for phage lifestyle classification, allowing efficient direct analysis from nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases. We trained three genomic LMs (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented sequences. These models were then compared with dedicated phage lifestyle prediction methods in terms of accuracy, prediction speed, and generalization capability.

Results: ProkBERT PhaStyle achieves accuracy comparable to, and in many cases higher than, state-of-the-art models across various scenarios. It demonstrates the ability to generalize to unseen data in our benchmarks, accurately classifies phages from extreme environments, and also demonstrates high inference speed.

Availability and implementation: Genomic LMs offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle's simplicity, speed, and performance suggest its utility in various ecological and clinical applications.
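
The abstract notes that the models are trained on short, fragmented sequences. The following is a minimal sketch of how a longer contig might be cut into fixed-length fragments and how per-fragment labels could be combined by majority vote. The fragment length, the voting rule, and the `toy_predict` stand-in are assumptions made for illustration; they are not taken from the paper, and `toy_predict` has no biological meaning.

```python
from collections import Counter
from typing import Callable, List


def fragment_sequence(seq: str, length: int = 512, stride: int = 512) -> List[str]:
    """Split a nucleotide sequence into windows of `length`, stepping by `stride`.

    A trailing piece shorter than `length` is kept, so a short contig
    still yields at least one fragment.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    seq = seq.upper()
    fragments = []
    for start in range(0, len(seq), stride):
        fragments.append(seq[start:start + length])
        if start + length >= len(seq):
            break  # the window already reached the end of the sequence
    return fragments


def classify_by_vote(
    seq: str,
    predict_fragment: Callable[[str], str],
    length: int = 512,
    stride: int = 512,
) -> str:
    """Label each fragment with `predict_fragment` and return the most common label."""
    votes = Counter(predict_fragment(f) for f in fragment_sequence(seq, length, stride))
    if not votes:
        raise ValueError("cannot classify an empty sequence")
    return votes.most_common(1)[0][0]


def toy_predict(fragment: str) -> str:
    """Hypothetical stand-in for a fine-tuned genomic LM (GC-content rule, illustration only)."""
    gc = fragment.count("G") + fragment.count("C")
    return "temperate" if gc > len(fragment) / 2 else "virulent"


print(classify_by_vote("GGCC" * 3 + "ATAT", toy_predict, length=4, stride=4))
```

In practice `predict_fragment` would wrap whichever trained classifier is being evaluated; the voting step is only one of several plausible ways to aggregate fragment-level predictions.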

Citations: 0

GlobDB: a comprehensive species-dereplicated microbial genome resource.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-09 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf280

Daan R Speth, Nick Pullen, Samuel T N Aroney, Benjamin L Coltman, Jay Osvatic, Ben J Woodcroft, Thomas Rattei, Michael Wagner

Motivation: Over the past years, substantial numbers of microbial species' genomes have been deposited outside of conventional INSDC databases.

Results: The GlobDB aggregates 14 independent genomic catalogues to provide a comprehensive database of species-dereplicated microbial genomes, with consistent taxonomy, annotations, and additional analysis resources. The GlobDB more than doubles the number of microbial species represented by genomes relative to the field standard genome taxonomy database.

Availability and implementation: The GlobDB is available at https://globdb.org/.
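
Species dereplication, as mentioned in the abstract, means keeping one representative genome per species. A minimal sketch of one common way to do this is greedy clustering: genomes are visited from highest to lowest quality, and each genome either joins an existing representative or becomes a new one. The 0.95 similarity threshold, the quality scores, and the `similarity` callback are assumptions for illustration; the abstract does not describe GlobDB's actual procedure.

```python
from typing import Callable, Dict, List


def dereplicate(
    genomes: Dict[str, float],
    similarity: Callable[[str, str], float],
    threshold: float = 0.95,
) -> Dict[str, List[str]]:
    """Greedy dereplication.

    `genomes` maps genome name -> quality score. Genomes are visited in
    descending quality; each joins the first representative whose similarity
    is >= `threshold`, otherwise it becomes a new representative.
    Returns representative -> list of cluster members (representative first).
    """
    clusters: Dict[str, List[str]] = {}
    for name in sorted(genomes, key=lambda g: (-genomes[g], g)):
        for rep in clusters:
            if similarity(name, rep) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Toy pairwise similarities (hypothetical values, e.g. fractional ANI).
toy_ani = {
    frozenset(("A", "B")): 0.97,
    frozenset(("A", "C")): 0.80,
    frozenset(("B", "C")): 0.82,
}


def toy_similarity(x: str, y: str) -> float:
    return toy_ani[frozenset((x, y))]


print(dereplicate({"A": 0.99, "B": 0.90, "C": 0.95}, toy_similarity))
```

Visiting genomes in quality order means the representative of each cluster is its highest-quality member, which is the usual reason for choosing a greedy scheme.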

Citations: 0

BEREN: a bioinformatic tool for recovering giant viruses, polinton-like viruses, and virophages in metagenomic data.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-08 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf284

Benjamin Minch, Mohammad Moniruzzaman

Motivation: Viruses in the kingdom Bamfordvirae, specifically giant viruses (NCLDVs) in the phylum Nucleocytoviricota and smaller members in the Preplasmiviricota phylum, are widespread and important groups of viruses that infect eukaryotes. While viruses in this kingdom, such as giant viruses, polinton-like viruses, and virophages, have gained large interest from researchers in recent years, there is still a lack of streamlined tools for the recovery of their genomes from metagenomic datasets.

Results: Here, we present BEREN, a comprehensive bioinformatic tool to unlock the diversity of these viruses in metagenomes through five modules for NCLDV genome, contig, and marker gene recovery, metabolic protein annotation, and Preplasmiviricota genome identification and annotation. BEREN's performance was benchmarked against other mainstream virus recovery tools using a mock metagenome, demonstrating superior recovery rates of NCLDV contigs and Preplasmiviricota genomes. Overall, BEREN offers a user-friendly, transparent bioinformatic solution for studying the ecological and functional roles of these eukaryotic viruses, facilitating broader access to their metagenomic analysis.

Availability and implementation: BEREN is available at https://gitlab.com/benminch1/BEREN, and results from testing BEREN on a real-world metagenome are available in the Supplementary Files.

Citations: 0

Matrix-based vector representations in neural networks for classifying molecular biology data.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-08 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf251

Loris Nanni, Sheryl Brahnam, Daniel Fusaro

Summary: Selecting an appropriate classifier is essential for achieving accurate classification. In this study, we propose novel neural network (NN)-based alternatives to standard classifiers such as support vector machines. NNs, particularly convolutional neural networks and transformer networks, have shown exceptional performance in processing image data. To leverage this capability, we explore methods for transforming 1D vector data into 2D matrix representations, enabling the application of NNs pre-trained on large-scale image datasets. Specifically, we introduce a new data restructuring technique based on Wigner transforms, and we compare many methods proposed in the literature. The effectiveness and robustness of our approach are assessed using various benchmark datasets, from peptide classification to DNA barcoding classification, demonstrating consistently strong performance.

Availability and implementation: All source code and related resources used in this work are made publicly available at https://github.com/LorisNanni/Matrix-Representation-of-Vectors-in-Neural-Networks-for-Data-Classification.
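
To illustrate what turning a 1D feature vector into a 2D matrix can look like, here is a minimal sketch of the simplest possible restructuring: zero-padding the vector to the next square size and reshaping it row by row. This is a generic baseline chosen for illustration, not the Wigner-transform technique introduced in the paper, and the function name and `fill` parameter are assumptions.

```python
import math
from typing import List, Sequence


def vector_to_matrix(vec: Sequence[float], fill: float = 0.0) -> List[List[float]]:
    """Reshape a 1D vector into the smallest square matrix that can hold it.

    Values are written row by row; unused cells are set to `fill`.
    """
    n = len(vec)
    if n == 0:
        raise ValueError("vector must not be empty")
    side = math.isqrt(n - 1) + 1  # integer ceil(sqrt(n)) for n >= 1
    padded = list(vec) + [fill] * (side * side - n)
    return [padded[r * side:(r + 1) * side] for r in range(side)]


for row in vector_to_matrix([1, 2, 3, 4, 5]):
    print(row)
```

The resulting matrix can then be treated as a single-channel image; how well an image-pretrained network performs on it depends heavily on the restructuring method, which is the question the paper investigates.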

Citations: 0

WebCMap: an R package for high-throughput connectivity analysis within the CMap framework.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-05 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf278

Hongen Kang, Yin-Ying Wang, Peilin Jia

Motivation: Experimentally generated drug-induced transcriptomic signatures are valuable resources to infer candidate drugs for unseen transcriptomes. The Connectivity Map (CMap) includes over 720 000 compound-induced signatures and has been widely used in drug repurposing. However, the computational resources required for an unbiased screen across all these signatures, along with the inconsistent results from different methods, pose major challenges for connectivity analyses.

Results: In this study, we developed WebCMap, an R package to search for candidate compounds with similar or reverse activities across all CMap drug-induced signatures. WebCMap implements six widely used methods and a meta-score to evaluate the consistency among these methods. Through a web-accelerated framework, pre-calculated statistics for the permutation test, and multi-core parallelization, WebCMap enables fast screening and retrieval of the results on personal computers within a reasonable time.

Availability and implementation: WebCMap is available at https://github.com/geneprophet/WebCMap.
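
The general idea of a connectivity score with a permutation test can be sketched as follows. The score used here (mean signature value of the query's up-regulated genes minus that of its down-regulated genes) is a deliberately simple illustration and is not claimed to be one of the six methods WebCMap implements; the function names, the toy signature, and the permutation scheme are likewise assumptions. Python is used for consistency with the other sketches on this page, although WebCMap itself is an R package.

```python
import random
from statistics import mean
from typing import Dict, Sequence


def connectivity_score(signature: Dict[str, float], up: Sequence[str], down: Sequence[str]) -> float:
    """Positive when the compound signature mimics the query, negative when it reverses it."""
    return mean(signature[g] for g in up) - mean(signature[g] for g in down)


def permutation_p_value(
    signature: Dict[str, float],
    up: Sequence[str],
    down: Sequence[str],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Two-sided empirical p-value from random gene sets of the same sizes as `up` and `down`."""
    rng = random.Random(seed)
    genes = sorted(signature)
    observed = abs(connectivity_score(signature, up, down))
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(up) + len(down))
        null = connectivity_score(signature, sample[:len(up)], sample[len(up):])
        if abs(null) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical compound signature: gene -> differential expression value.
toy_signature = {"g1": 2.0, "g2": 1.5, "g3": -1.0, "g4": -2.0, "g5": 0.1, "g6": -0.2}
score = connectivity_score(toy_signature, ["g1", "g2"], ["g3", "g4"])
p = permutation_p_value(toy_signature, ["g1", "g2"], ["g3", "g4"], n_perm=200)
print(score, p)
```

Because the null distribution depends only on the signature and the gene-set sizes, it can be computed once and reused across queries, which is presumably the motivation for the pre-calculated permutation statistics mentioned in the abstract.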

Citations: 0

zAMP and zAMPExplorer: reproducible scalable amplicon-based metagenomics analysis and visualization.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-04 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf255

Valentin Scherz, Sedreh Nassirnia, Farid Chaabane, Violeta Castelo-Szekely, Gilbert Greub, Trestan Pillonel, Claire Bertelli

Summary: To enable flexible, scalable, and reproducible microbiota profiling, we have developed zAMP, an open-source bioinformatics pipeline for the analysis of amplicon sequence data, such as 16S rRNA gene for bacteria and archaea or ITS for fungi. zAMP is complemented by two modules: one to process databases to optimize taxonomy assignment, and the second to benchmark primers, databases and classifier performances. Coupled with zAMPExplorer, an interactive R Shiny application that provides an intuitive interface for quality control, diversity analysis, and statistical testing, this complete toolbox addresses both research and clinical needs in microbiota profiling.

Availability and implementation: Comprehensive documentation and tutorials are provided alongside the source code of zAMP and zAMPExplorer software to facilitate installation and use. zAMP is implemented as a Snakemake workflow, ensuring reproducibility by running within Singularity or Docker containers, and is also easily installable via Bioconda. The zAMPExplorer application, designed for visualization and statistical analysis, can be installed using either a Docker image or from R-universe.

Citations: 0

scExplorer: a comprehensive web server for single-cell RNA sequencing data analysis.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-11-03 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf273

Sergio Hernández-Galaz, Andrés Hernández-Olivera, Felipe Villanelo, Alvaro Lladser, Alberto J M Martin

Summary: Computational analysis of single-cell RNA sequencing (scRNA-seq) data presents significant barriers for researchers lacking programming expertise, particularly for multi-dataset integration, scalable job management, and reproducible workflows. We developed scExplorer, a web-based platform that addresses these limitations through three key innovations: comprehensive batch correction using four state-of-the-art algorithms (ComBat, Scanorama, BBKNN, and Harmony), SLURM-based job scheduling with pause/resume functionality for large-scale analyses, and automated generation of publication-ready reports with exportable configuration files ensuring complete reproducibility. The platform's modular Docker architecture supports both standalone and client-server deployments, enabling analysis of datasets ranging from thousands to hundreds of thousands of cells. An openly documented REST API clarifies how the interface orchestrates analyses and supports transparent operation. scExplorer eliminates the technical barriers that prevent non-computational researchers from performing rigorous scRNA-seq analysis while maintaining the transparency and reproducibility standards required for collaborative research.

Availability and implementation: https://apps.cienciavida.org/scexplorer/.

Citations: 0

CoRTE: a web-service for constructing temporal networks from genotype-tissue expression data.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-10-31 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf272

Pietro Cinaglia, Mario Cannataro

Motivation: A comprehensive and in-depth deciphering of gene expression dynamics is essential for understanding intricate biological mechanisms; these dynamics can be addressed effectively via network science, specifically through Gene Co-expression Networks (GCNs). However, a typical GCN is based on a static model, which limits its ability to reflect changes that occur over time. To overcome this issue, we designed an open-source, user-friendly web-service for constructing temporal networks from genotype-tissue expression data: COnstructing Real-world TEmporal networks (CoRTE).

Results: CoRTE bases the construction of a temporal network on the statistical analysis of the related gene co-expressions across successive age ranges, to define an ordered set of time points. In our experiments, we investigated gene co-expression dynamics across age groups in brain tissues associated with Alzheimer's Disease, processing curated aging-related data via the proposed web-service. The web-service generated a temporal network consisting of a set of gene pairs that showed statistically significant co-expression over time. Results demonstrated its capacity to capture time-dependent gene interactions relevant for aging-related disease progression. From a purely applicative point of view, CoRTE may be particularly suitable for exploring aging-related changes, disease development, and other time-dependent biological events.

Availability and implementation: CoRTE is freely available at https://github.com/pietrocinaglia/corte-ws.
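
The idea of a temporal co-expression network, with one snapshot per age range, can be sketched in a few lines. In this minimal example each snapshot keeps the gene pairs whose absolute Pearson correlation reaches a fixed cutoff. The cutoff stands in for the statistical significance test mentioned in the abstract, and the data layout, function names, and toy values are assumptions for illustration rather than CoRTE's actual implementation.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple

Expression = Dict[str, Sequence[float]]  # gene -> expression values across samples


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def temporal_network(
    expression_by_age: Dict[str, Expression],
    age_order: Sequence[str],
    min_abs_r: float = 0.8,
) -> List[Tuple[str, List[Tuple[str, str]]]]:
    """Return one (age range, edge list) snapshot per time point, in the given order."""
    snapshots = []
    for age in age_order:
        expr = expression_by_age[age]
        edges = [
            (g1, g2)
            for g1, g2 in combinations(sorted(expr), 2)
            if abs(pearson(expr[g1], expr[g2])) >= min_abs_r
        ]
        snapshots.append((age, edges))
    return snapshots


# Toy data: three genes, four samples per age range (hypothetical values).
toy = {
    "20-29": {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2]},
    "30-39": {"A": [1, 2, 3, 4], "B": [3, 1, 4, 2], "C": [8, 6, 4, 2]},
}
for age, edges in temporal_network(toy, ["20-29", "30-39"]):
    print(age, edges)
```

Comparing consecutive snapshots shows which co-expression edges appear or disappear with age, which is the kind of change a static GCN cannot represent.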

Citations: 0

Long short-term memory-based deep learning model for the discovery of antimicrobial peptides targeting Mycobacterium tuberculosis.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-10-31 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf274

Linfeng Wang, Susana Campino, Taane G Clark, Jody E Phelan

Motivation: Tuberculosis, caused by Mycobacterium tuberculosis, remains a global health challenge driven by rising antibiotic resistance. Antimicrobial peptides offer a promising alternative due to membrane-disruptive activity and low resistance potential, yet the scarcity of TB-specific AMP data constrains targeted development. We present a reproducible deep learning protocol that integrates long short-term memory networks with transfer learning to classify and generate TB-active peptides.

Results: Classifiers were pretrained on a large corpus of general AMPs and fine-tuned on curated TB-specific sequences using frozen encoder and full backpropagation strategies. We benchmarked four model variants [unidirectional and bidirectional long short-term memories (LSTMs), with and without attention] on a held-out TB test set; the unidirectional LSTM with a frozen encoder achieved the best performance (accuracy 90%, AUC 0.97). In parallel, LSTM-based generative models were trained to produce de novo TB-active peptides. A generator trained exclusively on TB data produced 94 of 100 peptides predicted as antimicrobial by AMP Scanner, outperforming transfer learning-based generators. Generated peptides were evaluated for antimicrobial activity, toxicity, structure, and AMP-like physicochemical traits, and four candidates shared ≥84% identity with known TB-AMPs.

Availability and implementation: The complete model and data can be found at: https://github.com/linfeng-wang/TB-AMP-design.
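The accuracy and AUC figures quoted in the Results can be made concrete with a small sketch. Everything below is illustrative: the labels and scores are made-up toy values rather than data from the paper, and the function names `accuracy` and `roc_auc` are hypothetical helpers, not part of the authors' repository. AUC is computed here as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    is scored above a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = TB-active peptide, 0 = inactive.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

toy_acc = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"accuracy={toy_acc:.3f}  AUC={toy_auc:.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a small held-out set; a rank-based implementation would be preferable for large ones.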



Citations: 0

PSO-FeatureFusion: a general framework for fusing heterogeneous features via particle swarm optimization.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-10-29 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf263

Raziyeh Masumshah, Changiz Eslahchi

Motivation: Integrating heterogeneous biological data is a central challenge in bioinformatics, especially when modeling complex relationships among entities such as drugs, diseases, and molecular features. Existing methods often rely on static or separate feature extraction processes, which may fail to capture interactions across diverse feature types and reduce predictive accuracy.

Results: To address these limitations, we propose PSO-FeatureFusion, a unified framework that combines particle swarm optimization with neural networks to jointly integrate and optimize features from multiple biological entities. By modeling pairwise feature interactions and learning their optimal contributions, the framework captures individual feature signals and their interdependencies in a task-agnostic and modular manner. We applied PSO-FeatureFusion to two bioinformatics tasks (drug-drug interaction prediction and drug-disease association prediction) using multiple benchmark datasets. Across both tasks, the framework achieved strong performance across evaluation metrics, often outperforming or matching state-of-the-art baselines, including deep learning and graph-based models. The method also demonstrated robustness with limited hyperparameter tuning and flexibility across datasets with varying feature structures. PSO-FeatureFusion provides a scalable and practical solution for researchers working with high-dimensional biological data. Its adaptability and interpretability make it well-suited for applications in drug discovery, disease prediction, and other bioinformatics domains.

Availability and implementation: The source code and datasets are available at https://github.com/raziyehmasumshah/PSO-FeatureFusion.
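The idea of learning how much each feature source should contribute can be sketched with a minimal particle swarm optimizer. This is a simplified illustration under stated assumptions, not the authors' method: the neural-network component described in the abstract is replaced here by a plain weighted sum, the toy scores and the "true" weights are invented, and the names `pso_minimize` and `fusion_loss` are hypothetical. One particle is deliberately initialized at equal weights, so the result can never be worse than the unweighted-average baseline.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_iters=100,
                 bounds=(0.0, 1.0), inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    pos[0] = [1.0 / dim] * dim  # assumption: seed one particle at equal weights
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal and global bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the allowed weight range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Toy data (assumed): each row holds three per-source scores for one entity pair.
samples = [
    (0.9, 0.2, 0.4), (0.1, 0.8, 0.5), (0.7, 0.6, 0.1),
    (0.3, 0.3, 0.9), (0.8, 0.1, 0.2), (0.2, 0.9, 0.7),
]
TRUE_W = (0.7, 0.2, 0.1)  # invented weights used only to generate targets
targets = [sum(w * x for w, x in zip(TRUE_W, row)) for row in samples]


def fusion_loss(weights):
    """Mean squared error of the weighted-sum fusion against the targets."""
    errs = [(sum(w * x for w, x in zip(weights, row)) - t) ** 2
            for row, t in zip(samples, targets)]
    return sum(errs) / len(errs)


baseline_loss = fusion_loss([1.0 / 3] * 3)
best_w, best_loss = pso_minimize(fusion_loss, dim=3)
print("baseline loss:", baseline_loss)
print("PSO weights:", [round(w, 3) for w in best_w], "loss:", best_loss)
```

In a fuller pipeline the objective would be a validation metric of the downstream predictor rather than a closed-form error, which is where a derivative-free optimizer such as PSO becomes useful.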



Citations: 0