Bioinformatics Advances: latest articles

Transcriptional and epigenetic regulation of Ca2+-signaling genes in hepatitis B-derived hepatocellular carcinoma and their association with the cancer hallmarks.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-27 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbaf331
Guadalupe Hernández-Martínez, Andrés Hernández-Oliveras, Ángel Zarain-Herzberg, Juan Santiago-García

Motivation: Dysregulation of Ca2+-signaling genes has been shown in some types of cancer; however, it remains virtually unexplored in hepatitis B-derived hepatocellular carcinoma (HBV-HCC). Here, we evaluate the transcriptional and epigenetic regulation of Ca2+-signaling genes in HBV-HCC, whether their expression is associated with cancer hallmarks, and their prognostic potential.

Results: We identified 432 differentially expressed Ca2+-signaling genes in HBV-HCC, including 134 that are specific to this condition and were not found in non-HBV HCC. Fifty-three of these genes were associated with cancer hallmarks, of which 17 exhibited potential prognostic value in multivariate Cox analyses. We also provide new evidence for epigenetic regulation by post-translational histone modifications and DNA methylation at the promoters of some of these genes. Finally, using Least Absolute Shrinkage and Selection Operator (LASSO) regression, we identified a four-gene prognostic signature (FBLN1, STC2, C1R, and F2RL2) that robustly stratified patient outcomes. This study presents the first integrative transcriptomic and epigenetic analysis of Ca2+-signaling genes in HBV-HCC, introducing a novel four-gene signature with prognostic potential. These findings highlight the dysregulation of a subset of Ca2+-signaling genes as a distinctive feature of HBV-HCC.
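
The abstract names LASSO regression as the selection step but gives no implementation details. Prognostic signatures are usually fitted with an L1-penalized Cox model; the sketch below is a simplification that shows only the L1 shrinkage idea on ordinary least squares, using cyclic coordinate descent. The function names, the toy design matrix, and the penalty value are all assumptions made for illustration and are not taken from the study.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + penalty * ||b||_1 by cyclic coordinate descent.

    X is a list of sample rows and y a list of responses. No intercept is
    fitted, so the columns of X and y are assumed to be centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of feature j with the partial residual
            norm = 0.0  # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            rho /= n
            norm /= n
            beta[j] = soft_threshold(rho, penalty) / norm if norm > 0 else 0.0
    return beta


# Hypothetical toy data: feature 0 drives the response, feature 1 barely matters.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]  # equals 2.0 * x0 + 0.1 * x1
beta = lasso_coordinate_descent(X, y, penalty=0.5)
print(beta)  # the weak feature is shrunk to exactly zero
```

With these orthogonal toy columns the strong coefficient is shrunk from 2.0 to 1.5 and the weak one is dropped, which is the behaviour that lets LASSO reduce a candidate list to a short signature.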

Availability and implementation: All data generated or analyzed during this study are included in this article.

Bioinformatics Advances, volume 6, issue 1, vbaf331. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866915/pdf/
Citations: 0
Cell type annotation using large language models (LLMs) and CytoAnalyst.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-27 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag001
Khoi Nguyen, Duy Tran, Phuong Nguyen, Seungil Ro, Phi Bya, Tin Nguyen

Motivation: Cell annotation is fundamental for single-cell data interpretation. Accurate annotation allows us to identify cell types, understand their functions, trace developmental trajectories, and pinpoint alterations associated with a condition of interest. However, this complex process demands extensive manual curation, domain expertise, and proficiency across diverse bioinformatics tools. These challenges impede reproducibility and consistency.

Results: We have developed a new approach for semi-automatic cell type annotation, powered by large language models (LLMs). Given the input single-cell data, we first perform dimensionality reduction, clustering, and differential analysis to identify distinct cell groups and their respective markers. Next, we use Meta's Llama and structured prompting to infer potential cell types. This approach greatly reduces manual labor for researchers while maintaining biological accuracy through enforced ontology, tissue context, and marker gene signatures. Our solution is freely accessible through our web-based platform, CytoAnalyst, hosted on high-performance infrastructure with optimized networking and storage capabilities. CytoAnalyst also offers capabilities for quality control, embedding analysis, clustering, differential analysis, gene set analysis, cell enrichment, cell type annotation, and pseudo-time trajectory inference.
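
The abstract does not give the prompt format used by CytoAnalyst, so the following is only a sketch of what "structured prompting" with enforced ontology, tissue context, and marker genes could look like. The function names, prompt wording, marker scores, and label list are hypothetical, and no LLM is actually called; the reply is simulated as a string.

```python
def build_annotation_prompt(tissue, cluster_id, markers, allowed_labels, top_n=5):
    """Assemble a structured prompt from tissue context, top markers, and allowed labels.

    markers maps gene symbol -> differential-expression score (higher = stronger).
    """
    ranked = sorted(markers.items(), key=lambda item: item[1], reverse=True)
    top_markers = [gene for gene, _ in ranked[:top_n]]
    lines = [
        f"Tissue: {tissue}",
        f"Cluster: {cluster_id}",
        "Marker genes (strongest first): " + ", ".join(top_markers),
        "Allowed cell types: " + "; ".join(allowed_labels),
        "Answer with exactly one entry from the allowed cell types.",
    ]
    return "\n".join(lines)


def validate_label(reply, allowed_labels):
    """Return the canonical label if the reply matches an allowed one, else None."""
    lookup = {label.lower(): label for label in allowed_labels}
    return lookup.get(reply.strip().lower())


# Hypothetical cluster markers and a small controlled vocabulary.
markers = {"CD3E": 3.2, "CD3D": 2.9, "MS4A1": 0.1, "IL7R": 2.1}
allowed = ["T cell", "B cell", "monocyte"]
prompt = build_annotation_prompt("blood", 0, markers, allowed, top_n=3)
print(prompt)

simulated_reply = " t cell "  # stands in for the model's answer
print(validate_label(simulated_reply, allowed))  # T cell
```

Rejecting any reply outside the allowed list is one simple way to keep free-text model output tied to an ontology, leaving unmatched clusters for manual review.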

Availability and implementation: CytoAnalyst is freely available at https://cytoanalyst.tinnguyen-lab.com/. The CytoAnalyst handbook, including step-by-step tutorials and example case studies, is available at https://cytoanalyst.tinnguyen-lab.com/docs/.

Bioinformatics Advances, volume 6, issue 1, vbag001. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12883444/pdf/
Citations: 0
The power and limits of predicting inter-protein exon-exon interactions using protein 3D structures.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-27 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag032
Jeanine Liebold, Aylin Del Moral-Morales, Karen Manalastas-Cantos, Olga Tsoy, Stefan Kurtz, Jan Baumbach, Khalique Newaz

Motivation: Alternative splicing (AS) effects on cellular functions can be captured by studying changes in the underlying protein-protein interactions (PPIs). Because AS results in the gain or loss of exons, existing methods for predicting AS-related PPI changes utilize known inter-protein exon-exon interactions (EEIs), which cover <0.5% of known human PPIs. Hence, there is a need to extend the limited EEI knowledge to advance the functional understanding of AS. Here, we explore whether existing 3-dimensional (3D) protein structure-based computational PPI interface prediction (PPIIP) methods, originally designed to predict inter-protein residue-residue interactions (RRIs), can be utilized to predict EEIs.

Results: We evaluate the PPIIP methods for the RRI- and EEI-prediction tasks using all known experimentally determined 3D structures of human protein heterodimers from the Protein Data Bank available at the time of data collection. From these heterodimers, we determined ∼230 000 RRIs and ∼20 400 EEIs as ground truth. We provide the first evidence of the adaptability of existing PPIIP methods to predict EEIs, with a performance score of up to ∼76% based on the area under the receiver operating characteristic curve. Insights, data, and computational pipelines from our study can guide future developments of computational methods for solving the task of predicting EEIs.
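
The abstract does not say how residue-level interactions are aggregated to the exon level. As a sketch, the code below assumes the simplest rule: two exons from different proteins interact if at least one residue pair spanning them is in contact. The exon coordinates, residue pairs, and function names are hypothetical.

```python
def exon_of(residue_index, exon_ranges):
    """Return the exon id whose inclusive residue range contains residue_index, else None."""
    for exon_id, (start, end) in exon_ranges.items():
        if start <= residue_index <= end:
            return exon_id
    return None


def rris_to_eeis(rris, exons_a, exons_b):
    """Map residue-residue interactions between proteins A and B to exon-exon interactions.

    Returns {(exon_a, exon_b): number of supporting residue pairs}.
    """
    eeis = {}
    for res_a, res_b in rris:
        exon_a = exon_of(res_a, exons_a)
        exon_b = exon_of(res_b, exons_b)
        if exon_a is None or exon_b is None:
            continue  # residue not covered by any annotated exon
        key = (exon_a, exon_b)
        eeis[key] = eeis.get(key, 0) + 1
    return eeis


# Hypothetical exon-to-residue maps for two interacting proteins.
exons_a = {"A_e1": (1, 40), "A_e2": (41, 90)}
exons_b = {"B_e1": (1, 60), "B_e2": (61, 120)}
# Hypothetical residue contacts (index in A, index in B); residue 95 lies outside A's exons.
rris = [(10, 70), (12, 72), (50, 5), (95, 5)]

eeis = rris_to_eeis(rris, exons_a, exons_b)
print(eeis)  # {('A_e1', 'B_e2'): 2, ('A_e2', 'B_e1'): 1}
```

Keeping the count of supporting residue pairs, rather than a plain yes/no, leaves room for a stricter rule such as requiring a minimum number of contacts per exon pair.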

Availability and implementation: Data and source code are available at https://github.com/lieboldj/EEIpred.

Bioinformatics Advances, volume 6, issue 1, vbag032. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12974993/pdf/
Citations: 0
Empowering biological knowledgebases: advances in human-in-the-loop AI-driven literature curation.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-27 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag028
Valerie Wood, Matt Jeffryes, Andrew F Green, Matthias Blum, Sandra Orchard, Simona Panni, Federica Quaglia, Raul Rodriguez-Esteban, James Seager, Silvio C E Tosatto, Ulrike Wittig, Melissa Harrison

Biological knowledgebases facilitate discovery across the life sciences by structuring experimental findings into human-readable and computable formats. These essential resources are maintained by a small number of professional biocurators worldwide and face the combined pressures of chronic underfunding and exponential growth of the literature. In this perspective, we review how artificial intelligence, particularly large language models and agentic systems, can augment literature-curation workflows. Applications include literature recommendation, entity recognition, data extraction, summarization, ontology development, and quality control, with emphasis on published use cases at Global Core BioData Resources and ELIXIR Core Data Resources. We identify key challenges, including the scarcity of training data, difficulty in extracting complex relationships, and concerns about error propagation. To address these challenges, we propose a human-in-the-loop framework in which generative artificial intelligence approaches accelerate routine tasks while curators provide critical evaluation and domain expertise. We also propose practical recommendations for the community, including the creation of shared benchmark datasets, harmonized evaluation frameworks, and best-practice guidelines for transparent human-in-the-loop AI deployment in biocuration. These synergistic partnerships will be critical to ensuring biological rigour, accelerating knowledge integration while maintaining the quality essential for trusted biological resources.

Bioinformatics Advances, volume 6, issue 1, vbag028. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12904773/pdf/
Citations: 0
gLeiden: accelerated community detection algorithms using directed and undirected graphs on GPUs.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-27 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbaf327
Beenish Gul, Maria Murach, Stefan Bekarinov, Kevin Skadron

Motivation: Community detection methods are applied to single-cell RNA sequencing (scRNA-seq) and mass cytometry data to efficiently identify major cell types and their subtypes, but their computational demands are rising with the substantial growth in dataset sizes. The Leiden algorithm, an emerging method in this field, offers inherent parallelism that remains underutilized because of the limited parallel processing capabilities of modern multi-core CPUs, which have fewer than 100 cores (typically 32-64). However, Leiden can achieve significant performance gains when implemented on GPUs, which offer high memory bandwidth and an extensive array of parallel processing units that map well to the parallelism in Leiden. As far as we know, cuGraph is the only implementation that has mapped the Leiden algorithm to GPUs, using a blend of Python and C. However, it supports only undirected graphs, potentially discarding the valuable information carried by edge directionality. In addition, this Python implementation for GPUs is slower than a C/C++-based implementation, which reduces the performance gains of GPU acceleration. Conversely, a C/C++-based implementation optimizes performance more effectively and ensures an accurate baseline comparison when evaluating GPU acceleration.

Results: We developed gLeiden, a lightweight CUDA C++ GPU implementation of the Leiden algorithm and, to the best of our knowledge, the first GPU implementation that supports directed graphs, which generally demand nearly twice the computational time and memory of undirected graphs. The results show that our directed gLeiden outperforms the directed cLeiden version, with 11× and 12× speedups on very large datasets. Our undirected ucLeiden and ugLeiden implementations significantly outperform the original Java version, with up to 42× speedup on large datasets. Compared with cuGraph, the undirected ugLeiden performs comparably on smaller datasets and is 58% faster on larger datasets. These results position our GPU-based Leiden implementation as a high-performance alternative to existing state-of-the-art community detection tools.
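
The abstract does not state which quality function gLeiden optimizes; Leiden implementations commonly use modularity or the Constant Potts Model. Assuming modularity, the sketch below computes its directed form, Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j), to illustrate why edge direction carries information that an undirected treatment would discard. This is plain Python for readability, not the CUDA implementation, and the graph and function name are hypothetical.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Directed modularity of a partition.

    edges is a list of (source, target, weight); community maps node -> label.
    """
    m = float(sum(weight for _, _, weight in edges))
    internal = 0.0                   # total weight of edges inside communities
    out_weight = defaultdict(float)  # summed out-strength per community
    in_weight = defaultdict(float)   # summed in-strength per community
    for src, dst, weight in edges:
        out_weight[community[src]] += weight
        in_weight[community[dst]] += weight
        if community[src] == community[dst]:
            internal += weight
    labels = set(community.values())
    expected = sum(out_weight[c] * in_weight[c] for c in labels) / m
    return (internal - expected) / m


# Hypothetical graph: two directed 3-cycles joined by a single edge 2 -> 3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (5, 3, 1.0),
         (2, 3, 1.0)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single = {node: "A" for node in range(6)}

q_split = directed_modularity(edges, split)
q_single = directed_modularity(edges, single)
print(q_split, q_single)  # q_split is 18/49, about 0.367; q_single is 0.0
```

Because the null-model term pairs out-strength with in-strength, reversing edges can change Q even when the undirected skeleton of the graph is identical.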

Availability and implementation: The source code and sample data are available at: https://github.com/Beenishgul/Leiden and https://figshare.com/s/3b51e463a56e2a374bdf.

Bioinformatics Advances, volume 6, issue 1, vbaf327. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12987761/pdf/
Citations: 0
DeepFRAG: a method for cancer detection based on DNA fragmentomics and deep learning.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-27 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag024
Andrey Koch, Eldar Giladi

Motivation: Cancer screening using liquid biopsy technology has become standard in modern clinical and preventive oncology. This method analyzes cell-free DNA (cfDNA) circulating in a patient's bloodstream. While mutation-based diagnostics using deep exome sequencing are highly sensitive and specific, an alternative approach involves examining cfDNA fragment size distribution profiles. This method is less expensive and can be derived from low-depth whole genome sequencing (WGS).

Results: Our study presents DeepFRAG, a new cancer detection method based on deep learning analysis of cfDNA fragment size distribution profiles using the wavelet transform. We used two independent cohorts comprising 73 patients with stage III and IV cancers (breast, colorectal, pancreatic, lung, and liver) and 80 healthy individuals. We introduced an original data augmentation technique specific to WGS fragment size data, ensuring sufficient data for training the deep learning model. The proposed method demonstrated high accuracy, with a median test AUROC (area under the receiver operating characteristic curve) of 0.974 and a sensitivity of 96.1% at 98.8% specificity. Our approach offers several advantages, including high accuracy, cost-effectiveness, robustness, and suitability for detecting major cancer types. This method represents a promising advancement in cancer screening technology, expanding the options available for noninvasive cancer detection, with the goal of improving patient outcomes.
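
The two reported metrics, AUROC and sensitivity at a fixed specificity, can be computed from classifier scores as sketched below. The scores are made-up toy values, not data from the study, and the function names are hypothetical; the thresholding convention (a sample is called positive when its score is strictly above the threshold) is also an assumption.

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for pos in scores_pos:
        for neg in scores_neg:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def sensitivity_at_specificity(scores_pos, scores_neg, target_specificity):
    """Best sensitivity over all thresholds whose specificity meets the target."""
    best_sensitivity = 0.0
    for threshold in sorted(set(scores_pos) | set(scores_neg)):
        specificity = sum(1 for s in scores_neg if s <= threshold) / len(scores_neg)
        if specificity + 1e-12 >= target_specificity:
            sensitivity = sum(1 for s in scores_pos if s > threshold) / len(scores_pos)
            best_sensitivity = max(best_sensitivity, sensitivity)
    return best_sensitivity


# Hypothetical model scores for cancer (positive) and healthy (negative) samples.
cancer = [0.9, 0.8, 0.7, 0.35]
healthy = [0.1, 0.2, 0.3, 0.4, 0.05]

print(auroc(cancer, healthy))                            # 0.95
print(sensitivity_at_specificity(cancer, healthy, 1.0))  # 0.75
print(sensitivity_at_specificity(cancer, healthy, 0.8))  # 1.0
```

Reporting sensitivity at a high fixed specificity, as the abstract does, reflects the screening setting, where false positives among healthy individuals are costly.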

Availability and implementation: Data and source code are available at https://github.com/andreykoch/DeepFRAG.

Bioinformatics Advances, volume 6, issue 1, vbag024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12973171/pdf/
Citations: 0
CaRinDB: an integrated database of common cancer mutations and residue interaction network parameters.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-25 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbaf313
Daniela Coelho Batista Guedes Pereira, João Vitor Ferreira Cavalcante, Laise Florentino Cavalcanti, Raul Maia Falcão, Jorge Estefano Santana de Souza, Rodrigo Juliani Siqueira Dalmolin, Thaís Gaudencio do Rêgo, Serghei Mangul, Gustavo Antônio de Souza, Patrick Terrematte, João Paulo Matos Santos Lima

Motivation: Predicting the impact of missense mutations on protein structure and function is a fundamental challenge for cancer research and clinical applications. Despite all the computational advances and, more recently, the use of artificial intelligence (AI), assessing the functional consequences of residue substitutions remains a challenging task. Proteins have complex three-dimensional structures, where the maintenance of their functionality depends on chemical interactions between amino acid residues. Single substitutions can affect these interactions, leading to more profound structural changes that are difficult to visualize.

Results: Here, we present CaRinDB, a database that integrates cancer-associated missense mutation data, functional predictions, molecular features, allelic frequencies, and residue interaction network (RIN) parameters derived from Protein Data Bank structures and AlphaFold models. Users can access and explore variant information through an intuitive web portal, with custom plots and tables to visualize and analyze cancer-associated mutation data. CaRinDB is the first database that unites distinct annotation features of cancer-associated mutations with their structural impacts, described through RIN graph parameters. It also serves as a source of compiled and processed data for the development of AI tools.

Availability and implementation: CaRinDB is freely available at https://bioinfo.imd.ufrn.br/CaRinDB/. The data processing was carried out in Jupyter notebooks, which are available on GitHub (https://github.com/evomol-lab/CaRinDB). The CaRinDB web interface was implemented in R and Shiny.
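
To make the idea of RIN graph parameters concrete, the following is a minimal sketch that treats residues as nodes and contacts as edges, then computes two common per-residue graph measures. The abstract does not say which parameters CaRinDB stores or how its networks are built, so the choice of degree and clustering coefficient, the residue labels, and the contact list are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy network: residue labels and contacts are invented for illustration.
contacts = [("A10", "A11"), ("A10", "A45"), ("A11", "A45"), ("A45", "A80")]


def build_network(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from residue-residue contacts."""
    network: Dict[str, Set[str]] = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def degree(network: Dict[str, Set[str]], residue: str) -> int:
    """Number of residues in direct contact with the given residue."""
    return len(network[residue])


def clustering(network: Dict[str, Set[str]], residue: str) -> float:
    """Fraction of neighbour pairs that are themselves in contact."""
    neighbours = sorted(network[residue])
    if len(neighbours) < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in network[a])
    possible = len(neighbours) * (len(neighbours) - 1) / 2
    return linked / possible


rin = build_network(contacts)
for residue in sorted(rin):
    print(residue, degree(rin, residue), round(clustering(rin, residue), 3))
```

In a sketch like this, a substitution at a highly connected residue (here the hypothetical "A45") would be a natural candidate for closer structural inspection, which is the kind of question such parameters are meant to support.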

pygenstrat: a Python package for EIGENSTRAT data processing.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-23 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag022
Dilek Koptekin

Motivation: Ancient DNA studies rely heavily on the EIGENSTRAT genotype format (.geno, .ind, .snp) for standard population genetic analyses including PCA, f-statistics, and qpWave/qpAdm. However, there is limited software available for processing EIGENSTRAT format data. pygenstrat, a Python package, is presented here, providing a command-line interface for comprehensive EIGENSTRAT data processing with extensive filtering, subsetting, and conversion options. pygenstrat implements memory-efficient, chunked processing algorithms for handling large ancient DNA datasets with low memory usage. It supports comprehensive operations, including updating individual and SNP files, subsetting datasets by selecting individuals or SNPs, filtering by minor allele frequency and missingness, pseudo-haploidisation, allele polarization, as well as conversion between EIGENSTRAT (text) and ANCESTRYMAP (binary) formats. Its modular architecture and Python implementation enable rapid integration with custom pipelines and future extensions.

Results: Benchmarking on the Allen Ancient DNA Resource (v62.0) shows 2× to 15× speedups and a 90% to 95% memory reduction compared to convertf, while producing equivalent outputs for standard operations. These improvements reduce turnaround time in ancient DNA workflows and facilitate reproducible processing.

Availability and implementation: pygenstrat is open-source, available at https://github.com/dkoptekin/pygenstrat.
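
As an illustration of the filtering described in this abstract, here is a minimal sketch of streaming minor allele frequency and missingness filtering over a text .geno file. It is not pygenstrat's code or API: the function names `snp_stats` and `filter_geno` and the toy data are hypothetical. It relies only on the EIGENSTRAT text convention that each row is one SNP and each character is one individual's genotype (0, 1, or 2 copies of an allele, 9 for missing).

```python
import io
from typing import Iterable, Iterator, Tuple

MISSING = "9"  # EIGENSTRAT code for a missing genotype


def snp_stats(geno_line: str) -> Tuple[float, float]:
    """Return (missingness, minor allele frequency) for one .geno row."""
    calls = geno_line.strip()
    observed = [int(c) for c in calls if c != MISSING]
    missingness = 1.0 - len(observed) / len(calls) if calls else 1.0
    if not observed:
        return missingness, 0.0
    # Each call counts copies (0, 1, or 2) of one allele on two chromosomes.
    freq = sum(observed) / (2 * len(observed))
    return missingness, min(freq, 1.0 - freq)


def filter_geno(
    lines: Iterable[str], min_maf: float, max_missing: float
) -> Iterator[Tuple[int, str]]:
    """Stream rows one at a time, yielding (row_index, row) for SNPs that pass."""
    for index, line in enumerate(lines):
        missingness, maf = snp_stats(line)
        if maf >= min_maf and missingness <= max_missing:
            yield index, line.strip()


# Hypothetical toy data: four SNPs (rows) typed in four individuals (columns).
toy_geno = io.StringIO("0122\n2222\n0199\n9999\n")
kept = list(filter_geno(toy_geno, min_maf=0.05, max_missing=0.5))
print(kept)
```

Because the generator reads one row at a time, memory use stays roughly constant regardless of how many SNPs the file holds. The row indices it yields could then be used to subset the matching .snp file, which has one line per SNP in the same order.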

HDAnalyzeR: streamlining data analysis for biomarker research.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-23 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag020
Konstantinos Antonopoulos, Emil Johansson, Josefin Kenrick, Leo Dahl, Fredrik Edfors, Mathias Uhlén, María Bueno Álvez

Motivation: Exploration of large-scale biological datasets remains a central challenge in computational biology. While many tools are available, they are often developed in isolation, leading to fragmented workflows, duplicated efforts, and limited reproducibility. There is a pressing need for flexible, standardized solutions that unify exploratory data analysis and biomarker discovery across diverse platforms.

Results: We present HDAnalyzeR, a user-friendly and extensible R package for the streamlined analysis of high-dimensional biological data. HDAnalyzeR provides modular, reproducible workflows that support a range of analyses, from quality control and dimensionality reduction to differential expression and enrichment analysis. The package features built-in visualization, metadata-aware modeling, and seamless integration with interactive apps and learning resources. We also present two case studies, where HDAnalyzeR dramatically reduced analysis time and code complexity while providing biologically meaningful insights, such as classification of blood cancer types with AUC = 1.0 and identification of thousands of solid tumor-associated genes. HDAnalyzeR is designed to support both beginner users and experienced bioinformaticians, promoting transparency, reproducibility, and publication-quality output.

Availability and implementation: HDAnalyzeR is freely available both as an open-source R package at https://github.com/kantonopoulos/HDAnalyzeR and a web application at https://hdanalyzer.serve.scilifelab.se.

OMAnnotator: a novel approach to building an annotated consensus genome sequence.
IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2026-01-22 eCollection Date: 2026-01-01 DOI: 10.1093/bioadv/vbag015
Sadé Bates, Christophe Dessimoz, Yannis Nevers

Motivation: Advances in sequencing technologies have enabled researchers to sequence whole genomes rapidly and cheaply. However, despite improvements in genome assembly, structural genome annotation (i.e. the identification of protein-coding genes) remains challenging, particularly for eukaryotic genomes. It requires using several approaches (typically ab initio, transcriptomics, and homology search), which may give substantially different results. Deciding which gene models to retain in a consensus is far from trivial, and automated approaches tend to lag behind laborious manual curation efforts in accuracy.

Results: We present OMAnnotator, a novel approach to building a consensus annotation. OMAnnotator repurposes the OMA algorithm, originally designed to elucidate evolutionary relationships among genes across species, to integrate predictions from different annotation sources into a consensus annotation, using evolutionary information as a tie-breaker. During benchmarking on the Drosophila melanogaster reference, OMAnnotator's consensus improved upon its source annotations and upon two state-of-the-art pipelines used as annotation combiners with the same inputs. When applied to three recently published genomes, OMAnnotator gave substantial improvements in two cases, and mixed results in the third, which had already benefitted from extensive expert curation. This underlines the method's effectiveness and robustness for combining the results of disagreeing annotation tools, strengthening the toolkit for eukaryotic genome annotation.

Availability and implementation: OMAnnotator is available on GitHub (https://github.com/DessimozLab/OMAnnotator).
