Pub Date: 2026-01-27; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf331
Guadalupe Hernández-Martínez, Andrés Hernández-Oliveras, Ángel Zarain-Herzberg, Juan Santiago-García
Motivation: Dysregulation of Ca2+-signaling genes has been shown in some types of cancer; however, it is virtually unknown in hepatitis B-derived hepatocellular carcinoma (HBV-HCC). Here, we evaluate the transcriptional and epigenetic regulation of Ca2+-signaling genes in HBV-HCC and whether their expression is associated with cancer hallmarks and prognostic potential.
Results: We identified 432 differentially expressed Ca2+-signaling genes in HBV-HCC, including 134 that are specific to this condition and were not found in non-HBV HCC. Fifty-three of these genes were associated with cancer hallmarks, of which 17 exhibited potential prognostic value by Cox multivariate analyses. We also provide new evidence for epigenetic regulation by post-translational histone modifications and DNA methylation at the promoter of some of these genes. Finally, using Least Absolute Shrinkage and Selection Operator (LASSO) regression, we identified a four-gene prognostic signature (FBLN1, STC2, C1R, and F2RL2) that robustly stratified patient outcomes. This study presents the first integrative transcriptomic and epigenetic analysis of Ca2+-signaling genes in HBV-HCC, introducing a novel four-gene signature with prognostic potential. These findings highlight the dysregulation of a subset of Ca2+-signaling genes as a distinctive feature of HBV-HCC.
Availability and implementation: All data generated or analyzed during this study are included in this article.
Title: Transcriptional and epigenetic regulation of Ca2+-signaling genes in hepatitis B-derived hepatocellular carcinoma and their association with the cancer hallmarks. Journal: Bioinformatics Advances, 6(1), vbaf331. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866915/pdf/
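A gene signature selected by LASSO-penalized Cox regression is commonly applied as a linear risk score, with patients split at the cohort median. The sketch below illustrates that general pattern only. The coefficients and expression values are hypothetical placeholders, not values from the study, and the function names are invented for illustration.

```python
from statistics import median

# HYPOTHETICAL coefficients: placeholders for illustration, NOT taken from the study.
HYPOTHETICAL_COEFS = {"FBLN1": 0.2, "STC2": 0.3, "C1R": -0.1, "F2RL2": 0.4}


def risk_score(expression, coefs=HYPOTHETICAL_COEFS):
    """Linear predictor: sum of coefficient * expression over the signature genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def stratify(patients, coefs=HYPOTHETICAL_COEFS):
    """Label each patient 'high' or 'low' risk relative to the cohort median score."""
    scores = {pid: risk_score(expr, coefs) for pid, expr in patients.items()}
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Toy expression values (made up for the example).
patients = {
    "P1": {"FBLN1": 1.0, "STC2": 2.0, "C1R": 1.0, "F2RL2": 0.5},
    "P2": {"FBLN1": 0.5, "STC2": 0.5, "C1R": 2.0, "F2RL2": 0.1},
    "P3": {"FBLN1": 2.0, "STC2": 3.0, "C1R": 0.5, "F2RL2": 1.0},
}
groups = stratify(patients)
```

In practice the coefficients would come from the fitted model, and the resulting groups would be compared with a survival analysis such as Kaplan-Meier curves and a log-rank test.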
Motivation: Cell annotation is fundamental for single-cell data interpretation. Accurate annotation allows us to identify cell types, understand their functions, trace developmental trajectories, and pinpoint alterations associated with a condition of interest. However, this complex process demands extensive manual curation, domain expertise, and proficiency across diverse bioinformatics tools. These challenges impede reproducibility and consistency.
Results: We have developed a new approach for semi-automatic cell type annotation, powered by large language models (LLMs). Given the input single-cell data, we first perform dimension reduction, clustering, and differential analysis to identify distinct cell groups and their respective markers. Next, we utilize Meta's Llama and structured prompting to infer potential cell types. This approach greatly reduces manual labor from researchers while maintaining biological accuracy through enforced ontology, tissue context, and marker gene signatures. Our solution is freely accessible through our web-based platform named CytoAnalyst, hosted on a high-performance infrastructure with optimized networking and storage capabilities. CytoAnalyst also offers capabilities for quality control, embedding analysis, clustering, differential analysis, gene set analysis, cell enrichment, cell type annotation, and pseudo-time trajectory inference.
Availability and implementation: CytoAnalyst is freely available at https://cytoanalyst.tinnguyen-lab.com/. The CytoAnalyst handbook, including step-by-step tutorials and example case studies, is available at https://cytoanalyst.tinnguyen-lab.com/docs/.
Title: Cell type annotation using large language models (LLMs) and CytoAnalyst. Authors: Khoi Nguyen, Duy Tran, Phuong Nguyen, Seungil Ro, Phi Bya, Tin Nguyen. Pub Date: 2026-01-27; DOI: 10.1093/bioadv/vbag001. Journal: Bioinformatics Advances, 6(1), vbag001. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12883444/pdf/
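The structured-prompting step described in the CytoAnalyst abstract can be pictured roughly as follows. This is a minimal sketch, not CytoAnalyst's actual prompt or code: the function names, the prompt wording, and the JSON reply format are all assumptions, and no model is called here (the reply is hand-written).

```python
import json


def build_annotation_prompt(tissue, cluster_markers, allowed_labels):
    """Build a structured prompt asking an LLM to label clusters from marker genes."""
    lines = [
        f"Tissue: {tissue}",
        "Choose exactly one label per cluster from this list: " + ", ".join(allowed_labels),
        'Answer as JSON mapping cluster id to label, e.g. {"0": "T cell"}.',
        "Clusters and top marker genes:",
    ]
    for cluster_id, markers in sorted(cluster_markers.items()):
        lines.append(f"- cluster {cluster_id}: {', '.join(markers)}")
    return "\n".join(lines)


def parse_annotation(reply, allowed_labels):
    """Parse the reply; labels outside the allowed list become 'unknown' for manual review."""
    raw = json.loads(reply)
    allowed = set(allowed_labels)
    return {cid: (label if label in allowed else "unknown") for cid, label in raw.items()}


labels = ["T cell", "B cell", "Monocyte"]
markers = {"0": ["CD3D", "CD3E"], "1": ["CD79A", "MS4A1"]}
prompt = build_annotation_prompt("blood", markers, labels)

# Hand-written stand-in for a model reply; the second label is off-list on purpose.
annotation = parse_annotation('{"0": "T cell", "1": "Plasma-like"}', labels)
```

Restricting answers to a fixed label list is one simple way to enforce an ontology, and routing off-list answers to manual review keeps the workflow semi-automatic.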
Pub Date: 2026-01-27; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbag032
Jeanine Liebold, Aylin Del Moral-Morales, Karen Manalastas-Cantos, Olga Tsoy, Stefan Kurtz, Jan Baumbach, Khalique Newaz
Motivation: Alternative splicing (AS) effects on cellular functions can be captured by studying changes in the underlying protein-protein interactions (PPIs). Because AS results in the gain or loss of exons, existing methods for predicting AS-related PPI changes utilize known inter-protein exon-exon interactions (EEIs), which cover <0.5% of known human PPIs. Hence, there is a need to extend the limited EEI knowledge to advance the functional understanding of AS. Here, we explore whether existing 3-dimensional (3D) protein structure-based computational PPI interface prediction (PPIIP) methods, originally designed to predict inter-protein residue-residue interactions (RRIs), can be utilized to predict EEIs.
Results: We evaluate the PPIIP methods for the RRI- and EEI-prediction tasks using all known experimentally determined 3D structures of human protein heterodimers from the Protein Data Bank available at the time of data collection. From these heterodimers, we determined ∼230 000 RRIs and ∼20 400 EEIs as ground truth. We provide the first evidence of the adaptability of existing PPIIP methods to predict EEIs, with a performance score of up to ∼76% based on the area under the receiver operating characteristic curve. Insights, data, and computational pipelines from our study can guide future developments of computational methods for solving the task of predicting EEIs.
Availability and implementation: Data and source code are available at https://github.com/lieboldj/EEIpred.
Title: The power and limits of predicting inter-protein exon-exon interactions using protein 3D structures. Journal: Bioinformatics Advances, 6(1), vbag032. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12974993/pdf/
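One plausible way to turn residue-level interface predictions into exon-level predictions is to map each residue to its exon and aggregate the residue-pair scores per exon pair. The sketch below uses the maximum as the aggregate. That choice, the function name, and the toy data are assumptions for illustration; the abstract does not state how the study aggregates scores.

```python
from collections import defaultdict


def exon_pair_scores(rri_scores, exon_of_a, exon_of_b):
    """Aggregate residue-pair scores into exon-pair scores by taking the maximum.

    rri_scores maps (residue in protein A, residue in protein B) to a score,
    assumed here to lie in [0, 1]; exon_of_a / exon_of_b map residues to exon ids.
    """
    best = defaultdict(float)  # default 0.0 is a safe floor for scores in [0, 1]
    for (res_a, res_b), score in rri_scores.items():
        pair = (exon_of_a[res_a], exon_of_b[res_b])
        best[pair] = max(best[pair], score)
    return dict(best)


# Toy data (made up): three predicted residue pairs across two exons of protein A.
rri_scores = {(1, 10): 0.2, (2, 10): 0.9, (3, 11): 0.4}
exon_of_a = {1: "A1", 2: "A1", 3: "A2"}
exon_of_b = {10: "B1", 11: "B1"}
eei = exon_pair_scores(rri_scores, exon_of_a, exon_of_b)
```

Exon-pair scores of this kind can then be ranked against ground-truth EEIs to compute an area under the receiver operating characteristic curve.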
Pub Date: 2026-01-27; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbag028
Valerie Wood, Matt Jeffryes, Andrew F Green, Matthias Blum, Sandra Orchard, Simona Panni, Federica Quaglia, Raul Rodriguez-Esteban, James Seager, Silvio C E Tosatto, Ulrike Wittig, Melissa Harrison
Biological knowledgebases facilitate discovery across the life sciences by structuring experimental findings into human-readable and computable formats. These essential resources are maintained by a small number of professional biocurators worldwide and face the combined pressures of chronic underfunding and the exponential growth of the literature. In this perspective, we review how artificial intelligence, particularly large language models and agentic systems, can augment literature-curation workflows. Applications include literature recommendation, entity recognition, data extraction, summarization, ontology development, and quality control, with emphasis on published use cases at Global Core BioData Resources and ELIXIR Core Data Resources. We identify key challenges, including the scarcity of training data, difficulty in extracting complex relationships, and concerns about error propagation. To address these challenges, we propose a human-in-the-loop framework where generative artificial intelligence approaches accelerate routine tasks while curators provide critical evaluation and domain expertise. We also propose practical recommendations for the community, including the creation of shared benchmark datasets, harmonized evaluation frameworks, and best-practice guidelines for transparent human-in-the-loop AI deployment in biocuration. These synergistic partnerships will be critical to ensure biological rigour, accelerating knowledge integration while maintaining the quality essential for trusted biological resources.
Title: Empowering biological knowledgebases: advances in human-in-the-loop AI-driven literature curation. Journal: Bioinformatics Advances, 6(1), vbag028. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12904773/pdf/
Pub Date: 2026-01-27; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf327
Beenish Gul, Maria Murach, Stefan Bekarinov, Kevin Skadron
Motivation: Community detection methods are applied to single-cell RNA sequencing (scRNA-seq) and mass cytometry data to efficiently identify major cell types and their subtypes, but their computational demands grow with the substantial increase in dataset sizes. The Leiden algorithm, an emerging method in this field, offers inherent parallelism that remains underutilized because of the limited parallel processing capabilities of current multi-core CPUs, which have fewer than 100 cores (typically 32-64). Leiden can, however, achieve significant performance gains when implemented on GPUs, which offer high memory bandwidth and an extensive array of parallel processing units that map well to the parallelism in Leiden. As far as we know, cuGraph is the only implementation that has mapped the Leiden algorithm to GPUs, using a blend of Python and C. However, it supports only undirected graphs, potentially discarding the valuable information carried by edge directionality. In addition, this Python implementation for GPUs is slower than a C/C++ based implementation, which reduces the performance gains provided by GPU acceleration. A C/C++ based implementation optimizes performance more effectively and provides an accurate baseline when evaluating GPU acceleration.
Results: We developed gLeiden, a lightweight CUDA C++ based GPU implementation of the Leiden algorithm and, to the best of our knowledge, the first GPU implementation that supports directed graphs, which generally demand nearly twice the computational time and memory resources of undirected graphs. Our directed gLeiden outperforms the directed cLeiden version, showing 11× and 12× speedups on very large datasets. Our undirected ucLeiden and ugLeiden implementations significantly outperform the original Java version, with up to 42× speedup on large datasets. Compared with cuGraph, the undirected ugLeiden performs comparably on smaller datasets and is 58% faster on larger datasets. These results position our GPU-based Leiden implementation as a high-performance alternative to existing state-of-the-art community detection tools.
Availability and implementation: The source code and sample data are available at: https://github.com/Beenishgul/Leiden and https://figshare.com/s/3b51e463a56e2a374bdf.
Title: gLeiden: accelerated community detection algorithms using directed and undirected graphs on GPUs. Journal: Bioinformatics Advances, 6(1), vbaf327. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12987761/pdf/
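To make concrete what edge directionality adds, the sketch below computes the standard directed generalization of modularity, Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j), for a fixed partition of an unweighted directed graph. Leiden implementations optimize a quality function of this kind. Whether gLeiden uses exactly this formulation is an assumption, and this CPU-only Python function is an illustration, not the tool's CUDA code.

```python
def directed_modularity(edges, community):
    """Directed modularity of a partition for an unweighted directed edge list.

    edges: list of (source, target) pairs; community: dict mapping node -> community id.
    """
    m = len(edges)
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    # Observed fraction of edges whose endpoints share a community.
    internal = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction under the degree-preserving null model, summed per community.
    out_tot, in_tot = {}, {}
    for node, c in community.items():
        out_tot[c] = out_tot.get(c, 0) + out_deg.get(node, 0)
        in_tot[c] = in_tot.get(c, 0) + in_deg.get(node, 0)
    expected = sum(out_tot[c] * in_tot[c] for c in out_tot) / (m * m)
    return internal - expected


# Two directed 3-cycles joined by a single edge 2 -> 3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {n: 0 for n in range(6)}
q_split = directed_modularity(edges, two_groups)
q_merged = directed_modularity(edges, one_group)
```

Because out-degree and in-degree are tracked separately, a directed implementation carries roughly twice the per-node bookkeeping of an undirected one, which is in line with the extra time and memory cost noted in the abstract.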
Pub Date: 2026-01-27; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbag024
Andrey Koch, Eldar Giladi
Motivation: Cancer screening using liquid biopsy technology has become standard in modern clinical and preventive oncology. This method analyzes cell-free DNA (cfDNA) circulating in a patient's bloodstream. While mutation-based diagnostics using deep exome sequencing are highly sensitive and specific, an alternative approach involves examining cfDNA fragment size distribution profiles. This method is less expensive and can be derived from low-depth whole genome sequencing (WGS).
Results: Our study presents DeepFRAG: a new cancer detection method based on deep learning analysis of cfDNA fragment size distribution profiles using wavelet transform. We utilized two independent cohorts comprising 73 patients with stage III and IV cancers (breast, colorectal, pancreatic, lung, and liver) and 80 healthy individuals. We introduced an original data augmentation technique specific to WGS fragment size data, ensuring sufficient data for training the deep learning model. The proposed method demonstrated high accuracy, with a median test AUROC (area under the receiver operating characteristic curve) of 0.974 and a sensitivity of 96.1% at 98.8% specificity. Our approach offers several advantages, including high accuracy, cost-effectiveness, robustness, and suitability for detecting major cancer types. This method represents a promising advancement in cancer screening technology, expanding the options available for noninvasive cancer detection, with the goal of improving patient outcomes.
Availability and implementation: Data and source code are available at https://github.com/andreykoch/DeepFRAG.
Title: DeepFRAG: a method for cancer detection based on DNA fragmentomics and deep learning. Journal: Bioinformatics Advances, 6(1), vbag024. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12973171/pdf/
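The two preprocessing ideas named in the DeepFRAG abstract, a fragment size distribution profile and a wavelet transform, can be sketched in a few lines. This is an illustration under stated assumptions, not the DeepFRAG pipeline: the Haar wavelet, the bin range, and the function names are chosen here for simplicity, and the abstract does not specify which wavelet or bins the method uses.

```python
import math


def fragment_size_profile(lengths, min_len=100, n_bins=8):
    """Normalized histogram of fragment lengths over [min_len, min_len + n_bins)."""
    counts = [0] * n_bins
    for length in lengths:
        idx = length - min_len
        if 0 <= idx < n_bins:  # lengths outside the window are ignored
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def haar_dwt(signal):
    """Full orthonormal Haar decomposition; len(signal) must be a power of two.

    Returns [approximation, coarsest details, ..., finest details] as one flat list.
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


profile = fragment_size_profile([100, 100, 101, 107, 300])
coeffs = haar_dwt(profile)
```

A realistic profile would span a much wider length range (cfDNA fragments cluster around the mononucleosome size), and the wavelet coefficients would then serve as input features for the classifier.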
Pub Date: 2026-01-25; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf313
Daniela Coelho Batista Guedes Pereira, João Vitor Ferreira Cavalcante, Laise Florentino Cavalcanti, Raul Maia Falcão, Jorge Estefano Santana de Souza, Rodrigo Juliani Siqueira Dalmolin, Thaís Gaudencio do Rêgo, Serghei Mangul, Gustavo Antônio de Souza, Patrick Terrematte, João Paulo Matos Santos Lima
Motivation: Predicting the impact of missense mutations on protein structure and function is a fundamental challenge for cancer research and clinical applications. Despite all the computational advances and, more recently, the use of artificial intelligence (AI), assessing the functional consequences of residue substitutions remains a challenging task. Proteins have complex three-dimensional structures, where the maintenance of their functionality depends on chemical interactions between amino acid residues. Single substitutions can affect these interactions, leading to more profound structural changes that are difficult to visualize.
Results: Here, we present CaRinDB, a database that integrates cancer-associated missense mutation data, functional predictions, molecular features, allelic frequencies, and residue interaction network (RIN) parameters derived from Protein Data Bank structures and AlphaFold models. Users can access and explore variant information through an intuitive web portal, with custom plots and tables to visualize and analyze cancer-associated mutation data. CaRinDB is the first database that unites distinct annotation features of cancer-associated mutations with their structural impacts, described by RIN graph parameters, and it provides a source of compiled and processed data for the development of AI tools.
Availability and implementation: CaRinDB is freely available at https://bioinfo.imd.ufrn.br/CaRinDB/. The analyses were developed in Jupyter notebooks, available on GitHub (https://github.com/evomol-lab/CaRinDB). The CaRinDB web interface was implemented in R and Shiny.
Title: CaRinDB: an integrated database of common cancer mutations and residue interaction network parameters. Journal: Bioinformatics Advances, 6(1), vbaf313. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12872580/pdf/
Pub Date: 2026-01-23, eCollection Date: 2026-01-01, DOI: 10.1093/bioadv/vbag022
Dilek Koptekin
Motivation: Ancient DNA studies rely heavily on the EIGENSTRAT genotype format (.geno, .ind, .snp) for standard population genetic analyses including PCA, f-statistics, and qpWave/qpAdm. However, little software is available for processing EIGENSTRAT-format data. pygenstrat, a Python package presented here, provides a command-line interface for comprehensive EIGENSTRAT data processing with extensive filtering, subsetting, and conversion options. pygenstrat implements memory-efficient, chunked processing algorithms that handle large ancient DNA datasets with low memory usage. It supports updating individual and SNP files, subsetting datasets by selecting individuals or SNPs, filtering by minor allele frequency and missingness, pseudo-haploidisation, allele polarization, and conversion between EIGENSTRAT (text) and ANCESTRYMAP (binary) formats. Its modular architecture and Python implementation enable rapid integration with custom pipelines and future extensions.
Results: Benchmarking on the Allen Ancient DNA Resource (v 62.0) shows 2× to 15× speedups and 90% to 95% lower memory use compared with convertf, while producing equivalent outputs for standard operations. These improvements reduce turnaround time in ancient DNA workflows and facilitate reproducible processing.
Availability and implementation: pygenstrat is open-source, available at https://github.com/dkoptekin/pygenstrat.
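As an illustration of the kind of filtering described above, here is a minimal sketch of streaming minor allele frequency and missingness filters over a text-format .geno file. It is not pygenstrat's implementation or API; the function names and thresholds are hypothetical. It relies only on the standard EIGENSTRAT text encoding, in which each line is one SNP, each character is one individual, the values 0, 1, and 2 count reference alleles, and 9 marks a missing call.

```python
from typing import Iterable, Iterator, Tuple


def snp_stats(genotypes: str) -> Tuple[float, float]:
    """Return (missingness, minor allele frequency) for one .geno line."""
    calls = [int(c) for c in genotypes.strip()]
    called = [g for g in calls if g != 9]  # 9 encodes a missing genotype
    missingness = 1.0 - len(called) / len(calls) if calls else 1.0
    if not called:
        return missingness, 0.0
    # Each called genotype contributes two alleles (diploid coding).
    ref_freq = sum(called) / (2 * len(called))
    return missingness, min(ref_freq, 1.0 - ref_freq)


def filter_snps(
    lines: Iterable[str], min_maf: float = 0.05, max_missing: float = 0.1
) -> Iterator[Tuple[int, str]]:
    """Yield (line index, genotype string) for SNPs that pass both filters.

    Lines are consumed one at a time, so memory use does not grow with
    the number of SNPs.
    """
    for index, line in enumerate(lines):
        if not line.strip():
            continue
        missing, maf = snp_stats(line)
        if maf >= min_maf and missing <= max_missing:
            yield index, line.strip()


# Toy data: four SNPs genotyped in four individuals (hypothetical values).
toy_geno = ["0122", "2222", "0199", "1129"]
kept = list(filter_snps(toy_geno, min_maf=0.05, max_missing=0.25))
print(kept)
```

Because `filter_snps` accepts any iterable of lines, an open file handle can be passed directly instead of the toy list. The returned indices identify which rows of the matching .snp file would need to be kept alongside the filtered genotypes.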
{"title":"pygenstrat: a Python package for EIGENSTRAT data processing.","authors":"Dilek Koptekin","doi":"10.1093/bioadv/vbag022","journal":{"name":"Bioinformatics advances","volume":"6 1","pages":"vbag022"},"publicationDate":"2026-01-23","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12895063/pdf/"}
Pub Date: 2026-01-23, eCollection Date: 2026-01-01, DOI: 10.1093/bioadv/vbag020
Konstantinos Antonopoulos, Emil Johansson, Josefin Kenrick, Leo Dahl, Fredrik Edfors, Mathias Uhlén, María Bueno Álvez
Motivation: Exploration of large-scale biological datasets remains a central challenge in computational biology. While many tools are available, they are often developed in isolation, leading to fragmented workflows, duplicated efforts, and limited reproducibility. There is a pressing need for flexible, standardized solutions that unify exploratory data analysis and biomarker discovery across diverse platforms.
Results: We present HDAnalyzeR, a user-friendly and extensible R package for the streamlined analysis of high-dimensional biological data. HDAnalyzeR provides modular, reproducible workflows that support a range of analyses, from quality control and dimensionality reduction to differential expression and enrichment analysis. The package features built-in visualization, metadata-aware modeling, and seamless integration with interactive apps and learning resources. We also present two case studies, where HDAnalyzeR dramatically reduced analysis time and code complexity while providing biologically meaningful insights, such as classification of blood cancer types with AUC = 1.0 and identification of thousands of solid tumor-associated genes. HDAnalyzeR is designed to support both beginner users and experienced bioinformaticians, promoting transparency, reproducibility, and publication-quality output.
Availability and implementation: HDAnalyzeR is freely available both as an open-source R package at https://github.com/kantonopoulos/HDAnalyzeR and as a web application at https://hdanalyzer.serve.scilifelab.se.
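For readers less familiar with the AUC metric reported in the case studies, the sketch below shows what a value of 1.0 means: every positive sample receives a higher score than every negative sample. This is a generic pairwise definition written in Python for illustration; it is not code from HDAnalyzeR (an R package), and the function name and toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample is scored higher; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Perfect separation: both positives outrank both negatives.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# One positive (0.4) is outranked by one negative (0.6).
mixed = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(perfect, mixed)
```

An AUC of exactly 1.0 therefore indicates complete separation of the classes on the evaluated samples, which is worth confirming on held-out data before it is interpreted as general classifier performance.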
{"title":"HDAnalyzeR: streamlining data analysis for biomarker research.","authors":"Konstantinos Antonopoulos, Emil Johansson, Josefin Kenrick, Leo Dahl, Fredrik Edfors, Mathias Uhlén, María Bueno Álvez","doi":"10.1093/bioadv/vbag020","journal":{"name":"Bioinformatics advances","volume":"6 1","pages":"vbag020"},"publicationDate":"2026-01-23","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12925248/pdf/"}
Pub Date: 2026-01-22, eCollection Date: 2026-01-01, DOI: 10.1093/bioadv/vbag015
Sadé Bates, Christophe Dessimoz, Yannis Nevers
Motivation: Advances in sequencing technologies have enabled researchers to sequence whole genomes rapidly and cheaply. However, despite improvements in genome assembly, structural genome annotation (i.e. the identification of protein-coding genes) remains challenging, particularly for eukaryotic genomes. It requires using several approaches (typically ab initio, transcriptomics, and homology search), which may give substantially different results. Deciding which gene models to retain in a consensus is far from trivial, and automated approaches tend to lag behind laborious manual curation efforts in accuracy.
Results: We present OMAnnotator, a novel approach to building a consensus annotation. OMAnnotator repurposes the OMA algorithm, originally designed to elucidate evolutionary relationships among genes across species, to integrate predictions from different annotation sources into a consensus annotation, using evolutionary information as a tie-breaker. In benchmarking on the Drosophila melanogaster reference, OMAnnotator's consensus improved upon its source annotations and outperformed two state-of-the-art pipelines used as annotation combiners with the same inputs. When applied to three recently published genomes, OMAnnotator gave substantial improvements in two cases and mixed results in the third, which had already benefited from extensive expert curation. This underlines the method's effectiveness and robustness for combining the results of disagreeing annotation tools, strengthening the toolkit for eukaryotic genome annotation.
Availability and implementation: OMAnnotator is available on GitHub (https://github.com/DessimozLab/OMAnnotator).
{"title":"OMAnnotator: a novel approach to building an annotated consensus genome sequence.","authors":"Sadé Bates, Christophe Dessimoz, Yannis Nevers","doi":"10.1093/bioadv/vbag015","journal":{"name":"Bioinformatics advances","volume":"6 1","pages":"vbag015"},"publicationDate":"2026-01-22","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12927413/pdf/"}