Pub Date: 2026-02-02; DOI: 10.1177/15578666261416560
Muyao Huang, Martin C Frith
The insertion of mitochondrial genome-derived DNA sequences into the nuclear genome is a frequent event in organismal evolution, resulting in nuclear-mitochondrial DNA segments (NUMTs), which serve as a significant driving force for genome evolution. Once incorporated into the nuclear genome, some NUMTs can be conserved for extended periods and may potentially acquire novel cellular roles. However, current mainstream methods for detecting NUMTs are inefficient at identifying ancient and highly degraded NUMTs, leading to their prevalence and impact being underestimated. These ancient NUMTs likely play a far greater role in genetic functions than previously recognized, including contributing to the acquisition of functional exons. This study focuses on identifying ancient NUMTs in mammalian genomes using enhanced high-sensitivity sequence comparison methods. A sensitive and accurate NUMT searching pipeline was established, predicting 1013 NUMTs in the human reference genome, 364 (36%) of which are newly detected compared to the University of California, Santa Cruz (UCSC) reference human NUMTs database. Notably, 90 pre-eutherian human NUMTs were identified, representing significantly older NUMTs than previously reported, with origins dating back at least 100 million years. The most ancient mammalian NUMT could even date back over 160 million years, inserted into the nuclear genome of the common ancestor of therian mammals. This study provides a comprehensive exploration of the quantity and evolutionary history of mammalian NUMTs, paving the way for future research on endosymbiotic impact on the evolution of nuclear genomes.
Title: Probability-Based Sequence Comparison Finds Pre-Eutherian Nuclear Mitochondrial DNA Segments in Mammalian Genomes; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 15578666261416560; Publication type: Journal Article
Pub Date: 2026-02-01; Epub Date: 2025-11-04; DOI: 10.1177/15578666251388496
Xiao Liang, Lenwood S Heath
Cancer genes, including oncogenes and tumor suppressor genes, are crucial subjects of study in human health. With advancements in nonhuman primate genome research, recent investigations have started to explore the relationship between human cancer genes and various primate species, implying possible links between cancer genes and nonhuman primate species. However, these findings are commonly limited to only a few primate species. Here, we investigate the distribution of 726 human cancer genes across 31 nonhuman primate species and 4 nonprimate species, comparing them with a randomly selected set of 339 other human genes. Overall, a higher ratio of cancer genes was found to have emerged before primate speciation compared with random human genes; however, this trend did not necessarily continue after primate speciation. Furthermore, our investigation into primate clades suggests that the absence of certain cancer genes in a clade may reflect recent emergence in other clades in which they exist, rather than elimination within the clades in which they are absent.
Title: Reflections on the Presence of Human Cancer Genes in Primate Genomes; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 219-235; Publication type: Journal Article
Pub Date: 2026-02-01; Epub Date: 2026-02-02; DOI: 10.1177/15578666251392595
Yao-Ban Chan, Celine Scornavacca, Michael Charleston
Reconciliations are a mathematical tool to compare the phylogenetic trees of genes with those of the species that contain them, accounting for events such as gene duplication and loss. Traditional reconciliation methods have predominantly relied on parsimony to infer gene-only evolutionary events and usually assume that genes evolve independently. Recently, more advanced models have been developed that account for complex gene interactions stemming from phenomena such as segmental duplications, where multiple genes undergo simultaneous duplication. In this article, we study the NP-hard problem of reconciling gene trees to a species tree with segmental duplications, without the aid of synteny information. We address this problem by proposing a novel probabilistic approach, imposing a Boltzmann distribution over the space of reconciliations. This allows for a Gibbs sampling-like Markov chain Monte Carlo algorithm that uses simulated annealing to effectively find or approximate the most parsimonious reconciliation, as demonstrated through rigorous simulations and re-analysis of empirical datasets. Our findings present a promising new framework for addressing NP-hard reconciliation challenges in phylogenetics, enhancing our understanding of gene evolution and its relationship with species evolution.
Title: A Probabilistic Algorithm for Gene-Species Reconciliation with Segmental Duplications; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 203-218; Publication type: Journal Article
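The reconciliation abstract above describes placing a Boltzmann distribution over candidate solutions and cooling it with simulated annealing. The sketch below illustrates that general idea only: it uses plain Metropolis acceptance rather than the Gibbs sampling-like moves the authors mention, and the state space, `toy_cost`, and `toy_propose` are hypothetical stand-ins, not the paper's reconciliation model.

```python
import math
import random


def simulated_annealing(initial, cost, propose, t_start=2.0, t_end=0.01,
                        cooling=0.95, steps_per_temp=50, seed=0):
    """Minimise `cost` with Metropolis moves under a geometric cooling schedule."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_temp):
            candidate = propose(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Accept with probability min(1, exp(-delta / T)): the ratio of
            # Boltzmann weights exp(-cost / T) of candidate and current state.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling  # lower T concentrates mass on low-cost states
    return best, best_cost


# Hypothetical toy problem: integer vectors whose "event cost" is the sum of
# absolute values. A real reconciliation cost would count duplications/losses.
def toy_cost(state):
    return sum(abs(x) for x in state)


def toy_propose(state, rng):
    i = rng.randrange(len(state))
    step = rng.choice((-1, 1))
    return state[:i] + (state[i] + step,) + state[i + 1:]


start = (5, -3, 4)
best_state, best_cost = simulated_annealing(start, toy_cost, toy_propose)
print(best_state, best_cost)
```

As the temperature falls, uphill moves are accepted less often, so the chain drifts toward low-cost (most parsimonious) states while still being able to escape local minima early on.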
Pub Date: 2026-02-01; Epub Date: 2026-02-02; DOI: 10.1177/15578666251382251
Donghui Son, Jaejik Kim
Dynamic systems encompass a broad class of mathematical models used to describe the behavior of complex networks or systems over time. One of the most common approaches to modeling such dynamics is through a set of ordinary differential equations (ODEs), typically constructed based on hypotheses, known interactions, or observed trajectories. However, ODEs are deterministic and inflexible, while biological data are typically noisy. Thus, the model fit might not account for all possible data variations, and there might be a discrepancy between the actual biological process and the assumed model. This discrepancy could lead to inaccuracies in the prediction and interpretation of the biological networks. Therefore, ODE models need to be validated against observed data. Given that biological networks typically involve multiple sources of errors and uncertainties, the validation process should account for these factors. Bayesian approaches offer a robust framework for quantifying errors and uncertainties. Thus, in this study, we propose a Bayesian validation method for ODE models that addresses model inadequacy, represented as bias. Since the proposed method estimates bias as a function of time, it can provide prediction bounds for the entire observed time interval. Consequently, it allows for a direct evaluation of the model's validity across the whole time interval, and it can lead to better prediction by correcting the bias.
Title: Bayesian Validation of Dynamic Systems for Biological Networks; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 267-279; Publication type: Journal Article
Pub Date: 2026-02-01; Epub Date: 2026-02-02; DOI: 10.1177/15578666251391211
Xiang Shi, Jiayi Kang, Nan Sun, Xin Zhao, Stephen S-T Yau
The rapid expansion of biological data in recent decades has highlighted the need for efficient methods in sequence analysis. Traditional pairwise alignment approaches are both time-consuming and memory-intensive. Alignment-free (AF) methods such as natural vector (NV) and k-mer operate on a one-dimensional framework, interpreting DNA primarily as a linear string of nucleotides. To achieve a more comprehensive interpretation of molecular structure, this study incorporates the three-dimensional architectural features of DNA and introduces a novel AF method named multi-perspective natural vector (MNV). The MNV method maps genome sequences of varying lengths to points within a unified geometric space, facilitating large-scale data processing tasks such as variant classification and clustering. Across datasets of different sizes and types, MNV attains a 100% convex hull separation ratio in lower dimensions than the widely used NV and k-mer methods. In neural network classification, MNV achieves better classification accuracy of 99.55% and 98.78% on SARS-CoV-2 and poliovirus datasets, respectively, demonstrating its effectiveness in viral genome analysis while maintaining computational efficiency.
Title: Multi-Perspective Natural Vector: A Novel Method for Viral Sequence Feature Extraction; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 255-266; Publication type: Journal Article
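For orientation, the classical natural vector (NV) that the MNV abstract above builds on can be sketched in a few lines. This is the standard 12-dimensional NV (per-nucleotide count, mean position, and normalised second central moment), not the multi-perspective variant, whose three-dimensional features the abstract does not specify. The function name and the choice to ignore non-ACGT characters are assumptions made for illustration.

```python
def natural_vector(sequence):
    """Return the 12-dimensional natural vector of a DNA string.

    Layout: [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T], where
    n_k is the count of nucleotide k, mu_k its mean 1-based position, and
    D2_k = sum((pos - mu_k)^2) / (n_k * n) with n the total ACGT count.
    Characters other than A/C/G/T are skipped (an assumption of this sketch).
    """
    seq = sequence.upper()
    positions = {base: [] for base in "ACGT"}
    for index, char in enumerate(seq, start=1):
        if char in positions:
            positions[char].append(index)
    n = sum(len(p) for p in positions.values())

    counts, means, moments = [], [], []
    for base in "ACGT":
        pos = positions[base]
        n_k = len(pos)
        if n_k == 0:
            # Absent nucleotide: all three components are set to zero.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(pos) / n_k
        d2 = sum((p - mu) ** 2 for p in pos) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


vec = natural_vector("AACG")
print(vec)
```

Because every sequence maps to a point of the same dimension regardless of its length, vectors from different genomes can be compared directly with ordinary distances or fed to a classifier.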
Pub Date: 2026-02-01; Epub Date: 2026-01-30; DOI: 10.1177/15578666251406305
Vy Q Ong, Bich N Choi, Devin P Lundy, Yingnan Zhang, Yin Wan, Hongyan Xu, Santu Ghosh, Hui Jiang, Yang Shi
Quadratic forms of multivariate normal variables play a critical role in statistical applications, particularly in genomics and bioinformatics. However, accurately computing small right-tail probabilities (p-values) for large-scale quadratic forms is computationally challenging due to the intractability of their probability distributions, as well as significant numerical constraints and computational burdens. To address these problems, we propose Markov chain Monte Carlo cross-entropy (MCMC-CE), an innovative algorithm that integrates MCMC sampling with the CE method, coupled with leading eigenvalue extraction and Satterthwaite-type approximation techniques. Our approach efficiently estimates small p-values for quadratic forms with their ranks exceeding 10,000. Through extensive simulation studies and real-world applications in genomics, including genome-wide association studies and pathway enrichment analyses, our method demonstrates advantageous numerical accuracy and computational reliability compared with existing approaches such as Davies', Imhof's, Farebrother's, Liu-Tang-Zhang's, and saddlepoint approximation methods. MCMC-CE provides a robust and scalable solution for accurately computing small p-values for quadratic forms, facilitating more precise statistical inference in large-scale genomic studies.
Title: MCMC-CE: A Novel and Efficient Algorithm for Estimating Small Right-Tail Probabilities of Quadratic Forms with Applications in Genomics; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 236-254; Publication type: Journal Article
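The MCMC-CE abstract above mentions a Satterthwaite-type approximation as one ingredient. A minimal sketch of the textbook version follows, assuming the quadratic form can be written as Q = Σ λ_i χ²_1 with independent terms and positive eigenvalues λ_i: Q is approximated by a scaled chi-square a·χ²_d whose mean and variance match those of Q. How the authors combine this with MCMC sampling and the cross-entropy method is not described in the abstract and is not reproduced here; the function name is hypothetical.

```python
def satterthwaite_parameters(eigenvalues):
    """Match mean and variance of Q = sum(lam_i * chi2_1) with a * chi2_d.

    E[Q] = sum(lam) and Var[Q] = 2 * sum(lam^2), while a * chi2_d has mean
    a * d and variance 2 * a^2 * d. Solving gives
    a = sum(lam^2) / sum(lam) and d = sum(lam)^2 / sum(lam^2).
    """
    if not eigenvalues or any(lam <= 0 for lam in eigenvalues):
        raise ValueError("this sketch assumes a non-empty list of positive eigenvalues")
    s1 = sum(eigenvalues)
    s2 = sum(lam * lam for lam in eigenvalues)
    scale = s2 / s1
    dof = s1 * s1 / s2
    return scale, dof


lams = [2.0, 1.0, 1.0]
scale, dof = satterthwaite_parameters(lams)
print(scale, dof)
```

With these parameters, a right-tail probability P(Q > q) would be approximated by the chi-square survival function with `dof` degrees of freedom evaluated at `q / scale`. Two-moment matching tends to be least accurate far in the tail, which is the regime the abstract targets with its additional machinery.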
Pub Date: 2026-01-30; DOI: 10.1177/15578666251410587
Inass Soukarieh, Gerhard Hessler, Hervé Minoux, Marcel Mohr, Friedemann Schmidt, Jan Wenzel, Pierre Barbillon, Hugo Gangloff, Pierre Gloaguen
Mathematical modeling in systems toxicology enables a comprehensive understanding of the effects of pharmaceutical substances on cardiac health. However, the complexity of these models limits their widespread application in early drug discovery. In this article, we introduce a novel approach to solving parameterized models of cardiac action potentials (CAPs) by combining meta-learning techniques with systems biology-informed neural networks (SBINNs). The proposed method, hyperSBINN, effectively addresses the challenge of predicting the effects of various compounds at different concentrations on CAPs, outperforming traditional differential equation solvers in speed. Our model efficiently handles scenarios with limited data and complex parameterized differential equations. The hyperSBINN model demonstrates robust performance in predicting APD90 values, indicating its potential as a reliable tool for modeling cardiac electrophysiology and aiding in preclinical drug development. This framework represents an advancement in computational modeling, offering a scalable and efficient solution for simulating and understanding complex biological systems.
Title: HyperSBINN: A Hypernetwork-Enhanced Systems Biology-Informed Neural Network for Efficient Drug Cardiosafety Assessment; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 15578666251410587; Publication type: Journal Article
Pub Date: 2026-01-29; DOI: 10.1177/15578666261416557
Enrico Rossignolo, Matteo Comin
A core task in computational genomics is transforming input sequences into their constituent k-mers. Efficiently storing these k-mer collections is crucial for scaling bioinformatics workflows. A common strategy involves representing the k-mers as a de Bruijn graph (dBG) and deriving a compact plain text form through a minimum path cover. In this article, we introduce USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), a fast and space-efficient algorithm for compressing multiple k-mer sets. USTAR-CR exploits the structural properties of colored dBGs to construct a succinct plain text representation while also incorporating an effective scheme for encoding k-mer color information. We evaluate USTAR-CR on real sequencing datasets and benchmark it against the state-of-the-art tool GGCAT. USTAR-CR achieves superior compression ratios, significantly reduces memory usage, and offers substantial speed improvements (up to 64× faster), highlighting its effectiveness for large-scale genomic data processing.
Title: USTAR-CR: Efficient and Compact Compression of k-Mer Sets Through Colored de Bruijn Graphs; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 15578666261416557; Publication type: Journal Article
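The objects named in the USTAR-CR abstract above (k-mer sets, a de Bruijn graph over them, and per-k-mer colors) can be illustrated with a small sketch. It builds a node-centric graph in which k-mers are nodes and an edge joins two k-mers that overlap by k-1 characters, and it records which input samples contain each k-mer. Reverse complements and canonical k-mers are ignored for brevity, and the path-cover and color-reordering steps of USTAR-CR itself are not reproduced; all function names are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of distinct length-k substrings of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def de_bruijn_edges(kmer_set):
    """Edges (u, v) such that the last k-1 characters of u start v."""
    by_prefix = defaultdict(list)
    for km in kmer_set:
        by_prefix[km[:-1]].append(km)
    edges = set()
    for km in kmer_set:
        for successor in by_prefix.get(km[1:], []):
            edges.add((km, successor))
    return edges


def colored_kmers(samples, k):
    """Map each k-mer to the set of sample names (its 'colors') containing it."""
    colors = defaultdict(set)
    for name, sequence in samples.items():
        for km in kmers(sequence, k):
            colors[km].add(name)
    return dict(colors)


samples = {"s1": "ACGTAC", "s2": "CGTA"}
colors = colored_kmers(samples, 3)
edges = de_bruijn_edges(set(colors))
print(sorted(colors.items()))
print(sorted(edges))
```

Storing every k-mer separately repeats k-1 characters per overlap; covering the graph with as few paths as possible and writing each path as one string removes that redundancy, which is the motivation for the path-cover formulation mentioned in the abstract.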
Pub Date: 2026-01-01; Epub Date: 2026-01-30; DOI: 10.1177/15578666251383562
Christy Lee, Dongyuan Song, Siqi Chen, Jingyi Jessica Li
Typical pipelines for single-cell and spatial transcriptomics involve clustering cells or spatial spots, followed by post-clustering differential expression (DE) analysis to identify marker genes for annotating clusters as cell types or spatial domains. However, using the same data for both clustering and DE analysis, a problem known as double-dipping, can lead to spurious detection of DE genes. In particular, over-clustering can produce artificial clusters that are incorrectly interpreted as distinct cell types or spatial domains. To address this issue, the ClusterDE R package implements a statistical method using a synthetic null dataset, which consists of a single homogeneous cell population or spatial domain but is constructed to match the real dataset in terms of gene means, variances, and gene-gene rank correlations. By serving as a parallel negative control, the synthetic null data allow users to identify and remove false-positive DE genes arising from double-dipping. This article introduces the ClusterDE R package and provides practical guidance on installation and usage for more reliable marker gene detection following clustering.
Title: ClusterDE: A Statistical Software Package for Removing Double-Dipping Bias in Post-Clustering Differential Expression Analysis; Journal: Journal of Computational Biology (impact factor 1.6); Pages: 36-42; Publication type: Journal Article
Pub Date: 2026-01-01; Epub Date: 2026-01-30; DOI: 10.1177/15578666251374694
Leah L Weber, Anna Hart, Idoia Ochoa, Mohammed El-Kebir
Cancer arises through an evolutionary process in which somatic mutations, including single-nucleotide variants (SNVs) and copy number aberrations (CNAs), drive the development of a malignant, heterogeneous tumor. Reconstructing this evolutionary history from sequencing data is critical for understanding the order in which mutations are acquired and the dynamic interplay between different types of alterations. Advances in modern whole-genome single-cell sequencing now enable the accurate inference of copy number profiles in individual cells. However, the low sequencing coverage of these low-pass sequencing technologies poses a challenge for reliably inferring the presence or absence of SNVs within tumor cells, limiting the ability to simultaneously study the evolutionary relationships between SNVs and CNAs. In this work, we introduce a novel tumor phylogeny inference method, Pharming, that infers the joint evolutionary history of SNVs and CNAs. Our key insight is to leverage the high accuracy of copy number inference methods and the fact that SNVs co-occur in regions with CNAs in order to enable more precise tumor phylogeny reconstruction for both alteration types. We demonstrate via simulations that Pharming outperforms state-of-the-art tumor phylogeny inference methods. Additionally, we apply Pharming to a triple-negative breast cancer case, achieving high-resolution, joint reconstruction of CNA and SNV evolution, including the de novo detection of a clonal whole-genome duplication event. Thus, Pharming offers the potential for more comprehensive and detailed tumor phylogeny inference for high-throughput, low-coverage single-cell DNA sequencing technologies compared with existing approaches.
Pharming: Joint Clonal Tree Reconstruction of Single-Nucleotide Variant and Copy Number Aberration Evolution from Single-Cell DNA Sequencing of Tumors. Leah L Weber, Anna Hart, Idoia Ochoa, Mohammed El-Kebir. Journal of Computational Biology, pp. 2-18, published 2026-01-01. DOI: 10.1177/15578666251374694