Journal of Computational Biology: Latest Articles

Probability-Based Sequence Comparison Finds Pre-Eutherian Nuclear Mitochondrial DNA Segments in Mammalian Genomes.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-02-02 DOI: 10.1177/15578666261416560

Muyao Huang, Martin C Frith

The insertion of mitochondrial genome-derived DNA sequences into the nuclear genome is a frequent event in organismal evolution, resulting in nuclear-mitochondrial DNA segments (NUMTs), which serve as a significant driving force for genome evolution. Once incorporated into the nuclear genome, some NUMTs can be conserved for extended periods and may potentially acquire novel cellular roles. However, current mainstream methods for detecting NUMTs are inefficient at identifying ancient and highly degraded NUMTs, leading to their prevalence and impact being underestimated. These ancient NUMTs likely play a far greater role in genetic functions than previously recognized, including contributing to the acquisition of functional exons. This study focuses on identifying ancient NUMTs in mammalian genomes using enhanced high-sensitivity sequence comparison methods. A sensitive and accurate NUMT searching pipeline was established, predicting 1013 NUMTs in the human reference genome, 364 (36%) of which are newly detected compared to the University of California, Santa Cruz (UCSC) reference human NUMTs database. Notably, 90 pre-eutherian human NUMTs were identified, representing significantly older NUMTs than previously reported, with origins dating back at least 100 million years. The most ancient mammalian NUMT could even date back over 160 million years, inserted into the nuclear genome of the common ancestor of therian mammals. This study provides a comprehensive exploration of the quantity and evolutionary history of mammalian NUMTs, paving the way for future research on endosymbiotic impact on the evolution of nuclear genomes.

Citations: 0

Reflections on the Presence of Human Cancer Genes in Primate Genomes.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-02-01 Epub Date: 2025-11-04 DOI: 10.1177/15578666251388496

Xiao Liang, Lenwood S Heath

Cancer genes, including oncogenes and tumor suppressor genes, are crucial subjects of study in human health. With advancements in nonhuman primate genome research, recent investigations have started to explore the relationship between human cancer genes and various primate species, implying possible links between cancer genes and nonhuman primate species. However, these findings are commonly limited to only a few primate species. Here, we investigate the distribution of 726 human cancer genes across 31 nonhuman primate species and 4 nonprimate species, comparing them with a randomly selected set of 339 other human genes. Overall, a higher ratio of cancer genes was found to have emerged before primate speciation compared with random human genes; however, this trend did not necessarily continue after primate speciation. Furthermore, our investigation into primate clades suggests that the absence of certain cancer genes in a clade may reflect recent emergence in other clades in which they exist, rather than elimination within the clades in which they are absent.

Citations: 0

A Probabilistic Algorithm for Gene-Species Reconciliation with Segmental Duplications.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-02-01 Epub Date: 2026-02-02 DOI: 10.1177/15578666251392595

Yao-Ban Chan, Celine Scornavacca, Michael Charleston

Reconciliations are a mathematical tool to compare the phylogenetic trees of genes to the species that contain them, accounting for events such as gene duplication and loss. Traditional reconciliation methods have predominantly relied on parsimony to infer gene-only evolutionary events and usually make the hypothesis that genes evolve independently. Recently, more advanced models have been developed that account for complex gene interactions stemming from phenomena such as segmental duplications, where multiple genes undergo simultaneous duplication. In this article, we study the NP-hard problem of reconciling gene trees to a species tree with segmental duplications, without the aid of synteny information. We address this problem by proposing a novel probabilistic approach, imposing a Boltzmann distribution over the space of reconciliations. This allows for a Gibbs sampling-like Markov chain Monte Carlo algorithm that uses simulated annealing to effectively find or approximate the most parsimonious reconciliation, as demonstrated through rigorous simulations and re-analysis of empirical datasets. Our findings present a promising new framework for addressing NP-hard reconciliation challenges in phylogenetics, enhancing our understanding of gene evolution and its relationship with species evolution.
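
The sampling idea in this abstract can be illustrated on a toy problem. The sketch below is not the authors' algorithm: the integer state space, the `cost` function, the neighbor rule, and every parameter value are assumptions made purely for illustration. It only shows how a Boltzmann distribution p(x) ∝ exp(-cost(x)/T), combined with a slowly decreasing temperature T, yields a simulated-annealing search for a low-cost state.

```python
import math
import random


def cost(x):
    # Hypothetical rugged cost: a parabola with its minimum at 37,
    # plus a penalty that creates local minima.
    return (x - 37) ** 2 + (0 if x % 3 == 1 else 10)


def neighbors(x):
    # Hypothetical move set: step one unit left or right inside [0, 50].
    return [y for y in (x - 1, x + 1) if 0 <= y <= 50]


def anneal(start, t0=2.0, cooling=0.995, steps=5000, seed=0):
    rng = random.Random(seed)
    state = best = start
    temp = t0
    for _ in range(steps):
        cand = rng.choice(neighbors(state))
        delta = cost(cand) - cost(state)
        # Metropolis acceptance: the stationary distribution at a fixed
        # temperature is the Boltzmann distribution exp(-cost / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = cand
        if cost(state) < cost(best):
            best = state
        # Geometric cooling concentrates the distribution on low-cost states.
        temp = max(temp * cooling, 1e-3)
    return best


best = anneal(start=0)
print(best, cost(best))
```

At high temperature the chain accepts some cost-increasing moves, which lets it leave local minima; as the temperature drops it behaves more and more like a greedy descent. A real reconciliation sampler would replace the integers with reconciliations and the unit steps with local rearrangements.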

Citations: 0

Bayesian Validation of Dynamic Systems for Biological Networks.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-02-01 Epub Date: 2026-02-02 DOI: 10.1177/15578666251382251

Donghui Son, Jaejik Kim

Dynamic systems encompass a broad class of mathematical models used to describe the behavior of complex networks or systems over time. One of the most common approaches to modeling such dynamics is through a set of ordinary differential equations (ODEs), typically constructed based on hypotheses, known interactions, or observed trajectories. However, ODEs are deterministic and inflexible, while biological data are typically noisy. Thus, the model fit might not account for all possible data variations, and there might be a discrepancy between the actual biological process and the assumed model. This discrepancy could lead to inaccuracies in the prediction and interpretation of the biological networks. Therefore, ODE models must be validated against observed data. Given that biological networks typically involve multiple sources of errors and uncertainties, the validation process should account for these factors. Bayesian approaches offer a robust framework for quantifying errors and uncertainties. Thus, in this study, we propose a Bayesian validation method for ODE models that addresses model inadequacy, presented as bias. Since the proposed method estimates bias as a function of time, it can provide prediction bounds for the entire observed time interval. Consequently, it allows for a direct evaluation of the model's validity across the whole time interval, and it can lead to better prediction by correcting the bias.

Citations: 0

Multi-Perspective Natural Vector: A Novel Method for Viral Sequence Feature Extraction.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-02-01 Epub Date: 2026-02-02 DOI: 10.1177/15578666251391211

Xiang Shi, Jiayi Kang, Nan Sun, Xin Zhao, Stephen S-T Yau

The rapid expansion of biological data in recent decades has highlighted the need for efficient methods in sequence analysis. Traditional pairwise alignment approaches are both time-consuming and memory-intensive. Alignment-free (AF) methods such as natural vector (NV) and k-mer operate on a one-dimensional framework, interpreting DNA primarily as a linear string of nucleotides. To achieve a more comprehensive interpretation of molecular structure, this study incorporates the three-dimensional architectural features of DNA and introduces a novel AF method named Multi-perspective natural vector (MNV). The MNV method maps genome sequences of varying lengths to points within a unified geometric space, facilitating large-scale data processing tasks such as variant classification and clustering. Across datasets of different sizes and types, MNV attains a 100% convex hull separation ratio in lower dimensions compared with the widely used NV and k-mer methods. In neural network classification, MNV achieves better classification accuracy, 99.55% and 98.78% on the SARS-CoV-2 and poliovirus datasets, respectively, demonstrating its effectiveness in viral genome analysis while maintaining computational efficiency.
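
For orientation, the one-dimensional baseline mentioned in this abstract can be sketched in a few lines. The code below is an assumption-laden illustration of the classic 12-dimensional natural vector as it is commonly defined (per-nucleotide count, mean position, and normalized second central moment of positions); it does not reproduce MNV, whose three-dimensional features are not described in the abstract. The function name and the convention of returning zeros for absent nucleotides are choices made here.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Classic 12-dimensional natural vector of a DNA string (a sketch).

    Returns counts, mean positions, and normalized second central moments,
    in that order, for each letter of `alphabet`. Positions are 1-based.
    """
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            # Convention assumed here: absent letters contribute zeros.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n so that sequences
        # of different lengths remain comparable.
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


print(natural_vector("AACG"))
```

Because every sequence maps to a vector of the same length, sequences of different lengths become points in one space, which is the property the abstract relies on for classification and clustering.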

Citations: 0

MCMC-CE: A Novel and Efficient Algorithm for Estimating Small Right-Tail Probabilities of Quadratic Forms with Applications in Genomics.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-02-01 Epub Date: 2026-01-30 DOI: 10.1177/15578666251406305

Vy Q Ong, Bich N Choi, Devin P Lundy, Yingnan Zhang, Yin Wan, Hongyan Xu, Santu Ghosh, Hui Jiang, Yang Shi

Quadratic forms of multivariate normal variables play a critical role in statistical applications, particularly in genomics and bioinformatics. However, accurately computing small right-tail probabilities (p-values) for large-scale quadratic forms is computationally challenging due to the intractability of their probability distributions, as well as significant numerical constraints and computational burdens. To address these problems, we propose Markov chain Monte Carlo cross-entropy (MCMC-CE), an innovative algorithm that integrates MCMC sampling with the CE method, coupled with leading eigenvalue extraction and Satterthwaite-type approximation techniques. Our approach efficiently estimates small p-values for quadratic forms with ranks exceeding 10,000. Through extensive simulation studies and real-world applications in genomics, including genome-wide association studies and pathway enrichment analyses, our method demonstrates advantageous numerical accuracy and computational reliability compared with existing approaches such as Davies', Imhof's, Farebrother's, Liu-Tang-Zhang's, and saddlepoint approximation methods. MCMC-CE provides a robust and scalable solution for accurately computing small p-values for quadratic forms, facilitating more precise statistical inference in large-scale genomic studies.
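
As background for the Satterthwaite-type step named in this abstract, the textbook two-moment approximation is easy to write down. The sketch below is not MCMC-CE. It assumes a central quadratic form Q = Σ λ_i Z_i² with independent standard normal Z_i and positive eigenvalues λ_i, and approximates Q by a scaled chi-square a·χ²_d whose mean and variance match those of Q. The function names are hypothetical, and the tail helper is exact only for even integer degrees of freedom.

```python
import math


def satterthwaite(eigenvalues):
    """Match the mean and variance of Q = sum(l_i * Z_i**2) with a * chi2_d.

    E[Q] = sum(l_i) and Var[Q] = 2 * sum(l_i**2), so
    a = sum(l_i**2) / sum(l_i) and d = sum(l_i)**2 / sum(l_i**2).
    """
    s1 = sum(eigenvalues)
    s2 = sum(l * l for l in eigenvalues)
    return s2 / s1, s1 * s1 / s2


def chi2_sf_even_df(x, df):
    """Exact P(chi2_df > x) for even integer df, via the closed-form sum
    exp(-x/2) * sum_{j < df/2} (x/2)**j / j!."""
    half = x / 2.0
    term = total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# With four unit eigenvalues, Q is exactly chi-square with 4 degrees of freedom.
scale, df = satterthwaite([1.0, 1.0, 1.0, 1.0])
p_value = chi2_sf_even_df(20.0 / scale, int(df))
print(scale, df, p_value)
```

A two-moment match of this kind is generally most trustworthy near the bulk of the distribution; for non-integer degrees of freedom a general chi-square tail routine would be needed instead of the closed-form helper used here.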

Citations: 0

HyperSBINN: A Hypernetwork-Enhanced Systems Biology-Informed Neural Network for Efficient Drug Cardiosafety Assessment.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-01-30 DOI: 10.1177/15578666251410587

Inass Soukarieh, Gerhard Hessler, Hervé Minoux, Marcel Mohr, Friedemann Schmidt, Jan Wenzel, Pierre Barbillon, Hugo Gangloff, Pierre Gloaguen

Mathematical modeling in systems toxicology enables a comprehensive understanding of the effects of pharmaceutical substances on cardiac health. However, the complexity of these models limits their widespread application in early drug discovery. In this article, we introduce a novel approach to solving parameterized models of cardiac action potentials (CAPs) by combining meta-learning techniques with systems biology-informed neural networks (SBINNs). The proposed method, hyperSBINN, effectively addresses the challenge of predicting the effects of various compounds at different concentrations on CAPs, outperforming traditional differential equation solvers in speed. Our model efficiently handles scenarios with limited data and complex parameterized differential equations. The hyperSBINN model demonstrates robust performance in predicting APD90 values, indicating its potential as a reliable tool for modeling cardiac electrophysiology and aiding in preclinical drug development. This framework represents an advancement in computational modeling, offering a scalable and efficient solution for simulating and understanding complex biological systems.

Citations: 0

USTAR-CR: Efficient and Compact Compression of k-Mer Sets Through Colored de Bruijn Graphs.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-01-29 DOI: 10.1177/15578666261416557

Enrico Rossignolo, Matteo Comin

A core task in computational genomics is transforming input sequences into their constituent k-mers. Efficiently storing these k-mer collections is crucial for scaling bioinformatics workflows. A common strategy involves representing the k-mers as a de Bruijn graph (dBG) and deriving a compact plain text form through a minimum path cover. In this article, we introduce USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), a fast and space-efficient algorithm for compressing multiple k-mer sets. USTAR-CR exploits the structural properties of colored dBGs to construct a succinct plain text representation while also incorporating an effective scheme for encoding k-mer color information. We evaluate USTAR-CR on real sequencing datasets and benchmark it against the state-of-the-art tool GGCAT. USTAR-CR achieves superior compression ratios, significantly reduces memory usage, and offers substantial speed improvements (up to 64× faster), highlighting its effectiveness for large-scale genomic data processing.
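
The path-cover strategy described in this abstract can be shown on a tiny example. The sketch below is not USTAR-CR: it uses a simple greedy extension rather than a minimum path cover, ignores reverse complements and colors, and assumes k ≥ 2; the function names are hypothetical. It only illustrates why spelling k-mers along paths of the de Bruijn graph yields a shorter plain-text representation than listing each k-mer separately.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_path_cover(kmer_set):
    """Cover every k-mer exactly once with strings built by greedy extension.

    Two k-mers are adjacent in the de Bruijn graph when the (k-1)-suffix of
    one equals the (k-1)-prefix of the other (assumes k >= 2).
    """
    unused = set(kmer_set)
    paths = []
    while unused:
        path = min(unused)  # deterministic choice of a starting k-mer
        unused.remove(path)
        k = len(path)
        while True:  # extend to the right while an unused successor exists
            nxt = next((path[-(k - 1):] + b for b in "ACGT"
                        if path[-(k - 1):] + b in unused), None)
            if nxt is None:
                break
            unused.remove(nxt)
            path += nxt[-1]
        while True:  # then extend to the left
            prv = next((b + path[:k - 1] for b in "ACGT"
                        if b + path[:k - 1] in unused), None)
            if prv is None:
                break
            unused.remove(prv)
            path = prv[0] + path
        paths.append(path)
    return paths


k = 3
kmer_set = kmers("ACGTACGGA", k)
paths = greedy_path_cover(kmer_set)
print(paths, sum(len(p) for p in paths), "characters instead of", k * len(kmer_set))
```

A path that spells m k-mers costs m + k - 1 characters instead of m·k, so fewer and longer paths mean better compression; this is why the abstract frames the problem as a minimum path cover.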

Citations: 0

ClusterDE: A Statistical Software Package for Removing Double-Dipping Bias in Post-Clustering Differential Expression Analysis.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-01-01 Epub Date: 2026-01-30 DOI: 10.1177/15578666251383562

Christy Lee, Dongyuan Song, Siqi Chen, Jingyi Jessica Li

Typical pipelines for single-cell and spatial transcriptomics involve clustering cells or spatial spots, followed by post-clustering differential expression (DE) analysis to identify marker genes for annotating clusters as cell types or spatial domains. However, using the same data for both clustering and DE analysis, a problem known as double-dipping, can lead to spurious detection of DE genes. In particular, over-clustering can produce artificial clusters that are incorrectly interpreted as distinct cell types or spatial domains. To address this issue, the ClusterDE R package implements a statistical method using a synthetic null dataset, which consists of a single homogeneous cell population or spatial domain but is constructed to match the real dataset in terms of gene means, variances, and gene-gene rank correlations. By serving as a parallel negative control, the synthetic null data allow users to identify and remove false-positive DE genes arising from double-dipping. This article introduces the ClusterDE R package and provides practical guidance on installation and usage for more reliable marker gene detection following clustering.

Citations: 0

Pharming: Joint Clonal Tree Reconstruction of Single-Nucleotide Variant and Copy Number Aberration Evolution from Single-Cell DNA Sequencing of Tumors.

IF 1.6 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology

Pub Date : 2026-01-01 Epub Date: 2026-01-30 DOI: 10.1177/15578666251374694

Leah L Weber, Anna Hart, Idoia Ochoa, Mohammed El-Kebir

Cancer arises through an evolutionary process in which somatic mutations, including single-nucleotide variants (SNVs) and copy number aberrations (CNAs), drive the development of a malignant, heterogeneous tumor. Reconstructing this evolutionary history from sequencing data is critical for understanding the order in which mutations are acquired and the dynamic interplay between different types of alterations. Advances in modern whole-genome single-cell sequencing now enable the accurate inference of copy number profiles in individual cells. However, the low sequencing coverage of these low-pass sequencing technologies poses a challenge for reliably inferring the presence or absence of SNVs within tumor cells, limiting the ability to simultaneously study the evolutionary relationships between SNVs and CNAs. In this work, we introduce a novel tumor phylogeny inference method, Pharming, that infers the joint evolutionary history of SNVs and CNAs. Our key insight is to leverage the high accuracy of copy number inference methods and the fact that SNVs co-occur in regions with CNAs in order to enable more precise tumor phylogeny reconstruction for both alteration types. We demonstrate via simulations that Pharming outperforms state-of-the-art tumor phylogeny inference methods. Additionally, we apply Pharming to a triple-negative breast cancer case, achieving high-resolution, joint reconstruction of CNA and SNV evolution, including the de novo detection of a clonal whole-genome duplication event. Thus, Pharming offers the potential for more comprehensive and detailed tumor phylogeny inference for high-throughput, low-coverage single-cell DNA sequencing technologies compared with existing approaches.
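To make the coverage problem concrete, here is a minimal sketch of why a single low-coverage cell says little about the presence of an SNV, and why knowing the copy number changes how its reads should be interpreted. The abstract does not describe Pharming's model, so this is not it: the sketch assumes a generic binomial read-count model, a fixed sequencing error rate, and hypothetical function names (`expected_vaf`, `presence_log_odds`, `pooled_log_odds`).

```python
import math


def expected_vaf(mutated_copies, total_copies, error_rate=0.001):
    """Expected fraction of variant reads at a locus, allowing for read errors."""
    if total_copies <= 0 or not 0 <= mutated_copies <= total_copies:
        raise ValueError("need 0 <= mutated_copies <= total_copies and total_copies > 0")
    vaf = mutated_copies / total_copies
    return vaf * (1.0 - error_rate) + (1.0 - vaf) * error_rate


def log_binom_pmf(k, n, p):
    """Log-probability of k variant reads out of n under Binomial(n, p)."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1.0 - p)


def presence_log_odds(alt, depth, mutated_copies, total_copies, error_rate=0.001):
    """Log-likelihood ratio: SNV present on `mutated_copies` copies vs absent."""
    p_present = expected_vaf(mutated_copies, total_copies, error_rate)
    p_absent = expected_vaf(0, total_copies, error_rate)
    return log_binom_pmf(alt, depth, p_present) - log_binom_pmf(alt, depth, p_absent)


def pooled_log_odds(observations, mutated_copies, total_copies, error_rate=0.001):
    """Sum the evidence over cells assumed to share one genotype and copy number."""
    return sum(
        presence_log_odds(alt, depth, mutated_copies, total_copies, error_rate)
        for alt, depth in observations
    )


# A cell with no reads at the locus carries no evidence either way.
print(presence_log_odds(alt=0, depth=0, mutated_copies=1, total_copies=2))

# One reference read is weaker evidence of absence at copy number 4 than at 2.
print(presence_log_odds(alt=0, depth=1, mutated_copies=1, total_copies=2))
print(presence_log_odds(alt=0, depth=1, mutated_copies=1, total_copies=4))

# Ten cells with one read each, four of them showing the variant.
clone_reads = [(1, 1)] * 4 + [(0, 1)] * 6
print(pooled_log_odds(clone_reads, mutated_copies=1, total_copies=2))
```

Under these assumptions a lone reference read barely shifts the odds, and it shifts them less as the copy number grows, whereas pooling cells that share a genotype accumulates usable evidence. This is one way to see why accurate copy number profiles can help when per-cell SNV evidence is sparse.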


Pages: 2-18

Citations: 0