Journal of Computational Biology: Latest Articles

Shortest Hyperpaths in Directed Hypergraphs for Reaction Pathway Inference.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-11-01 Epub Date: 2023-10-31 DOI: 10.1089/cmb.2023.0242

Spencer Krieger, John Kececioglu

Signaling and metabolic pathways, which consist of chains of reactions that produce target molecules from source compounds, are cornerstones of cellular biology. Properly modeling the reaction networks that represent such pathways requires directed hypergraphs, where each molecule or compound maps to a vertex, and each reaction maps to a hyperedge directed from its set of input reactants to its set of output products. Inferring the most likely series of reactions that produces a given set of targets from a given set of sources, where for each reaction its reactants are produced by prior reactions in the series, corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. We give the first exact algorithm for general shortest hyperpaths that can find provably optimal solutions for large, real-world reaction networks. In particular, we derive a novel graph-theoretic characterization of hyperpaths, which we leverage in a new integer linear programming formulation of shortest hyperpaths that for the first time handles cycles, and develop a cutting-plane algorithm that can solve this integer linear program to optimality in practice. Through comprehensive experiments over all of the thousands of instances from the standard Reactome and NCI-PID reaction databases, we demonstrate that our cutting-plane algorithm quickly finds an optimal hyperpath, inferring the most likely pathway, with a median running time of under 10 seconds and a maximum time of less than 30 minutes, even on instances with thousands of reactions. We also explore for the first time how well hyperpaths infer true pathways, and show that shortest hyperpaths accurately recover known pathways, typically with very high precision and recall. Source code implementing our cutting-plane algorithm for shortest hyperpaths is available free for research use in a new tool called Mmunin.
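
To make the hypergraph model in this abstract concrete, the following is a minimal sketch of how a reaction can be stored as a hyperedge and how a proposed series of reactions can be checked against the stated condition that every reaction's reactants come from the sources or from earlier reactions. The names `Hyperedge`, `is_valid_series`, and `series_length` are hypothetical and chosen for illustration; this is not the authors' integer linear program or cutting-plane algorithm, only a validity check and a length computation for a given series.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Hyperedge(NamedTuple):
    """One reaction: directed from its reactant set (tail) to its product set (head)."""
    tail: FrozenSet[str]
    head: FrozenSet[str]
    weight: float = 1.0  # illustrative default; any non-negative weight works


def is_valid_series(edges: List[Hyperedge],
                    sources: Iterable[str],
                    targets: Iterable[str]) -> bool:
    """Check that each reaction's reactants are available when it fires,
    and that all targets are produced by the end of the series."""
    available = set(sources)
    for edge in edges:
        if not edge.tail <= available:
            return False  # some reactant has not been produced yet
        available |= edge.head
    return set(targets) <= available


def series_length(edges: List[Hyperedge]) -> float:
    """Length of a series, taken here as the sum of its hyperedge weights."""
    return sum(edge.weight for edge in edges)
```

For example, with reactions `A + B -> C` and `C -> D`, the series is valid from sources `{A, B}` to target `{D}` only when the reactions appear in that order.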

Citations: 0

A Python Package itca for Information-Theoretic Classification Accuracy: A Criterion That Guides Data-Driven Combination of Ambiguous Outcome Labels in Multiclass Classification.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-11-01 Epub Date: 2023-11-06 DOI: 10.1089/cmb.2023.0191

Chihao Zhang, Shihua Zhang, Jingyi Jessica Li

The itca Python package offers an information-theoretic criterion to assist practitioners in combining ambiguous outcome labels by balancing the tradeoff between prediction accuracy and classification resolution. This article provides instructions for installing the itca Python package, demonstrates how to evaluate the criterion, and showcases its application in real-world scenarios for guiding the combination of ambiguous outcome labels.

Citations: 0

Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-11-01 Epub Date: 2023-10-30 DOI: 10.1089/cmb.2023.0186

Ghanshyam Chandra, Chirag Jain

A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
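
As background for the chaining step described above, here is a minimal sketch of colinear chaining between two plain sequences with a linear gap cost, using a simple quadratic-time dynamic program. It is not the sparse, path-cover-based graph algorithm of the article. The anchor format `(x, y, w)`, the function name `chain_score`, and the particular gap definition are assumptions made for illustration.

```python
from typing import List, Tuple

# An anchor (x, y, w) is an exact match of length w >= 1 starting at
# position x in the query and position y in the reference (assumed format).
Anchor = Tuple[int, int, int]


def chain_score(anchors: List[Anchor], gap_penalty: float = 1.0) -> float:
    """Best score of a colinear chain of non-overlapping anchors.

    A chain's score is the total matched length minus gap_penalty times
    the sum of gaps, where the gap between consecutive anchors is the
    difference between their query offset and their reference offset
    (an illustrative linear gap cost).
    """
    ordered = sorted(anchors)
    best: List[float] = []  # best[i] = best chain score ending at ordered[i]
    for i, (xi, yi, wi) in enumerate(ordered):
        score = float(wi)  # chain consisting of this anchor alone
        for j in range(i):
            xj, yj, wj = ordered[j]
            # anchor j may precede anchor i only if it ends before i starts
            # in both the query and the reference
            if xj + wj <= xi and yj + wj <= yi:
                gap = abs((xi - xj) - (yi - yj))
                score = max(score, best[j] + wi - gap_penalty * gap)
        best.append(score)
    return max(best, default=0.0)
```

Raising `gap_penalty` makes the chain prefer anchors that lie on the same diagonal, which is the practical effect that gap-sensitive formulations aim for.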

Citations: 0

Counting Distinguishable RNA Secondary Structures.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-10-01 Epub Date: 2023-10-09 DOI: 10.1089/cmb.2022.0501

Masaru Nakajima, Andrew D Smith

RNA secondary structures are essential abstractions for understanding spatial folding behaviors of those macromolecules. Many secondary structure algorithms involve a common dynamic programming setup to exploit the property that secondary structures can be decomposed into substructures. Dirks et al. noted that this setup cannot directly address an issue of distinguishability among secondary structures, which arises for classes of sequences that admit nontrivial symmetry. Circular sequences are among these. We examine the problem of counting distinguishable secondary structures. Drawing from elementary results in group theory, we identify useful subsets of secondary structures. We then extend an algorithm due to Hofacker et al. for computing the sizes of these subsets. This yields a cubic-time algorithm to count distinguishable structures compatible with a given circular sequence. Furthermore, this general approach may be used to solve similar problems for which the RNA structures of interest involve symmetries.
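
The "common dynamic programming setup" mentioned above can be illustrated with the classic recursion that counts all secondary structures compatible with a linear sequence: position i is either unpaired, or paired with some k, which splits the problem into an inner and an outer substructure. The sketch below assumes canonical and wobble pairs and a minimum hairpin loop length. It does not handle circular sequences or the symmetry corrections that the article addresses, and the name `count_structures` is hypothetical.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_structures(seq: str, min_loop: int = 3) -> int:
    """Count secondary structures (including the empty one) compatible with
    a linear sequence, requiring at least min_loop unpaired bases between
    paired positions. Runs in cubic time in the sequence length."""
    n = len(seq)
    # table[i][j] = number of structures on the half-open interval seq[i:j];
    # intervals of length 0 or 1 admit only the empty structure.
    table = [[1] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            total = table[i + 1][j]  # case: position i is unpaired
            for k in range(i + min_loop + 1, j):
                if (seq[i], seq[k]) in PAIRS:
                    # case: i pairs with k; count inside times outside
                    total += table[i + 1][k] * table[k + 1][j]
            table[i][j] = total
    return table[0][n]
```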

Citations: 0

Numerical Analysis of Split-Step Backward Euler Method with Truncated Wiener Process for a Stochastic Susceptible-Infected-Susceptible Model.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-10-01 Epub Date: 2023-10-09 DOI: 10.1089/cmb.2022.0462

Xiaochen Yang, Zhanwen Yang, Chiping Zhang

This article deals with the numerical positivity, boundedness, convergence, and dynamical behaviors of a stochastic susceptible-infected-susceptible (SIS) model. To guarantee the biological significance of the split-step backward Euler method applied to the stochastic SIS model, the numerical positivity and boundedness are investigated by means of a truncated Wiener process. Motivated by the almost sure boundedness of exact and numerical solutions, the convergence is discussed by the fundamental convergence theorem with a local Lipschitz condition. Moreover, the numerical extinction and persistence are initially obtained by an exponential presentation of the stochastic stability function and the strong law of large numbers for martingales, which reproduces the existing theoretical results. Finally, numerical examples are given to validate our numerical results for the stochastic SIS model.
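
The abstract does not state the model equation, so the sketch below assumes one common form of the stochastic SIS model, dI = I(βN − μ − γ − βI) dt + σ I (N − I) dB, and shows how a split-step backward Euler step could look for it: an implicit drift step (a quadratic equation with a closed-form positive root), followed by an explicit diffusion step driven by a clipped Wiener increment. The function name `ssbe_sis_path`, the parameter names, and the simple clipping rule are assumptions; the article's truncation may be defined differently. For this assumed form, clipping with σ·N·trunc < 1 keeps every iterate strictly inside (0, N).

```python
import math
import random
from typing import List


def ssbe_sis_path(i0: float, n_pop: float, beta: float, removal: float,
                  sigma: float, h: float, steps: int, trunc: float,
                  seed: int = 0) -> List[float]:
    """Simulate infected counts with a split-step backward Euler scheme.

    removal stands for mu + gamma in the assumed model; trunc is the bound
    applied to each Wiener increment (choose sigma * n_pop * trunc < 1).
    """
    rng = random.Random(seed)
    a = beta * n_pop - removal
    x = i0
    path = [x]
    for _ in range(steps):
        # Implicit drift step: solve x_star = x + h * x_star * (a - beta * x_star),
        # i.e. h*beta*x_star^2 + (1 - h*a)*x_star - x = 0, and take the positive root.
        b = 1.0 - h * a
        x_star = (-b + math.sqrt(b * b + 4.0 * h * beta * x)) / (2.0 * h * beta)
        # Truncated Wiener increment: a normal draw clipped to [-trunc, trunc].
        dw = rng.gauss(0.0, math.sqrt(h))
        dw = max(-trunc, min(trunc, dw))
        # Explicit diffusion step from the intermediate value.
        x = x_star + sigma * x_star * (n_pop - x_star) * dw
        path.append(x)
    return path
```

The implicit step always lands in (0, N) when the previous value does, and the clipped increment then multiplies both x_star and N − x_star by positive factors, which is why positivity and boundedness are preserved in this sketch.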

Citations: 0

Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-10-01 Epub Date: 2023-09-20 DOI: 10.1089/cmb.2022.0237

Hongliang Zou, Wanting Yu

Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet experimental approaches are the first choice to identify PVPs, they are costly and time-consuming. Thus, researchers have turned their attention to computational models, which can speed up related studies. In the current study, we therefore propose a novel machine-learning model to identify PVPs. First, 50 different types of physicochemical properties were used to denote protein sequences. Next, two different approaches, Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture the high-order correlation information, we used PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, these chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. This indicates that the proposed computational model may become a powerful predictor in identifying PVPs.

Citations: 0

Weighted Selection Probability to Prioritize Susceptible Rare Variants in Multi-Phenotype Association Studies with Application to a Soybean Genetic Data Set.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-10-01 DOI: 10.1089/cmb.2022.0487

Xianglong Liang, Hokeun Sun

Rare variant association studies with multiple traits or diseases have drawn a lot of attention since association signals of rare variants can be boosted if more than one phenotype outcome is associated with the same rare variants. Most of the existing statistical methods to identify rare variants associated with multiple phenotypes are based on a group test, where a pre-specified genetic region is tested one at a time. However, these methods are not designed to locate susceptible rare variants within the genetic region. In this article, we propose new statistical methods to prioritize rare variants within a genetic region when a group test for the genetic region identifies a statistical association with multiple phenotypes. It computes the weighted selection probability (WSP) of individual rare variants and ranks them from largest to smallest according to their WSP. In simulation studies, we demonstrated that the proposed method outperforms other statistical methods in terms of true positive selection, when multiple phenotypes are correlated with each other. We also applied it to our soybean single nucleotide polymorphism (SNP) data with 13 highly correlated amino acids, where we identified some potentially susceptible rare variants in chromosome 19.

Citations: 0

Identifying Subpopulations of Cells in Single-Cell Transcriptomic Data: A Bayesian Mixture Modeling Approach to Zero Inflation of Counts.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-10-01 DOI: 10.1089/cmb.2022.0273

Tom Wilson, Duong H T Vo, Thomas Thorne

In the study of single-cell RNA-seq (scRNA-Seq) data, a key component of the analysis is to identify subpopulations of cells in the data. A variety of approaches to this have been considered, and although many machine learning-based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but scRNA-Seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model that employs both a mixture at the cell level to model multiple populations of cells and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach, we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model scRNA-Seq counts and negative binomial models that do not take into account zero inflation. Applied to a publicly available data set of scRNA-Seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish subpopulations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a subpopulation.
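
To illustrate what zero inflation means for count data, the sketch below evaluates a zero-inflated negative binomial probability mass function: with probability `pi` a count is a structural zero (dropout), and otherwise it is drawn from a negative binomial. The mean/dispersion parameterization and the names `nb_pmf` and `zinb_pmf` are assumptions for illustration; the article's Dirichlet process mixture and its inference procedure are not reproduced here.

```python
import math


def nb_pmf(k: int, mu: float, r: float) -> float:
    """Negative binomial probability of count k, with mean mu > 0 and
    dispersion r > 0 (variance mu + mu**2 / r)."""
    log_p = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
             + r * math.log(r / (r + mu))
             + k * math.log(mu / (r + mu)))
    return math.exp(log_p)


def zinb_pmf(k: int, mu: float, r: float, pi: float) -> float:
    """Zero-inflated negative binomial: an extra point mass pi at zero,
    mixed with a negative binomial of weight 1 - pi."""
    base = (1.0 - pi) * nb_pmf(k, mu, r)
    return pi + base if k == 0 else base
```

Compared with a plain negative binomial, the extra mass at zero lets the model explain many observed zeros without forcing the mean expression of a gene toward zero.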

Citations: 0

Constructing Double Cyclic Codes over $F_{2} + u F_{2}$ for DNA Codes.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-10-01 DOI: 10.1089/cmb.2022.0151

Arunothai Kanlaya, Chakkrid Klin-Eam

In this article, we investigate the algebraic structure of double cyclic codes of length $(α, β)$ over $F_{2} + u F_{2}$ with $u^{2} = 0$ and construct DNA codes from these codes. The theory of constructing double cyclic codes suitable for DNA codes is studied. We provide the necessary and sufficient conditions for the double cyclic codes to be reversible and reversible-complement codes. As an illustration, we present some of the DNA codes generated from our results.
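
The two properties named in the abstract can be checked directly on any finite set of DNA codewords: a code is reversible if the reverse of every codeword is also a codeword, and reversible-complement if the reverse complement of every codeword is also a codeword. The sketch below performs that brute-force check; the function names are hypothetical, and it does not implement the algebraic conditions on double cyclic codes that the article derives.

```python
from typing import Set

# Watson-Crick complement of each nucleotide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse(word: str) -> str:
    """Return the codeword read backwards."""
    return word[::-1]


def reverse_complement(word: str) -> str:
    """Return the codeword read backwards with each base complemented."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def is_reversible(code: Set[str]) -> bool:
    """True if the code is closed under reversal."""
    return all(reverse(word) in code for word in code)


def is_reversible_complement(code: Set[str]) -> bool:
    """True if the code is closed under reverse complementation."""
    return all(reverse_complement(word) in code for word in code)
```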

Citations: 0

Reconstruction of Viral Variants via Monte Carlo Clustering.

IF 1.7 · Zone 4 (Biology) · Q2 Mathematics

Journal of Computational Biology

Pub Date : 2023-09-01 Epub Date: 2023-09-11 DOI: 10.1089/cmb.2023.0154

Akshay Juyal, Roya Hosseini, Daniel Novikov, Mark Grinshpon, Alex Zelikovsky

Identifying viral variants through clustering is essential for understanding the composition and structure of viral populations within and between hosts, which play a crucial role in disease progression and epidemic spread. This article proposes and validates novel Monte Carlo (MC) methods for clustering aligned viral sequences by minimizing either entropy or Hamming distance from consensuses. We validate these methods on four benchmarks: two SARS-CoV-2 interhost data sets and two HIV intrahost data sets. A parallelized version of our tool is scalable to very large data sets. We show that both entropy and Hamming distance-based MC clusterings discern the meaningful information from sequencing data. The proposed clustering methods consistently converge to similar clusterings across different runs. Finally, we show that MC clustering improves reconstruction of intrahost viral population from sequencing data.
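
One way to picture Monte Carlo clustering of aligned sequences by Hamming distance from consensuses is the following sketch: start from a random assignment, repeatedly propose moving one sequence to a random cluster, and keep the move only if the total distance of sequences to their cluster consensus does not increase. The greedy acceptance rule, the function names, and the full cost recomputation at each step are assumptions made for brevity; the article's actual sampling scheme, entropy objective, and parallelization are not shown.

```python
import random
from collections import Counter
from typing import List, Tuple


def consensus(seqs: List[str]) -> str:
    """Column-wise most frequent character of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def clustering_cost(seqs: List[str], labels: List[int], k: int) -> int:
    """Total Hamming distance of every sequence to its cluster consensus."""
    cost = 0
    for c in range(k):
        members = [s for s, lab in zip(seqs, labels) if lab == c]
        if members:
            cons = consensus(members)
            cost += sum(hamming(s, cons) for s in members)
    return cost


def mc_cluster(seqs: List[str], k: int, iters: int = 2000,
               seed: int = 0) -> Tuple[List[int], int]:
    """Randomized local search: accept a single-sequence move only if the
    clustering cost does not increase (ties are accepted)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in seqs]
    cost = clustering_cost(seqs, labels, k)
    for _ in range(iters):
        i = rng.randrange(len(seqs))
        old = labels[i]
        labels[i] = rng.randrange(k)
        new_cost = clustering_cost(seqs, labels, k)
        if new_cost <= cost:
            cost = new_cost
        else:
            labels[i] = old  # reject the move
    return labels, cost
```

Running the search from several seeds and comparing the resulting clusterings is a simple way to check the kind of run-to-run consistency the abstract reports.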

Citations: 0