Journal of Computational Biology: Latest Articles

A Computational Software for Training Robust Drug-Target Affinity Prediction Models: pydebiaseddta.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-11-01 | DOI: 10.1089/cmb.2023.0194
Melih Barsbey, Riza Özçelik, Alperen Bağ, Berk Atil, Arzucan Özgür, Elif Ozkirimli

Robust generalization of drug-target affinity (DTA) prediction models is a notoriously difficult problem in computational drug discovery. In this article, we present pydebiaseddta, a software package for improving the generalizability of DTA prediction models to novel ligands and/or proteins. pydebiaseddta is the practical implementation of the DebiasedDTA training framework, which advocates modifying the training distribution to mitigate the effect of spurious correlations in the training data set that lead to substantially degraded performance on novel ligands and proteins. Written in Python, pydebiaseddta combines a streamlined, user-friendly interface with a feature-rich and highly modifiable architecture. In this article, we introduce the software, showcase its main functionalities, and describe practical ways for new users to engage with it.

Citations: 0
A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-11-01 | DOI: 10.1089/cmb.2023.0208
Riza Özçelik, Alperen Bağ, Berk Atil, Melih Barsbey, Arzucan Özgür, Elif Ozkirimli

Statistical models that accurately predict the binding affinity of an input ligand-protein pair can greatly accelerate drug discovery. Such models are trained on available ligand-protein interaction data sets, which may contain biases that lead the predictor models to learn data set-specific, spurious patterns instead of generalizable relationships. This leads the prediction performances of these models to drop dramatically for previously unseen biomolecules. Various approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading overall prediction performance. In this article, we present DebiasedDTA, a novel training framework for drug-target affinity (DTA) prediction models that addresses data set biases to improve the generalizability of such models. DebiasedDTA relies on reweighting the training samples to achieve robust generalization, and is thus applicable to most DTA prediction models. Extensive experiments with different biomolecule representations, model architectures, and data sets demonstrate that DebiasedDTA achieves improved generalizability in predicting drug-target affinities.
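
The abstract above says only that DebiasedDTA reweights training samples. The following is a minimal sketch of what sample reweighting can look like in general, not the framework's actual procedure: the function names, the idea of deriving weights from the errors of a simple "guide" predictor, and the 1-D linear model are all assumptions made for illustration.

```python
def weighted_mse(preds, targets, weights):
    """Mean squared error in which each sample contributes in proportion to its weight."""
    total = sum(weights)
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights)) / total


def importance_weights(guide_errors, floor=1e-8):
    """Hypothetical weighting rule (an assumption, not taken from the paper):
    samples that a simple guide predictor gets wrong are weighted up.
    The returned weights are normalised to sum to 1."""
    raw = [e + floor for e in guide_errors]
    total = sum(raw)
    return [r / total for r in raw]


def fit_weighted_line(xs, ys, weights):
    """Closed-form weighted least squares for y ~ slope * x + intercept."""
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    slope = cov / var
    return slope, my - slope * mx


if __name__ == "__main__":
    xs, ys = [0, 1, 2, 3], [0, 1, 2, 10]
    print(fit_weighted_line(xs, ys, [1, 1, 1, 1]))  # uniform weights
    print(fit_weighted_line(xs, ys, [1, 1, 1, 0]))  # last sample ignored
```

Because the weights only enter the loss, the same idea carries over to any model trained by minimising a per-sample loss, which is consistent with the abstract's claim of broad applicability.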

Citations: 0
QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-11-01 | Epub Date: 2023-10-30 | DOI: 10.1089/cmb.2023.0185
Yasamin Tabatabaee, Sebastien Roch, Tandy Warnow

We address the problem of rooting an unrooted species tree given a set of unrooted gene trees, under the assumption that gene trees evolve within the model species tree under the multispecies coalescent (MSC) model. Quintet Rooting (QR) is a polynomial-time algorithm recently proposed for this problem. It is based on the theory developed by Allman, Degnan, and Rhodes, which proves the identifiability of rooted 5-taxon trees from unrooted gene trees under the MSC. However, although QR had good accuracy in simulations, its statistical consistency was left as an open problem. We present QR-STAR, a variant of QR with an additional step and a different cost function, and prove that it is statistically consistent under the MSC. Moreover, we derive sample complexity bounds for QR-STAR and show that a particular variant of it based on "short quintets" has polynomial sample complexity. Finally, our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open-source form on GitHub.
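
As a small illustration of why quintet-based methods can run in polynomial time, the sketch below enumerates all 5-taxon subsets of a taxon set and counts the candidate rootings of one unrooted binary tree. It is not the QR or QR-STAR algorithm; the function names are hypothetical, and exhaustive enumeration is an assumption (a method may sample only some quintets).

```python
from itertools import combinations
from math import comb


def quintets(taxa):
    """Yield every 5-taxon subset of `taxa` as a sorted tuple."""
    return combinations(sorted(taxa), 5)


def num_rootings(n_leaves):
    """An unrooted binary tree with n leaves has 2n - 3 edges,
    and placing the root on each edge gives one rooted tree."""
    return 2 * n_leaves - 3


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E", "F", "G"]
    print(len(list(quintets(taxa))), comb(len(taxa), 5))
    print(num_rootings(5))
```

The number of quintets grows as C(n, 5), which is O(n^5), and each unrooted quintet tree has only 7 possible root positions, so scoring every rooting of every quintet stays polynomial in the number of taxa.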

Citations: 0
Shortest Hyperpaths in Directed Hypergraphs for Reaction Pathway Inference.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-11-01 | Epub Date: 2023-10-31 | DOI: 10.1089/cmb.2023.0242
Spencer Krieger, John Kececioglu

Signaling and metabolic pathways, which consist of chains of reactions that produce target molecules from source compounds, are cornerstones of cellular biology. Properly modeling the reaction networks that represent such pathways requires directed hypergraphs, where each molecule or compound maps to a vertex, and each reaction maps to a hyperedge directed from its set of input reactants to its set of output products. Inferring the most likely series of reactions that produces a given set of targets from a given set of sources, where for each reaction its reactants are produced by prior reactions in the series, corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. We give the first exact algorithm for general shortest hyperpaths that can find provably optimal solutions for large, real-world reaction networks. In particular, we derive a novel graph-theoretic characterization of hyperpaths, which we leverage in a new integer linear programming formulation of shortest hyperpaths that for the first time handles cycles, and develop a cutting-plane algorithm that can solve this integer linear program to optimality in practice. Through comprehensive experiments over all of the thousands of instances from the standard Reactome and NCI-PID reaction databases, we demonstrate that our cutting-plane algorithm quickly finds an optimal hyperpath (inferring the most likely pathway) with a median running time of under 10 seconds and a maximum time of less than 30 minutes, even on instances with thousands of reactions. We also explore for the first time how well hyperpaths infer true pathways, and show that shortest hyperpaths accurately recover known pathways, typically with very high precision and recall. Source code implementing our cutting-plane algorithm for shortest hyperpaths is available free for research use in a new tool called Mmunin.
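
The hypergraph model described above can be sketched with a few lines of code. This is only an illustration of the data structure and of the ordering condition stated in the abstract (each reaction's reactants must already be available), not the paper's integer linear program or cutting-plane algorithm; the class and function names are hypothetical, and the check ignores any further conditions a formal hyperpath definition may impose.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Hyperedge(NamedTuple):
    """A reaction directed from its set of reactants (tail) to its set of products (head)."""
    name: str
    tail: FrozenSet[str]
    head: FrozenSet[str]
    weight: float = 1.0


def is_valid_series(edges: List[Hyperedge], sources: Iterable[str], targets: Iterable[str]) -> bool:
    """True if every reaction's reactants are sources or products of earlier
    reactions in the series, and all targets are eventually produced."""
    reachable = set(sources)
    for edge in edges:
        if not edge.tail <= reachable:
            return False
        reachable |= edge.head
    return set(targets) <= reachable


def series_length(edges: List[Hyperedge]) -> float:
    """Length of a series of reactions: the sum of its hyperedge weights."""
    return sum(edge.weight for edge in edges)


if __name__ == "__main__":
    r1 = Hyperedge("r1", frozenset({"A", "B"}), frozenset({"C"}))
    r2 = Hyperedge("r2", frozenset({"C"}), frozenset({"D"}))
    print(is_valid_series([r1, r2], {"A", "B"}, {"D"}))
    print(is_valid_series([r2, r1], {"A", "B"}, {"D"}))
```

Checking a given series is easy; the hard part, which the abstract notes is NP-complete, is choosing the cheapest valid set of reactions among all possibilities.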

Citations: 0
A Python Package itca for Information-Theoretic Classification Accuracy: A Criterion That Guides Data-Driven Combination of Ambiguous Outcome Labels in Multiclass Classification.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-11-01 | Epub Date: 2023-11-06 | DOI: 10.1089/cmb.2023.0191
Chihao Zhang, Shihua Zhang, Jingyi Jessica Li

The itca Python package offers an information-theoretic criterion to assist practitioners in combining ambiguous outcome labels by balancing the tradeoff between prediction accuracy and classification resolution. This article provides instructions for installing the itca Python package, demonstrates how to evaluate the criterion, and showcases its application in real-world scenarios for guiding the combination of ambiguous outcome labels.
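
To make the accuracy-versus-resolution tradeoff concrete, here is a minimal sketch of an entropy-weighted accuracy score. It assumes the criterion sums, over classes, the class's entropy contribution times its class-wise accuracy; that form is an assumption stated for illustration, and the function names below are hypothetical and are not the itca package's API.

```python
import math
from collections import Counter


def entropy_weighted_accuracy(y_true, y_pred):
    """Sum over classes k of (-p_k * log p_k) * accuracy_k, where p_k is the
    share of samples whose true label is k (assumed form of the criterion)."""
    n = len(y_true)
    counts = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    score = 0.0
    for label, count in counts.items():
        p = count / n
        score += -p * math.log(p) * correct[label] / count
    return score


def combine_labels(labels, mapping):
    """Relabel samples so that merged classes share one label."""
    return [mapping.get(label, label) for label in labels]


if __name__ == "__main__":
    y_true = ["a", "a", "b", "b", "c", "c"]
    y_pred = ["a", "a", "b", "c", "b", "c"]  # b and c are often confused
    print(entropy_weighted_accuracy(y_true, y_pred))
    merge = {"c": "b"}
    print(entropy_weighted_accuracy(combine_labels(y_true, merge),
                                    combine_labels(y_pred, merge)))
```

Merging two classes raises accuracy but lowers the entropy term, so a score of this shape rewards a merge only when the accuracy gain outweighs the lost resolution.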

Citations: 0
Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-11-01 | Epub Date: 2023-10-30 | DOI: 10.1089/cmb.2023.0186
Ghanshyam Chandra, Chirag Jain

A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
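
For readers unfamiliar with colinear chaining, the sketch below shows the classical two-sequence version with a linear gap cost, solved by a quadratic-time dynamic program. It is a baseline illustration only, not the graph algorithm from the paper; the anchor format `(query_start, ref_start, length)`, the gap definition, and the function name are assumptions.

```python
def chain_anchors(anchors, gap_weight=1.0):
    """Return (best_score, chain) for non-overlapping, colinear anchors.

    Each anchor is (query_start, ref_start, length). A chain's score is the
    sum of its anchor lengths minus gap_weight times the total gap (query gap
    plus reference gap) between consecutive anchors.
    """
    order = sorted(anchors)
    n = len(order)
    if n == 0:
        return 0.0, []
    best = [float(a[2]) for a in order]  # best chain score ending at each anchor
    prev = [-1] * n                      # back-pointers for chain recovery
    for i, (qi, ri, li) in enumerate(order):
        for j in range(i):
            qj, rj, lj = order[j]
            gap_q = qi - (qj + lj)
            gap_r = ri - (rj + lj)
            if gap_q < 0 or gap_r < 0:
                continue  # anchor j does not end before anchor i starts
            candidate = best[j] + li - gap_weight * (gap_q + gap_r)
            if candidate > best[i]:
                best[i] = candidate
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


if __name__ == "__main__":
    anchors = [(0, 0, 5), (6, 6, 5), (7, 100, 5), (12, 12, 5)]
    print(chain_anchors(anchors, gap_weight=0.5))
```

The gap penalty is what steers the chain away from the off-diagonal anchor at reference position 100; without gap costs, any colinear anchor would look equally good, which is the limitation the abstract attributes to the earlier framework.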

Citations: 0
Counting Distinguishable RNA Secondary Structures.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-10-01 | Epub Date: 2023-10-09 | DOI: 10.1089/cmb.2022.0501
Masaru Nakajima, Andrew D Smith

RNA secondary structures are essential abstractions for understanding spatial folding behaviors of those macromolecules. Many secondary structure algorithms involve a common dynamic programming setup to exploit the property that secondary structures can be decomposed into substructures. Dirks et al. noted that this setup cannot directly address an issue of distinguishability among secondary structures, which arises for classes of sequences that admit nontrivial symmetry. Circular sequences are among these. We examine the problem of counting distinguishable secondary structures. Drawing from elementary results in group theory, we identify useful subsets of secondary structures. We then extend an algorithm due to Hofacker et al. for computing the sizes of these subsets. This yields a cubic-time algorithm to count distinguishable structures compatible with a given circular sequence. Furthermore, this general approach may be used to solve similar problems for which the RNA structures of interest involve symmetries.
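
The "common dynamic programming setup" mentioned above can be illustrated with the standard recursion that counts secondary structures compatible with a linear sequence. This sketch does not handle circular sequences or symmetry, so it is the baseline the paper builds on rather than its contribution; the pairing rules, the default minimum loop length, and the function name are assumptions.

```python
from functools import lru_cache

# Watson-Crick pairs plus the G-U wobble pair (assumed pairing rules).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_structures(seq, min_loop=3, can_pair=None):
    """Count secondary structures (no crossing pairs) compatible with `seq`.

    A pair (k, j) needs at least `min_loop` unpaired bases between k and j.
    There are O(n^2) subintervals and O(n) choices per interval, so the
    running time is cubic in the sequence length.
    """
    if can_pair is None:
        can_pair = lambda a, b: (a, b) in PAIRS

    @lru_cache(maxsize=None)
    def count(i, j):
        if j - i < 1:  # empty interval or a single base: one structure
            return 1
        total = count(i, j - 1)  # case 1: base j is unpaired
        for k in range(i, j - min_loop):  # case 2: base j pairs with base k
            if can_pair(seq[k], seq[j]):
                total += count(i, k - 1) * count(k + 1, j - 1)
        return total

    return count(0, len(seq) - 1)


if __name__ == "__main__":
    print(count_structures("GAAAC"))
    print(count_structures("GGGAAAUCC"))
```

The decomposition works because a pair (k, j) splits the interval into two independent parts, which is the substructure property the abstract refers to; symmetry breaks the one-to-one match between DP terms and distinguishable structures, hence the need for the group-theoretic correction.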

Citations: 0
Numerical Analysis of Split-Step Backward Euler Method with Truncated Wiener Process for a Stochastic Susceptible-Infected-Susceptible Model.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-10-01 | Epub Date: 2023-10-09 | DOI: 10.1089/cmb.2022.0462
Xiaochen Yang, Zhanwen Yang, Chiping Zhang

This article deals with the numerical positivity, boundedness, convergence, and dynamical behaviors of a stochastic susceptible-infected-susceptible (SIS) model. To guarantee the biological significance of the split-step backward Euler method applied to the stochastic SIS model, numerical positivity and boundedness are investigated using the truncated Wiener process. Motivated by the almost sure boundedness of the exact and numerical solutions, convergence is discussed via the fundamental convergence theorem with a local Lipschitz condition. Moreover, numerical extinction and persistence are initially obtained by an exponential presentation of the stochastic stability function and the strong law of large numbers for martingales, which reproduces the existing theoretical results. Finally, numerical examples are given to validate our numerical results for the stochastic SIS model.
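
The following sketch shows one way a split-step backward Euler step with truncated Wiener increments can be written. The abstract does not state the model equation, so the sketch assumes the commonly used form dI = I(βN − μ − γ − βI) dt + σ I (N − I) dB; the truncation level 0.9/(σN), the parameter values, and the function name are likewise assumptions, not the authors' choices.

```python
import math
import random


def ssbe_sis_path(i0, n_pop, beta, mu, gamma, sigma, h, steps, seed=0):
    """Simulate the assumed stochastic SIS model with a split-step scheme.

    Each step (1) solves the implicit drift equation y = x + h*y*(a - beta*y)
    for its positive root and (2) adds the diffusion term evaluated at y,
    using a Wiener increment clipped to [-bound, bound].
    """
    rng = random.Random(seed)
    a = beta * n_pop - mu - gamma
    bound = 0.9 / (sigma * n_pop)  # assumed truncation level, below 1/(sigma*N)
    x = i0
    path = [x]
    for _ in range(steps):
        # Implicit drift step: h*beta*y^2 + b*y - x = 0 with b = 1 - h*a.
        b = 1.0 - h * a
        disc = math.sqrt(b * b + 4.0 * h * beta * x)
        # Two algebraically equal forms of the positive root; pick the stable one.
        y = 2.0 * x / (b + disc) if b >= 0 else (disc - b) / (2.0 * h * beta)
        # Diffusion step with a truncated Gaussian increment of variance h.
        dw = max(-bound, min(bound, rng.gauss(0.0, math.sqrt(h))))
        x = y + sigma * y * (n_pop - y) * dw
        path.append(x)
    return path


if __name__ == "__main__":
    path = ssbe_sis_path(10.0, 100.0, 0.5, 20.0, 25.0, 0.01, 0.01, 500)
    print(min(path), max(path))
```

Under these assumptions the implicit step maps (0, N) into itself, and an increment bounded by 1/(σN) keeps both y and N − y multiplied by a positive factor, so the simulated path stays strictly between 0 and N, mirroring the positivity and boundedness the abstract emphasises.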

Citations: 0
Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins.
IF 1.7 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2023-10-01 | Epub Date: 2023-09-20 | DOI: 10.1089/cmb.2022.0237
Hongliang Zou, Wanting Yu

Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet-lab experimental approaches are the first choice for identifying PVPs, they are costly and time-consuming. Researchers have therefore turned their attention to computational models, which can speed up related studies. In this study, we propose a novel machine-learning model to identify PVPs. First, 50 different types of physicochemical properties were used to represent protein sequences. Next, two different measures, Pearson's correlation coefficient (PCC) and the maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture high-order correlation information, we applied PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, the chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. This indicates that the proposed computational model may become a powerful predictor for identifying PVPs.
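
The first steps of the pipeline above (encode a sequence by physicochemical properties, then compute pairwise Pearson correlations) might look like the sketch below. How the authors actually build their features is not specified in the abstract, so this is one plausible reading only: the three property scales are small toy tables over a four-letter alphabet, and all names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Toy property scales for illustration only (not the 50 properties used in the paper).
TOY_SCALES = {
    "hydropathy": {"A": 1.8, "K": -3.9, "D": -3.5, "L": 3.8},
    "charge": {"A": 0.0, "K": 1.0, "D": -1.0, "L": 0.0},
    "bulk": {"A": 1.0, "K": 4.0, "D": 2.0, "L": 3.0},
}


def pcc_features(sequence, scales):
    """Map each property onto the sequence, then correlate every pair of profiles."""
    profiles = {name: [scale[aa] for aa in sequence] for name, scale in scales.items()}
    names = sorted(profiles)
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
    }


if __name__ == "__main__":
    for pair, value in pcc_features("AKDLLKDA", TOY_SCALES).items():
        print(pair, round(value, 3))
```

With k properties this yields k(k − 1)/2 correlation features per protein regardless of sequence length, which gives a fixed-size input for the later feature-selection and classification stages.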

噬菌体病毒粒子蛋白(PVPs)在宿主细胞中起着重要作用。PVP的快速准确鉴定有利于相关药物的发现和开发。尽管湿法实验方法是鉴定PVP的首选方法,但它们成本高昂且耗时。因此,研究人员将注意力转向了计算模型,这可以加快相关研究的速度。因此,我们在当前的研究中提出了一种新的机器学习模型来识别PVP。首先,使用50种不同类型的物理化学性质来表示蛋白质序列。其次,采用皮尔逊相关系数(PCC)和最大信息系数(MIC)两种不同的方法提取判别信息。此外,为了捕获高阶相关信息,我们再次使用PCC和MIC。之后,我们采用最小绝对收缩和选择算子算法来选择最优特征子集。最后,将这些选择的特征输入到支持载体机器中,以区分PVP和噬菌体非病毒粒子蛋白。我们在两个不同的数据集上进行了实验,以验证我们提出的方法的有效性。实验结果表明,与最先进的方法相比,性能显著提高。这表明所提出的计算模型可能成为识别PVP的强大预测因子。
Citations: 0
Weighted Selection Probability to Prioritize Susceptible Rare Variants in Multi-Phenotype Association Studies with Application to a Soybean Genetic Data Set.
IF 1.7 Zone 4 (Biology) Q2 Mathematics Pub Date: 2023-10-01 DOI: 10.1089/cmb.2022.0487
Xianglong Liang, Hokeun Sun

Rare variant association studies with multiple traits or diseases have drawn considerable attention, since association signals of rare variants can be boosted if more than one phenotype outcome is associated with the same rare variants. Most existing statistical methods for identifying rare variants associated with multiple phenotypes are based on a group test, in which pre-specified genetic regions are tested one at a time. However, these methods are not designed to locate susceptible rare variants within the genetic region. In this article, we propose a new statistical method to prioritize rare variants within a genetic region when a group test for that region identifies a statistical association with multiple phenotypes. The method computes the weighted selection probability (WSP) of individual rare variants and ranks them from largest to smallest according to their WSP. In simulation studies, we demonstrated that the proposed method outperforms other statistical methods in terms of true positive selection when multiple phenotypes are correlated with each other. We also applied it to our soybean single nucleotide polymorphism (SNP) data with 13 highly correlated amino acids, where we identified some potentially susceptible rare variants on chromosome 19.
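
The ranking step described above can be sketched as follows. The abstract does not say how the WSP is computed, so this sketch assumes one simple form: a weighted average of 0/1 selection indicators collected over repeated resampling runs. The function names, the weights, and the toy selection table are all hypothetical and are not taken from the paper.

```python
def weighted_selection_probability(selections, weights):
    """Weighted fraction of runs in which each variant was selected.

    selections[b][j] is 1 if variant j was selected in run b, else 0.
    weights[b] is a non-negative weight for run b (assumed form).
    """
    total = sum(weights)
    n_variants = len(selections[0])
    return [
        sum(w * row[j] for w, row in zip(weights, selections)) / total
        for j in range(n_variants)
    ]


def rank_variants(names, wsp):
    """Order variants from largest to smallest WSP."""
    return sorted(zip(names, wsp), key=lambda pair: pair[1], reverse=True)


# Toy example (hypothetical): 4 resampling runs, 3 rare variants.
names = ["rv1", "rv2", "rv3"]
selections = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
weights = [0.4, 0.1, 0.2, 0.3]

wsp = weighted_selection_probability(selections, weights)
ranked = rank_variants(names, wsp)
```

With these toy inputs, `rv3` ranks first, followed by `rv1` and then `rv2`.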

Citations: 0