Recombinations, chains and caps: resolving problems with the DCJ-indel model.

Journal: Algorithms for Molecular Biology | Impact factor: 1.7 | Zone 4 (Biology) | JCR Q4 (Biochemical Research Methods) | Pub Date: 2024-02-27 | DOI: 10.1186/s13015-024-00253-7

Leonard Bohnenkämper

Algorithms for Molecular Biology 19(1):8 (2024). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10900646/pdf/

Citations: 0

Abstract

One of the most fundamental problems in genome rearrangement studies is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a given model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model. While the DCJ model cannot deal with some situations that occur in practice, such as duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers. Later, integer linear programming (ILP) solutions for so-called natural genomes, in which each genomic region may occur an arbitrary number of times, were developed, making it possible in theory to solve the distance problem for any pair of genomes. However, some theoretical and practical issues remained unsolved. On the theoretical side, there exist two disparate views of the DCJ-indel model, motivated in the same way but with different conceptualizations that could not be reconciled so far. On the practical side, while ILP solutions for natural genomes typically perform well on telomere-to-telomere resolved genomes, they have been shown in recent years to quickly lose performance on genomes with a large number of contigs or linear chromosomes. This has been linked to a particular technique, namely capping. Simply put, capping circularizes linear chromosomes by concatenating them during solving time, increasing the solution space of the ILP superexponentially. Recently, we introduced a new conceptualization of the DCJ-indel model within the context of another rearrangement problem. In this manuscript, we apply this new conceptualization to the distance problem. In doing so, we uncover the relation between the disparate conceptualizations of the DCJ-indel model. We are also able to derive an ILP solution to the distance problem that does not rely on capping. This solution significantly improves upon the performance of previous solutions on genomes with high numbers of contigs while still solving the problem exactly and remaining competitive in performance otherwise. We demonstrate the performance advantage on simulated genomes and show its practical usefulness in an analysis of 11 Drosophila genomes.
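To make the distance problem concrete, the following is a minimal sketch of the classic DCJ distance for the simplest setting only: both genomes contain exactly the same markers, each exactly once, so no indels or duplications occur. It is not the ILP described in the abstract. It uses the well-known closed form d = N - (C + I/2), where N is the number of markers, C the number of cycles and I the number of odd paths in the adjacency graph. The genome encoding (signed integers, a circularity flag) and the function names `adjacencies` and `dcj_distance` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Extremity = Tuple[int, str]            # (marker id, "t" for tail or "h" for head)
Chromosome = Tuple[List[int], bool]    # (signed markers, is_circular); hypothetical encoding


def adjacencies(genome: List[Chromosome]) -> Dict[Extremity, Optional[Extremity]]:
    """Map each extremity to its neighbour in the genome, or None for a telomere."""
    partner: Dict[Extremity, Optional[Extremity]] = {}
    total = 0
    for genes, circular in genome:
        total += len(genes)
        ends: List[Extremity] = []
        for g in genes:
            tail, head = (abs(g), "t"), (abs(g), "h")
            # A positive marker is read tail -> head, a negative one head -> tail.
            ends.extend((tail, head) if g > 0 else (head, tail))
        for e in ends:
            partner[e] = None
        # Join the right end of each marker with the left end of the next one.
        for i in range(1, len(ends) - 1, 2):
            a, b = ends[i], ends[i + 1]
            partner[a], partner[b] = b, a
        if circular and ends:
            a, b = ends[-1], ends[0]
            partner[a], partner[b] = b, a
    if len(partner) != 2 * total:
        raise ValueError("each marker must occur exactly once per genome")
    return partner


def dcj_distance(genome_a: List[Chromosome], genome_b: List[Chromosome]) -> int:
    """DCJ distance N - (C + I/2) for two genomes with identical marker content."""
    pa, pb = adjacencies(genome_a), adjacencies(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must contain the same markers")
    seen = set()
    cycles = odd_paths = 0
    for start in pa:
        if start in seen:
            continue
        # Explore one component; extremities here correspond to adjacency-graph edges.
        stack, size, closed = [start], 0, True
        seen.add(start)
        while stack:
            x = stack.pop()
            size += 1
            for y in (pa[x], pb[x]):
                if y is None:
                    closed = False      # a telomere: the component is a path
                elif y not in seen:
                    seen.add(y)
                    stack.append(y)
        if closed:
            cycles += 1
        elif size % 2 == 1:
            odd_paths += 1
    return len(pa) // 2 - (cycles + odd_paths // 2)


if __name__ == "__main__":
    a = [([1, 2, 3], False)]
    b = [([1, -2, 3], False)]
    print(dcj_distance(a, b))  # a single inversion separates the two genomes
```

In this restricted setting the distance is computable in linear time and linear chromosomes need no special treatment. The difficulties the abstract describes (indels, repeated markers, capping) arise only once markers may be lost, gained or duplicated, which is where the ILP formulations come in.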



Source journal: Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore: 2.40

Self-citation rate: 10.00%

Review time: >12 weeks

Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.