
Algorithms for Molecular Biology: latest articles

Bi-alignments with affine gaps costs
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2022-05-16 | DOI: 10.1186/s13015-022-00219-7
Peter F. Stadler, S. Will
Citations: 1
Efficient privacy-preserving variable-length substring match for genome sequence.
IF 1.5 | Zone 4, Biology | Q4 Biochemical Research Methods | Pub Date: 2022-04-26 | DOI: 10.1186/s13015-022-00211-1
Yoshiki Nakagawa, Satsuya Ohata, Kana Shimizu

The development of privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches for a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies the FM-index to a secret-sharing scheme. More precisely, we developed an algorithm that achieves a secure table lookup in such a way that [Formula: see text] is computed for a given depth of recursion, where [Formula: see text] is an initial position and V is a vector. We used the secure table lookup for vectors created based on the FM-index. The notable feature of the secure table lookup is that, after the query input, its time, communication, and round complexities do not depend on the table length N. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently of the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence of length 10 million as the database and a query of length 100, and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under a realistic computation/network environment.
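
The abstract does not spell out the secret-sharing scheme, so the following is only a minimal Python sketch of two-party additive secret sharing, a standard building block for protocols of this kind. The modulus and the function names `share`, `reconstruct`, and `add_shared` are illustrative assumptions, not the authors' protocol or API.

```python
import secrets

# Hypothetical ring size; the actual protocol may use a different modulus.
MODULUS = 2 ** 32


def share(value, modulus=MODULUS):
    """Split value into two additive shares that sum to value mod modulus."""
    first = secrets.randbelow(modulus)   # uniformly random mask
    second = (value - first) % modulus   # masked value held by the other party
    return first, second


def reconstruct(first, second, modulus=MODULUS):
    """Recombine two additive shares into the original value."""
    return (first + second) % modulus


def add_shared(a, b, modulus=MODULUS):
    """Add two shared values; each party only adds the shares it holds."""
    return (a[0] + b[0]) % modulus, (a[1] + b[1]) % modulus


if __name__ == "__main__":
    position = share(42)                    # e.g. a secret index into a table
    offset = share(7)
    total = add_shared(position, offset)    # computed without revealing inputs
    print(reconstruct(*total))
```

Each share on its own is a uniformly random number, so neither party learns the secret position. A secure table lookup such as the one described in the abstract needs considerably more machinery than this addition example.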

Citations: 0
Adding hydrogen atoms to molecular models via fragment superimposition
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2022-03-29 | DOI: 10.1186/s13015-022-00215-x
Patrick Kunzmann, Jacob Marcel Anter, K. Hamacher
Citations: 2
Perplexity: evaluating transcript abundance estimation in the absence of ground truth.
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2022-03-25 | DOI: 10.1186/s13015-022-00214-y
Jason Fan, Skylar Chan, Rob Patro

Background: There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins the gene-expression-based analyses routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection on experimental data have been addressed informally at best.

Results: We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models.

Conclusions: Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
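
For orientation, here is a minimal Python sketch of perplexity as it is commonly defined for language models: the exponential of the negative mean log-probability of held-out items. The function name and the assumption that each held-out fragment already has a probability under the fitted model are illustrative. The abstract says the authors extend the metric to handle RNA-seq corner cases, and that extension is not reproduced here.

```python
import math


def perplexity(fragment_probabilities):
    """Generic perplexity: exp of the negative mean log-probability.

    fragment_probabilities holds one model-assigned probability per
    held-out fragment (an assumed input format, not the paper's).
    """
    if not fragment_probabilities:
        raise ValueError("need at least one held-out fragment")
    # Naive handling: any zero-probability fragment makes perplexity infinite.
    if any(p <= 0.0 for p in fragment_probabilities):
        return math.inf
    mean_log = sum(math.log(p) for p in fragment_probabilities)
    mean_log /= len(fragment_probabilities)
    return math.exp(-mean_log)


if __name__ == "__main__":
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    print(perplexity([0.5, 0.0]))
```

Lower values mean the model assigns higher probability to the held-out fragments. The infinite value for a zero-probability fragment shows why a naive definition needs special care on real data.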

Citations: 0
Space-efficient representation of genomic k-mer count tables.
IF 1.5 | Zone 4, Biology | Q4 Biochemical Research Methods | Pub Date: 2022-03-21 | DOI: 10.1186/s13015-022-00212-0
Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov

Motivation: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not, in general, support efficient random-access queries.

Results: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations that break the empirical entropy lower bound for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy for large enough k.
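
As a rough illustration of the quantities mentioned above, this Python sketch builds a small k-mer count table and computes the empirical zero-order entropy of its counts, the quantity that CSF space is described as proportional to. The function names are hypothetical, and the sketch ignores canonical k-mers, Bloom filters, and minimizer bucketing.

```python
import math
from collections import Counter


def count_kmers(sequence, k):
    """Return a table mapping each k-mer in sequence to its count."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def zero_order_entropy(values):
    """Empirical zero-order entropy (bits per value) of a list of values."""
    frequencies = Counter(values)
    total = sum(frequencies.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in frequencies.values()
    )


if __name__ == "__main__":
    table = count_kmers("ACGTACGT", 4)
    counts = list(table.values())
    bits_per_kmer = zero_order_entropy(counts)
    # Entropy times the number of keys estimates the space for the counts alone.
    print(table["ACGT"], bits_per_kmer * len(table))
```

When almost every k-mer has the same count, as in the skewed distributions described above, the entropy falls well below one bit per k-mer.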

Citations: 0
Parsimonious Clone Tree Integration in cancer
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2022-03-14 | DOI: 10.1186/s13015-022-00209-9
P. Sashittal, Simone Zaccaria, M. El-Kebir
Citations: 5
Efficiently sparse listing of classes of optimal cophylogeny reconciliations.
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2022-02-15 | DOI: 10.1186/s13015-022-00206-y
Yishu Wang, Arnaud Mary, Marie-France Sagot, Blerina Sinaimeri

Background: Cophylogeny reconciliation is a powerful method for analyzing host-parasite (or host-symbiont) co-evolution. It models co-evolution as an optimization problem where the set of all optimal solutions may represent different biological scenarios which thus need to be analyzed separately. Despite the significant research done in the area, few approaches have addressed the problem of helping the biologist deal with the often huge space of optimal solutions.

Results: In this paper, we propose a new approach to tackle this problem. We introduce three different criteria under which two solutions may be considered biologically equivalent, and then we propose polynomial-delay algorithms that enumerate only one representative per equivalence class (without listing all the solutions).

Conclusions: Our results are of both theoretical and practical importance. Indeed, as shown by the experiments, we are able to significantly reduce the space of optimal solutions while still maintaining important biological information about the whole space.

Citations: 0
A new 1.375-approximation algorithm for sorting by transpositions.
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2022-01-15 | DOI: 10.1186/s13015-022-00205-z
Luiz Augusto G Silva, Luis Antonio B Kowada, Noraí Romeu Rocco, Maria Emília M T Walter

Background: SORTING BY TRANSPOSITIONS (SBT) is a classical problem in genome rearrangements. In 2012, SBT was proven to be [Formula: see text]-hard, and the best approximation algorithm, with a 1.375 ratio, was proposed in 2006 by Elias and Hartman (EH algorithm). Their algorithm employs simplification, a technique used to transform an input permutation [Formula: see text] into a simple permutation [Formula: see text], presumably easier to handle. The permutation [Formula: see text] is obtained by inserting new symbols into [Formula: see text] in such a way that the lower bound of the transposition distance of [Formula: see text] is kept on [Formula: see text]. The simplification is guaranteed to keep the lower bound, not the transposition distance. A sequence of operations sorting [Formula: see text] can be mimicked to sort [Formula: see text].

Results and conclusions: First, using an algebraic approach, we propose a new upper bound for the transposition distance, which holds for all [Formula: see text]. Next, we identify a problem in the EH algorithm: depending on how the input permutation is simplified, it can require one extra transposition above the 1.375-approximation ratio. Motivated by this, we propose a new approximation algorithm for SBT that ensures the 1.375-approximation ratio for all [Formula: see text]. We implemented our algorithm and EH's. Regarding the implementation of the EH algorithm, two other issues were identified and needed to be fixed. We tested both algorithms against all permutations of size n, [Formula: see text]. The results show that the EH algorithm exceeds the approximation ratio of 1.375 for permutations with a size greater than 7. The percentage of computed distances that are equal to the transposition distance is also compared with the percentages reported for other algorithms in the literature. Finally, we investigate the performance of both implementations on longer permutations of maximum length 500. From the experiments, we conclude that the maximum and average distances computed by our algorithm are slightly better than those computed by the EH algorithm, and that the running times of both algorithms are similar, despite the higher time complexity of our algorithm.
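
To make the basic operation concrete, the following Python sketch applies a transposition (an exchange of two adjacent blocks) to a permutation and computes the classical breakpoint lower bound on the transposition distance. The function names and the 0-based index convention are assumptions made for illustration. The breakpoint bound is a simpler bound than the ones the paper works with, and this is not the EH algorithm or the authors' algorithm.

```python
import math


def transpose(perm, i, j, k):
    """Swap the adjacent blocks perm[i:j] and perm[j:k] (0-based, i < j < k)."""
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("require 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def breakpoints(perm):
    """Count adjacent pairs (a, b) with b != a + 1 in 0, perm, n + 1."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b != a + 1)


def breakpoint_lower_bound(perm):
    """A transposition changes three adjacencies, so it removes at most
    three breakpoints; hence distance >= ceil(breakpoints / 3)."""
    return math.ceil(breakpoints(perm) / 3)


if __name__ == "__main__":
    perm = [3, 4, 1, 2]
    print(breakpoints(perm), breakpoint_lower_bound(perm))
    print(transpose(perm, 0, 2, 4))
```

In this example the bound is tight: the permutation has three breakpoints and a single transposition sorts it.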

Citations: 3
An optimized FM-index library for nucleotide and amino acid search.
IF 1 | Zone 4, Biology | Q2 Mathematics | Pub Date: 2021-12-31 | DOI: 10.1186/s13015-021-00204-6
Tim Anderson, Travis J Wheeler

Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library.

Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is [Formula: see text]2-4x faster than SeqAn3 for nucleotide search, and [Formula: see text]2-6x faster for amino acid search; it is also [Formula: see text]4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage.

Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
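
The backward-search process mentioned above can be sketched in plain Python as follows. This is a textbook-style toy, not the AWFM-index API: it builds the Burrows-Wheeler transform by sorting rotations and stores the occurrence function as plain prefix-count tables rather than AVX2 bit-vectors. All names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform of text + '$' via sorted rotations (toy)."""
    text += "$"  # sentinel, assumed absent from text and smaller than letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_index(text):
    """Return (c_table, occ, n) for backward search over text."""
    last = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of characters in the text smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = occurrences of ch in last[:i]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for symbol in occ:
            occ[symbol][i + 1] = occ[symbol][i] + (symbol == ch)
    return c_table, occ, len(last)


def count(pattern, index):
    """Count occurrences of pattern, consuming it from right to left."""
    c_table, occ, n = index
    low, high = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        low = c_table[ch] + occ[ch][low]
        high = c_table[ch] + occ[ch][high]
        if low >= high:
            return 0
    return high - low


if __name__ == "__main__":
    index = build_index("ACGTACGT")
    print(count("ACG", index), count("GTA", index), count("TT", index))
```

The search loop runs once per pattern character and never scans the text, which is why query time does not depend on the database length once the index is built.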

Citations: 3
An improved approximation algorithm for the reversal and transposition distance considering gene order and intergenic sizes.
IF 1 Zone 4 Biology Q2 Mathematics Pub Date: 2021-12-29 DOI: 10.1186/s13015-021-00203-7
Klairton L Brito, Andre R Oliveira, Alexsandro O Alexandrino, Ulisses Dias, Zanoni Dias

Background: In comparative genomics, one goal is to estimate a sequence of genetic changes capable of transforming one genome into another. Genome rearrangement events are mutations that can alter the genetic content or the arrangement of elements in a genome. Reversal and transposition are two of the most studied genome rearrangement events: a reversal inverts a segment of a genome, while a transposition swaps two consecutive segments. Initial studies in the area considered only the order of the genes. Recent works have incorporated other genetic information into the model, in particular the sizes of intergenic regions, which are the structures between each pair of consecutive genes and at the extremities of a linear genome.

Results and conclusions: In this work, we investigate the SORTING BY INTERGENIC REVERSALS AND TRANSPOSITIONS problem on genomes sharing the same set of genes, considering both the case where the orientation of genes is known and the case where it is unknown. We also explore a variant of the problem that generalizes the transposition event. We present an approximation algorithm that guarantees an approximation factor of 4 for both cases when considering the reversal and transposition (classic definition) events, an improvement over the 4.5-approximation previously known for the scenario where the orientation of the genes is unknown. We also present a 3-approximation algorithm that incorporates the generalized transposition event, and we propose a greedy strategy to improve the performance of the algorithms. Practical tests on simulated data indicated that the algorithms, in both cases, tend to perform better than the best-known algorithms for the problem. Lastly, we conducted experiments using real genomes to demonstrate the applicability of the algorithms.
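
The two rearrangement events defined in the background above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it models a genome only as a list of gene labels and ignores intergenic region sizes, which the paper's model also tracks. The function names, the 0-based index conventions, and the `signed` flag (negating labels to represent a flipped orientation) are assumptions made for illustration.

```python
def reversal(genome, i, j, signed=False):
    """Invert the segment genome[i..j] (inclusive, 0-based).

    If signed is True, gene orientation is tracked as the sign of each
    label, so every gene in the inverted segment also changes sign.
    """
    if not (0 <= i <= j < len(genome)):
        raise ValueError("need 0 <= i <= j < len(genome)")
    segment = genome[i:j + 1][::-1]
    if signed:
        segment = [-g for g in segment]
    return genome[:i] + segment + genome[j + 1:]


def transposition(genome, i, j, k):
    """Swap the two consecutive segments genome[i:j] and genome[j:k]."""
    if not (0 <= i < j < k <= len(genome)):
        raise ValueError("need 0 <= i < j < k <= len(genome)")
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


genome = [1, 2, 3, 4, 5, 6]
after_reversal = reversal(genome, 1, 3)                      # segment 2 3 4 is inverted
after_signed_reversal = reversal(genome, 1, 3, signed=True)  # inverted and sign-flipped
after_transposition = transposition(genome, 1, 3, 5)         # segments 2 3 and 4 5 trade places
```

Sorting a genome under this simplified model means finding a short sequence of such operations that turns one gene order into another; the approximation factors quoted in the abstract bound how far the algorithms' sequence lengths can be from the optimum.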

Citations: 4