dwMLCS: An Efficient MLCS Algorithm Based on Dynamic and Weighted Directed Acyclic Graph

IF 3.6; Zone 3 (Biology); Q2 (Biochemical Research Methods); IEEE/ACM Transactions on Computational Biology and Bioinformatics; Pub Date: 2024-07-22; DOI: 10.1109/TCBB.2024.3431558

Changyong Yu;Dekuan Gao;Xu Guo;Haitao Ma;Yuhai Zhao;Guoren Wang

IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 21, no. 6, pp. 1987-1999. https://ieeexplore.ieee.org/document/10606065/

Citations: 0

Abstract

The multiple longest common subsequence (MLCS) problem, that is, finding the longest common subsequence of multiple sequences, is computationally intensive and has significant applications in fields such as text comparison, pattern recognition, and gene diagnosis. Dominant-point-based MLCS algorithms are currently popular and have been studied extensively. In general, they construct a directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have focused on decreasing the model size and reducing redundant computation, including 1) hash methods for eliminating duplicate nodes, 2) dynamic structures that support a smaller DAG, and 3) path pruning strategies. However, these algorithms remain limited on large-scale MLCS problems because 1) the dynamic structures are too time-consuming to maintain and 2) path pruning relies heavily on the tightness of the lower and upper bounds of the MLCS. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models. The first is a dynamic DAG model that is both space- and time-efficient and significantly decreases the size of the DAG. The second is a weighted DAG model with new successor strategies, with which we design an algorithm for finding a tighter lower bound of the MLCS. Path pruning is then conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method for improving the efficiency of the path pruning strategy. Experimental results demonstrate that the proposed models and algorithms are more effective and efficient than state-of-the-art algorithms.
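To make the DAG formulation in the abstract concrete, the following is a minimal Python sketch of a generic baseline that follows matching points through successor tables. It is not the dwMLCS algorithm: it has no dynamic DAG, no weighting, and no bound-based pruning. The function names `build_successor_tables` and `mlcs` are hypothetical and chosen for illustration, and memoization stands in for the hash-based elimination of duplicate nodes. It assumes at least one input sequence and inputs short enough for Python's recursion limit.

```python
from functools import lru_cache


def build_successor_tables(seqs, alphabet):
    """For each sequence s, table[c][i] is the smallest 1-based position
    p > i with s[p - 1] == c, or 0 if no such position exists."""
    tables = []
    for s in seqs:
        table = {c: [0] * (len(s) + 1) for c in alphabet}
        for c in alphabet:
            nxt = 0
            # Scan right to left so nxt always holds the nearest match.
            for i in range(len(s), 0, -1):
                if s[i - 1] == c:
                    nxt = i
                table[c][i - 1] = nxt
        tables.append(table)
    return tables


def mlcs(seqs):
    """Return one longest common subsequence of all strings in seqs."""
    # Only characters present in every sequence can appear in the MLCS.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    tables = build_successor_tables(seqs, alphabet)

    @lru_cache(maxsize=None)  # each matching point is expanded only once
    def longest_from(point):
        best = ""
        for c in alphabet:
            # Successor of `point` for character c: the next match of c
            # in every sequence (one edge of the implicit DAG).
            nxt = tuple(t[c][p] for t, p in zip(tables, point))
            if 0 in nxt:
                continue  # c does not occur again in some sequence
            cand = c + longest_from(nxt)
            if len(cand) > len(best):
                best = cand
        return best

    # The virtual source node sits before the start of every sequence.
    return longest_from((0,) * len(seqs))


print(mlcs(["ACGT", "AGT", "CAGT"]))  # AGT
```

Without pruning, this baseline may visit every reachable matching point, and the number of such points grows quickly with the number and length of the sequences. That growth is the cost the abstract's dynamic DAG and bound-based pruning are meant to reduce.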


The source code of dwMLCS is available at https://github.com/BioLab310/dwMLCS.


Source journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics (Engineering & Technology; Computer Science: Interdisciplinary Applications)

CiteScore: 7.50
Self-citation rate: 6.70%
Articles published: 479
Review time: 3 months

Journal description: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical, and computational methods that are central to bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results obtained from the use of these methods, programs, and databases; and the emerging field of systems biology, where many forms of data are used to create a computer-based model of a complex biological system.