dwMLCS: An Efficient MLCS Algorithm Based on Dynamic and Weighted Directed Acyclic Graph

IF 3.6 · Zone 3 (Biology) · JCR Q2, Biochemical Research Methods · IEEE/ACM Transactions on Computational Biology and Bioinformatics · Pub Date: 2024-07-22 · DOI: 10.1109/TCBB.2024.3431558
Changyong Yu;Dekuan Gao;Xu Guo;Haitao Ma;Yuhai Zhao;Guoren Wang
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 21, no. 6, pp. 1987-1999, 2024. https://ieeexplore.ieee.org/document/10606065/
Citations: 0

Abstract

Finding the longest common subsequence of multiple sequences (the MLCS problem) is computationally intensive and has significant applications in fields such as text comparison, pattern recognition, and gene diagnosis. Dominant-point-based MLCS algorithms are currently popular and extensively studied. In general, they construct a directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have been made to decrease model size and reduce redundant computation, including 1) hash methods for eliminating duplicated nodes, 2) dynamic structures that support a smaller DAG, and 3) path pruning strategies. However, these algorithms remain limited on large-scale MLCS problems because 1) the dynamic structures are too time-consuming to maintain and 2) path pruning relies heavily on the tightness of the lower and upper bounds of the MLCS. As a result, the large-scale MLCS problem remains a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models. The first is a dynamic DAG model that is both space and time efficient and significantly decreases the size of the DAG. The second is a weighted DAG model with new successor strategies, with which we design an algorithm for finding a tighter lower bound of the MLCS. Path pruning is then conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method that improves the efficiency of the path pruning strategy. The experimental results demonstrate that the proposed models and algorithms are more effective and efficient than state-of-the-art algorithms.
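To make the general dominant-point formulation concrete, the following is a minimal sketch of the baseline idea described in the abstract: matching points form a DAG, duplicate nodes are removed by hashing, and the MLCS is a longest path from the origin. It is not the dwMLCS algorithm. The names `successor_table`, `mlcs`, and `lower_bound` are hypothetical, and the upper bound used for pruning (the shortest remaining suffix) is a deliberately trivial stand-in for the tighter bounds the paper proposes.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, ...]  # one 1-based position per sequence; 0 means "before the start"


def successor_table(seq: str, alphabet: List[str]) -> Dict[str, List[Optional[int]]]:
    """table[c][i] is the smallest 1-based position j > i with seq[j-1] == c, or None."""
    table: Dict[str, List[Optional[int]]] = {}
    for c in alphabet:
        row: List[Optional[int]] = [None] * (len(seq) + 1)
        nxt: Optional[int] = None
        for i in range(len(seq), -1, -1):
            row[i] = nxt
            if i >= 1 and seq[i - 1] == c:
                nxt = i
        table[c] = row
    return table


def mlcs(seqs: List[str], lower_bound: int = 0) -> str:
    """Return one longest common subsequence of seqs (at least one sequence expected).

    lower_bound is an illustrative pruning knob: it must not exceed the true
    MLCS length, otherwise the result may be shorter than optimal.
    """
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    tables = [successor_table(s, alphabet) for s in seqs]
    origin: Point = (0,) * len(seqs)
    # levels[t] maps each point reachable by a path of t edges to (parent, char).
    levels: List[Dict[Point, Optional[Tuple[Point, str]]]] = [{origin: None}]
    while True:
        depth = len(levels)  # path length of the points generated in this round
        next_level: Dict[Point, Optional[Tuple[Point, str]]] = {}
        for p in levels[-1]:
            for c in alphabet:
                coords = [tables[k][c][p[k]] for k in range(len(seqs))]
                if any(j is None for j in coords):
                    continue  # character c no longer occurs in some sequence
                q = tuple(coords)
                if q in next_level:
                    continue  # hash-based elimination of duplicated nodes
                # Trivial upper bound: path so far plus the shortest remaining suffix.
                upper = depth + min(len(s) - j for s, j in zip(seqs, q))
                if upper < lower_bound:
                    continue  # path pruning: q cannot reach the lower bound
                next_level[q] = (p, c)
        if not next_level:
            break
        levels.append(next_level)
    # Backtrack from any point in the deepest level to recover one MLCS.
    t = len(levels) - 1
    p = next(iter(levels[t]))
    chars: List[str] = []
    while t > 0:
        parent, c = levels[t][p]
        chars.append(c)
        p, t = parent, t - 1
    return "".join(reversed(chars))
```

Calling `mlcs(["abcbdab", "bdcaba"])` returns one common subsequence of maximal length. Passing a valid `lower_bound` only shrinks the explored levels, which is the effect the abstract attributes to tighter bounds.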
The source code of dwMLCS is available at https://github.com/BioLab310/dwMLCS.
Source journal: CiteScore 7.50; self-citation rate 6.70%; articles published: 479; review time: 3 months
Journal description: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results that are obtained from the use of these methods, programs and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.