A Linear Time Algorithm that Infers Hidden Strings from Their Concatenations

IPSJ Transactions on Bioinformatics (Q3, Biochemistry, Genetics and Molecular Biology) | Pub Date: 2008-01-01 | DOI: 10.2197/IPSJTBIO.1.13
Tomohiro Yasuda

Abstract

Let T be a set of hidden strings and S be a set of their concatenations. We address the problem of inferring T from S. Any formalization of the problem as an optimization problem would be computationally hard, because it is NP-complete even to determine whether there exists a T smaller than S, and because it is also NP-complete to partition just two strings into the smallest common collection of substrings. In this paper, we devise a new algorithm that infers T by finding common substrings in S and splitting them. This algorithm is scalable and runs in O(L) time regardless of the cardinality of S, where L is the sum of the lengths of all strings in S. In computational experiments, 40,000 random concatenations of randomly generated strings were successfully decomposed, and the effectiveness of our method on this problem was compared with that of multiple sequence alignment programs. We also present the result of a preliminary experiment on the transcriptome of Homo sapiens and describe problems that arise in applications where real large-scale cDNA sequences are analyzed.
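
To make the problem statement concrete, here is a minimal Python sketch of the relationship between T and S. It is not the paper's O(L) inference algorithm, which the abstract does not spell out. It only checks whether a given candidate set T explains every string in S, using a standard dynamic program whose running time is not linear. The function names `decompose` and `explains` and the example strings are hypothetical, chosen for illustration.

```python
from typing import Iterable, List, Optional


def decompose(s: str, candidates: Iterable[str]) -> Optional[List[str]]:
    """Return one way to write s as a concatenation of candidates, or None."""
    parts = [c for c in set(candidates) if c]  # ignore empty strings
    n = len(s)
    # reachable[i] is True when the prefix s[:i] is a concatenation of parts.
    reachable = [False] * (n + 1)
    reachable[0] = True
    # back[i] remembers the piece that ends one decomposition of s[:i].
    back: List[Optional[str]] = [None] * (n + 1)
    for i in range(n):
        if not reachable[i]:
            continue
        for p in parts:
            j = i + len(p)
            if j <= n and not reachable[j] and s.startswith(p, i):
                reachable[j] = True
                back[j] = p
    if not reachable[n]:
        return None
    # Walk the back pointers from the end of s to recover the pieces.
    pieces: List[str] = []
    i = n
    while i > 0:
        piece = back[i]
        pieces.append(piece)
        i -= len(piece)
    pieces.reverse()
    return pieces


def explains(T: Iterable[str], S: Iterable[str]) -> bool:
    """True if every string in S is a concatenation of strings in T."""
    T = list(T)
    return all(decompose(s, T) is not None for s in S)


if __name__ == "__main__":
    T = {"ACG", "TT", "GGA"}              # hypothetical hidden strings
    S = {"ACGTT", "TTGGAACG", "GGA"}      # hypothetical concatenations
    print(explains(T, S))
    print(decompose("TTGGAACG", T))
```

Verifying a candidate T in this way is the easy direction. The hard part, which the paper addresses, is finding a good T when only S is known.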