Merging Sorted Lists of Similar Strings

E. Myers
Annual Symposium on Combinatorial Pattern Matching. Published 2022-08-19. DOI: 10.48550/arXiv.2208.09351 (https://doi.org/10.48550/arXiv.2208.09351)

Abstract

Merging $T$ sorted, non-redundant lists containing a total of $M$ elements into a single sorted, non-redundant result of size $N \ge M/T$ is a classic problem, typically solved in practice in $O(M \log T)$ time with a priority queue, the most basic of which is the simple *heap*. We revisit this problem for the case where the list elements are *strings* and the lists contain many *identical or nearly identical elements*. By keeping simple auxiliary information with each heap node, we devise an $O(M \log T+S)$ worst-case method that performs no more character comparisons than $S$, the sum of the lengths of all the strings, and another $O(M \log (T/\bar e)+S)$ method that becomes progressively more efficient as a function of the fraction of equal elements between input lists, $\bar e = M/N$, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternative formulation based on a trie.
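
For orientation, the classic heap-based merge that the abstract takes as its starting point can be sketched as follows. This is only the $O(M \log T)$ baseline, not the paper's improved methods: it keeps no auxiliary information in the heap nodes, so every heap comparison rescans the two strings from their first character. The function name `merge_sorted_unique` and the sample lists are illustrative assumptions, not taken from the paper.

```python
import heapq
from typing import List, Sequence


def merge_sorted_unique(lists: Sequence[Sequence[str]]) -> List[str]:
    """Baseline heap merge of T sorted, duplicate-free string lists.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    # Each heap entry is (string, index of its list, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out: List[str] = []
    while heap:
        s, i, pos = heapq.heappop(heap)
        # Emit s only if it differs from the last string emitted,
        # so the result stays non-redundant.
        if not out or out[-1] != s:
            out.append(s)
        # Refill the heap with the next element of the same list, if any.
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return out


# Made-up sample input: T = 3 sorted lists with many shared elements.
lists = [
    ["acgt", "acgtt", "ggca"],
    ["acgt", "ggca", "ttag"],
    ["acgtt", "ggca"],
]
merged = merge_sorted_unique(lists)
M = sum(len(lst) for lst in lists)  # total number of input elements
N = len(merged)                     # number of distinct output elements
e_bar = M / N                       # the abstract's \bar e = M/N
```

On this sample input, $M = 8$ and $N = 4$, so $\bar e = 2$. Each pop and push costs $O(\log T)$ string comparisons, and the cost of those comparisons on long, nearly identical strings is what the abstract's bound on character comparisons addresses.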