后缀树的并行构造与全最近邻小值问题

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI:10.1109/IPDPS.2017.62

P. Flick, S. Aluru

{"title":"后缀树的并行构造与全最近邻小值问题","authors":"P. Flick, S. Aluru","doi":"10.1109/IPDPS.2017.62","DOIUrl":null,"url":null,"abstract":"A Suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values(ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p +p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem\",\"authors\":\"P. Flick, S. Aluru\",\"doi\":\"10.1109/IPDPS.2017.62\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A Suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values(ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p +p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.\",\"PeriodicalId\":209524,\"journal\":{\"name\":\"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS.2017.62\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2017.62","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

后缀树是一种基本的、通用的字符串数据结构，经常用于重要的应用领域，如文本处理、信息检索和计算生物学。从序列上看，后缀树的构建需要线性时间，并且只有PRAM模型才存在最优并行算法。最近的工作主要针对低核数共享内存实现，但实现了次优复杂度，而先前的分布式内存并行算法具有二次最坏复杂度。通过求解全最近邻最小值(ANSV)问题，可以从后缀数组和最长公共前缀(LCP)数组构建后缀树。本文提出了一种广义的ANSV问题，并提出了一种分布式内存并行算法，在O(n/p +p)时间内求解该问题。我们的算法最大限度地减少了总体和每个节点的通信量。在此基础上，我们提出了一种用于构建后缀树的分布式表示的并行算法，与以前的分布式内存算法相比，它既具有优越的理论复杂性，又具有更好的实际性能。我们演示了在1024个Intel Xeon内核上构建人类基因组的后缀树和LCP阵列，用时不到2秒。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem

A Suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values(ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p +p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

自引率

0.00%

发文量