SCIPIS: Scalable and concurrent persistent indexing and search in high-end computing systems

Impact factor: 3.4 | Zone 3, Computer Science | JCR Q1, Computer Science, Theory & Methods | Journal of Parallel and Distributed Computing | Pub Date: 2024-07-01 | Epub Date: 2024-03-25 | DOI: 10.1016/j.jpdc.2024.104878

Alexandru Iulian Orhean, Anna Giannakou, Lavanya Ramakrishnan, Kyle Chard, Boris Glavic, Ioan Raicu

Journal of Parallel and Distributed Computing, volume 189, Article 104878. Full text: https://www.sciencedirect.com/science/article/pii/S074373152400042X

Citations: 0

Abstract

While it is now routine to search for data on a personal computer or discover data online, there is no equivalent method for discovering data on the large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which has to deal with a larger number of relatively small files, HPC applications also need efficient indexing of large files. We propose SCIPIS, an indexing and search framework that can exploit the properties of modern high-end computing systems with many-core architectures, multiple NUMA nodes, and multiple NVMe storage devices. SCIPIS supports building and searching persistent TF-IDF indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalability and performance of indexing in three ways: by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in memory that can be persisted in long sequential writes, and by avoiding communication between the indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata) on systems with configurations of up to 192 cores, 768 GiB of RAM, 8 NUMA nodes, and 16 NVMe drives, and achieved up to 29x better indexing performance than Apache Lucene while maintaining similar search latency.
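To make the abstract's ideas concrete, the following is a minimal sketch of a TF-IDF inverted index built by workers that share no state while indexing. It is not the SCIPIS implementation: the function names (`index_partition`, `build_index`, `search`), the whitespace tokenizer, the raw-count TF and `log(N/df)` IDF weighting, and the sample documents are all assumptions chosen for illustration. Persistence is omitted, and Python threads are used only to show the share-nothing structure, not to demonstrate performance.

```python
import math
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_partition(docs):
    """Build a partial inverted index for one partition: term -> {doc_id: count}."""
    postings = defaultdict(dict)
    for doc_id, text in docs:
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term, count in Counter(text.lower().split()).items():
            postings[term][doc_id] = count
    return postings


def build_index(partitions):
    """Index each partition in its own worker, then merge the partial indexes.

    Workers never exchange data while indexing; the only shared step is the
    merge that runs after all of them have finished.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(index_partition, partitions))
    merged = defaultdict(dict)
    for partial in partials:
        for term, docs in partial.items():
            merged[term].update(docs)
    return merged


def search(index, num_docs, query):
    """Rank documents by the sum of tf * log(N / df) over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        docs = index.get(term, {})
        if not docs:
            continue
        idf = math.log(num_docs / len(docs))
        for doc_id, tf in docs.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative data only: two partitions holding three small "log files".
partitions = [
    [("a.log", "error disk full"), ("b.log", "disk ok")],
    [("c.log", "error error timeout")],
]
index = build_index(partitions)
for doc_id, score in search(index, 3, "error"):
    print(doc_id, round(score, 3))
```

Because each document belongs to exactly one partition, the partial posting lists never overlap on a document id, which is what lets the merge be a plain dictionary update.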


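Source journal

Journal of Parallel and Distributed Computing (Engineering and Technology, Computer Science: Theory & Methods)

CiteScore: 10.30

Self-citation rate: 2.60%

Articles published: 172

Review time: 12 months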

Journal description: This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of our targeted systems.