SCIPIS:高端计算系统中的可扩展并发持续索引和搜索

IF 3.4 3区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Journal of Parallel and Distributed Computing Pub Date : 2024-03-25 DOI:10.1016/j.jpdc.2024.104878
Alexandru Iulian Orhean , Anna Giannakou , Lavanya Ramakrishnan , Kyle Chard , Boris Glavic , Ioan Raicu
{"title":"SCIPIS:高端计算系统中的可扩展并发持续索引和搜索","authors":"Alexandru Iulian Orhean ,&nbsp;Anna Giannakou ,&nbsp;Lavanya Ramakrishnan ,&nbsp;Kyle Chard ,&nbsp;Boris Glavic ,&nbsp;Ioan Raicu","doi":"10.1016/j.jpdc.2024.104878","DOIUrl":null,"url":null,"abstract":"<div><p>While it is now routine to search for data on a personal computer or discover data online, there is no such equivalent method for discovering data on large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which has to deal with a larger number of relatively small files, in HPC applications there is a need to also support efficient indexing of large files. We propose SCIPIS, an indexing and search framework, that can exploit the properties of modern high-end computing systems, with many-core architectures, multiple NUMA nodes and multiple NVMe storage devices. SCIPIS supports building and searching TFIDF persistent indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalability and performance of indexing by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in-memory that can be persisted in long sequential writes, and by avoiding communication between indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata), on systems with configurations up to 192-cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieved up to 29x better indexing while maintaining similar search latency when compared to Apache Lucene.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104878"},"PeriodicalIF":3.4000,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SCIPIS: Scalable and concurrent persistent indexing and search in high-end computing systems\",\"authors\":\"Alexandru Iulian Orhean ,&nbsp;Anna Giannakou ,&nbsp;Lavanya Ramakrishnan ,&nbsp;Kyle Chard ,&nbsp;Boris Glavic ,&nbsp;Ioan Raicu\",\"doi\":\"10.1016/j.jpdc.2024.104878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>While it is now routine to search for data on a personal computer or discover data online, there is no such equivalent method for discovering data on large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which has to deal with a larger number of relatively small files, in HPC applications there is a need to also support efficient indexing of large files. We propose SCIPIS, an indexing and search framework, that can exploit the properties of modern high-end computing systems, with many-core architectures, multiple NUMA nodes and multiple NVMe storage devices. SCIPIS supports building and searching TFIDF persistent indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalability and performance of indexing by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in-memory that can be persisted in long sequential writes, and by avoiding communication between indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata), on systems with configurations up to 192-cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieved up to 29x better indexing while maintaining similar search latency when compared to Apache Lucene.</p></div>\",\"PeriodicalId\":54775,\"journal\":{\"name\":\"Journal of Parallel and Distributed Computing\",\"volume\":\"189 \",\"pages\":\"Article 104878\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Parallel and Distributed Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S074373152400042X\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S074373152400042X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0

摘要

在个人电脑上搜索数据或在线发现数据现在已是家常便饭,但在大型并行和分布式文件系统上发现数据却没有类似的方法,这些系统通常部署在高性能计算系统上。与必须处理大量相对较小文件的网络搜索不同,在高性能计算应用中,还需要支持高效的大文件索引。我们提出的 SCIPIS 是一个索引和搜索框架,可以利用多核架构、多 NUMA 节点和多 NVMe 存储设备等现代高端计算系统的特性。SCIPIS 支持构建和搜索 TFIDF 持久性索引,其性能比最先进的方法高出几个数量级。我们通过以下方法实现了索引的可扩展性和性能:将索引过程分解为可独立优化的单独组件;在内存中构建磁盘友好型数据结构(可在长时间顺序写入中持久化);避免索引线程之间的通信(这些线程在大型文件集合上协作构建索引)。我们使用三种类型的数据集(日志、科学数据和元数据)对 SCIPIS 进行了评估,系统配置高达 192 核、768GB 内存、8 个 NUMA 节点和多达 16 个 NVMe 驱动器,与 Apache Lucene 相比,索引效果提高了 29 倍,同时保持了类似的搜索延迟。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
SCIPIS: Scalable and concurrent persistent indexing and search in high-end computing systems

While it is now routine to search for data on a personal computer or discover data online, there is no such equivalent method for discovering data on large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which has to deal with a larger number of relatively small files, in HPC applications there is a need to also support efficient indexing of large files. We propose SCIPIS, an indexing and search framework, that can exploit the properties of modern high-end computing systems, with many-core architectures, multiple NUMA nodes and multiple NVMe storage devices. SCIPIS supports building and searching TFIDF persistent indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalability and performance of indexing by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in-memory that can be persisted in long sequential writes, and by avoiding communication between indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata), on systems with configurations up to 192-cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieved up to 29x better indexing while maintaining similar search latency when compared to Apache Lucene.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Journal of Parallel and Distributed Computing
Journal of Parallel and Distributed Computing 工程技术-计算机:理论方法
CiteScore
10.30
自引率
2.60%
发文量
172
审稿时长
12 months
期刊介绍: This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics; again covering the full range from the design to the use of our targeted systems.
期刊最新文献
Content delivery network solutions for the CMS experiment: The evolution towards HL-LHC An efficient conference key agreement protocol suited for resource constrained devices Enabling semi-supervised learning in intrusion detection systems Fault-tolerance in biswapped multiprocessor interconnection networks Editorial Board
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1