Tarazu: An Adaptive End-to-End I/O Load Balancing Framework for Large-Scale Parallel File Systems
Arnab K. Paul, Sarah Neuwirth, Bharti Wadhwa, Feiyi Wang, Sarp Oral, Ali R. Butt
The imbalanced I/O load on large parallel file systems affects the parallel I/O performance of high-performance computing (HPC) applications. One of the main reasons for I/O imbalance is the lack of a global view of system-wide resource consumption. While approaches to address the problem already exist, the diversity of HPC workloads combined with different file striping patterns prevents their widespread adoption. In addition, load balancing techniques should be transparent to client applications. To address these issues, we propose Tarazu, an end-to-end control plane where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load statistics for global data placement on distributed storage servers, while our design model employs trace-based optimization techniques to minimize latency for I/O load requests between clients and servers and to handle multiple striping patterns in files. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR and the scientific application I/O kernel HACC-I/O. We also use a discrete-time simulator with real HPC application traces from emerging workloads running on the Summit supercomputer to validate the effectiveness and scalability of Tarazu in large-scale storage environments. The results show improvements in load balancing and read performance of up to 33% and 43%, respectively, compared to the state of the art.
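To make the placement idea concrete, here is a minimal sketch of the kind of load-aware server selection the abstract describes: clients consult published per-server load statistics and stripe a new file across the least-loaded I/O servers. The names and the load metric are illustrative assumptions, not Tarazu's actual interface.

```python
import heapq

def select_stripe_targets(server_loads, stripe_count):
    """Pick the stripe_count least-loaded I/O servers for a new file.

    server_loads: hypothetical dict of server id -> load statistic
    (e.g., recent bandwidth utilization published by a load monitor).
    """
    return heapq.nsmallest(stripe_count, server_loads, key=server_loads.get)

# A client placing a 4-way-striped file under the current load snapshot.
loads = {"ost0": 0.91, "ost1": 0.12, "ost2": 0.55,
         "ost3": 0.08, "ost4": 0.67, "ost5": 0.30}
print(select_stripe_targets(loads, 4))  # ['ost3', 'ost1', 'ost5', 'ost2']
```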
ACM Transactions on Storage, 2024. https://doi.org/10.1145/3641885
An End-to-End High-Performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems
Nannan Zhao, Muhui Lin, Hadeel Albahar, Arnab K. Paul, Zhijie Huang, Subil Abraham, Keren Chen, Vasily Tarasov, Dimitrios Skourtis, Ali Anwar, Ali R. Butt
The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements put high pressure on the infrastructure of container registries, which store and distribute images, and on container storage systems on the Docker client side, which manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost.
We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into three tiers: a registry tier, a middle tier, and a client tier. Specifically, we first build a high-performance deduplication engine at the registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier to eliminate redundant files from the client tier without bringing deduplication overhead to the clients. To further reduce the data duplicates caused by CoW and improve container I/O performance, we utilize a container-aware storage system at the client tier that reserves space for each container and arranges the placement of files and their modifications on disk to preserve locality. Under real workloads, DupHunter reduces storage space by up to 6.9x and GET layer latency by up to 2.8x compared to the state of the art. Moreover, DupHunter can improve container I/O performance by up to 93% for reads and 64% for writes.
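As a rough illustration of the registry-tier idea, the toy content-addressed store below deduplicates identical files across layers and keeps a per-layer recipe for restore. It is a sketch under assumed names only; it omits DupHunter's fingerprint sharding, layer caching, and middle-tier offloading.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store illustrating layer deduplication."""

    def __init__(self):
        self.blobs = {}  # fingerprint -> file bytes, stored exactly once

    def put_layer(self, layer_files):
        """Deduplicate a layer; return a 'recipe' that can rebuild it."""
        recipe = []
        for path, data in layer_files.items():
            fp = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(fp, data)  # skip already-known content
            recipe.append((path, fp))
        return recipe

    def restore_layer(self, recipe):
        """Reassemble the layer from its recipe (the GET-path cost)."""
        return {path: self.blobs[fp] for path, fp in recipe}

store = DedupStore()
r1 = store.put_layer({"/bin/sh": b"elf...", "/etc/os-release": b"debian"})
r2 = store.put_layer({"/bin/sh": b"elf...", "/app/main.py": b"print(1)"})
assert len(store.blobs) == 3  # '/bin/sh' content is stored only once
assert store.restore_layer(r2)["/app/main.py"] == b"print(1)"
```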
ACM Transactions on Storage, 2024. https://doi.org/10.1145/3643819
Exploiting Data-pattern-aware Vertical Partitioning to Achieve Fast and Low-cost Cloud Log Storage
Junyu Wei, Guangyan Zhang, Junchao Chen, Yang Wang, Weimin Zheng, Tingtao Sun, Jiesheng Wu, Jiangwei Jiang
Cloud logs can be categorized into on-line, off-line, and near-line logs based on access frequency. Among them, near-line logs are mainly used for debugging, which calls for low query latency to ensure a good user experience. In addition, a storage system for near-line logs should keep the overall cost low, including the storage cost of holding compressed logs and the computation cost of compressing logs and executing queries. These requirements pose challenges to achieving fast and cheap cloud log storage.
This paper proposes LogGrep, the first log compression and query tool that exploits both static and runtime patterns to properly structure and organize log data in fine-grained units. The key idea of LogGrep is "vertical partitioning": it stores each log entry across multiple partitions by first parsing logs into variable vectors according to static patterns and then automatically extracting runtime patterns within each variable vector. Based on these runtime patterns, LogGrep further decomposes the variable vectors into fine-grained units called "Capsules" and stamps each Capsule with a summary of its values. During query processing, LogGrep avoids decompressing and scanning Capsules that cannot match the keywords, with the help of the extracted runtime patterns and the Capsule stamps. We further show that interactive debugging can exploit the advantages of the vertical-partitioning-based method while mitigating its weaknesses. To this end, LogGrep integrates incremental locating and partial reconstruction to mitigate the read amplification incurred by vertical partitioning.
We evaluate LogGrep on 37 cloud logs from the production environment of Alibaba Cloud and from public datasets. The results show that LogGrep reduces both query latency and overall cost by an order of magnitude compared with state-of-the-art works. These results confirm that it is worthwhile to apply a more sophisticated vertical-partitioning-based method to accelerate queries on compressed cloud logs.
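The stamp-and-skip query path can be sketched as follows. Here each Capsule is stamped with a simple min/max summary, which stands in for LogGrep's richer runtime-pattern stamps; the point is that a query decompresses only the Capsules whose stamp could match the keyword.

```python
import zlib

CAPSULE = 4  # values per Capsule; real systems use much larger units

def build_capsules(values):
    """Vertically partitioned column -> compressed Capsules with stamps."""
    capsules = []
    for i in range(0, len(values), CAPSULE):
        chunk = values[i:i + CAPSULE]
        stamp = (min(chunk), max(chunk))  # summary of the Capsule's values
        blob = zlib.compress("\n".join(chunk).encode())
        capsules.append((stamp, blob))
    return capsules

def query(capsules, keyword):
    """Decompress only Capsules whose stamp range may contain the keyword."""
    hits = []
    for (lo, hi), blob in capsules:
        if lo <= keyword <= hi:  # the stamp rules out all other Capsules
            hits += [v for v in zlib.decompress(blob).decode().split("\n")
                     if v == keyword]
    return hits

caps = build_capsules(sorted(f"req-{i:04d}" for i in range(32)))
print(query(caps, "req-0007"))  # touches 1 of 8 Capsules
```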
ACM Transactions on Storage, 2024. https://doi.org/10.1145/3643641
Bridging Software-Hardware for CXL Memory Disaggregation in Billion-Scale Nearest Neighbor Search
Junhyeok Jang, Hanjin Choi, Hanyeoreum Bae, Seungjun Lee, Miryeong Kwon, Myoungsoo Jung
We propose CXL-ANNS, a software-hardware collaborative approach to enable scalable approximate nearest neighbor search (ANNS) services. To this end, we first disaggregate DRAM from the host via Compute Express Link (CXL) and place all essential datasets into its memory pool. While this CXL memory pool allows ANNS to handle billion-point graphs without accuracy loss, we observe that search performance degrades significantly because of CXL's far-memory-like characteristics. To address this, CXL-ANNS considers node-level relationships and caches in local memory the neighbors that are expected to be visited most frequently. For uncached nodes, CXL-ANNS prefetches the nodes most likely to be visited soon by understanding the graph traversal behavior of ANNS. CXL-ANNS is also aware of the architectural structure of the CXL interconnect network and lets different hardware components collaborate on the search. Further, it relaxes the execution dependency of neighbor search tasks and allows ANNS to utilize all hardware in the CXL network in parallel.
Our evaluation shows that CXL-ANNS exhibits 93.3% lower query latency than the state-of-the-art ANNS platforms we tested. CXL-ANNS also outperforms an oracle ANNS system with unlimited local DRAM capacity by 68.0% in terms of latency.
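The caching and prefetching behavior can be pictured with the toy two-tier traversal below, where a small local cache fronts a slower far-memory pool and the search warms the cache with neighbors it is likely to visit next. The structure and policies here are illustrative assumptions, not the CXL-ANNS implementation.

```python
from collections import OrderedDict

class TieredGraph:
    """Adjacency lists live in slow 'far' memory; hot ones are cached locally."""

    def __init__(self, neighbors, cache_size):
        self.neighbors = neighbors  # node -> list of neighbor ids
        self.cache = OrderedDict()  # LRU cache of hot adjacency lists
        self.cache_size = cache_size
        self.far_reads = 0          # counts simulated far-memory accesses

    def adj(self, node):
        if node in self.cache:
            self.cache.move_to_end(node)  # refresh LRU position
            return self.cache[node]
        self.far_reads += 1               # miss: pay the far-memory latency
        lst = self.cache[node] = self.neighbors[node]
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)
        return lst

def greedy_search(graph, entry, dist, steps=50):
    """Best-first descent; cache warming stands in for prefetching."""
    cur = entry
    for _ in range(steps):
        nbrs = graph.adj(cur)
        for n in nbrs[:2]:          # warm the most promising next hops
            graph.adj(n)
        best = min(nbrs, key=dist, default=cur)
        if dist(best) >= dist(cur):
            return cur
        cur = best
    return cur

g = TieredGraph({0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}, cache_size=2)
print(greedy_search(g, entry=0, dist=lambda n: abs(n - 3)), g.far_reads)
```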
ACM Transactions on Storage, 2024. https://doi.org/10.1145/3639471
Polling Sanitization to Balance I/O Latency and Data Security of High-density SSDs
Jiaojiao Wu, Zhigang Cai, Fan Yang, Jun Li, Francois Trahay, Zheng Yang, Chao Wang, Jianwei Liao
Sanitization is an effective approach for ensuring data security by scrubbing invalid but sensitive data pages. It comes at a cost to storage performance, because valid pages must be moved out of each sanitization-required wordline, the logical read/write unit that consists of multiple pages in high-density SSDs. To minimize the impacts on I/O latency and data security, this paper proposes a polling-based scheduling approach for data sanitization in high-density SSDs. Our method polls a specific SSD channel to complete data sanitization at block granularity, while the other channels continue to service I/O requests. Furthermore, when selecting the sanitization block, our method assigns a low priority to blocks that are more likely to see future adjacent-page invalidations inside sanitization-required wordlines, to minimize the negative impact of moving valid pages. Through a series of emulation experiments on disk traces of real-world applications, we show that our proposal decreases the negative effects of data sanitization, measured by the risk-performance index, a unified time metric combining I/O responsiveness and the unsafe time interval, by 16.34% on average compared to related sanitization methods.
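A sketch of the scheduling policy: one channel at a time is polled to sanitize a victim block while the remaining channels keep serving I/O, and victim selection defers blocks that are still likely to accumulate adjacent-page invalidations. The block fields and the scoring are assumptions for illustration, not the paper's exact heuristics.

```python
import itertools

def pick_victim(blocks):
    """Defer blocks likely to see more invalidations soon (low priority),
    then prefer blocks with few valid pages left to migrate."""
    return min(blocks, key=lambda b: (b["future_invalidation_score"],
                                      b["valid_pages"]))

def polling_schedule(channels):
    """Yield (channel, victim block): only one channel sanitizes at a time,
    so the others remain free to service regular I/O requests."""
    for ch in itertools.cycle(sorted(channels)):
        if channels[ch]:
            yield ch, pick_victim(channels[ch])

channels = {
    0: [{"id": "b0", "valid_pages": 12, "future_invalidation_score": 0.7},
        {"id": "b1", "valid_pages": 3, "future_invalidation_score": 0.1}],
    1: [{"id": "b2", "valid_pages": 8, "future_invalidation_score": 0.2}],
}
sched = polling_schedule(channels)
print(next(sched))  # channel 0 sanitizes 'b1': low score, few valid pages
```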
ACM Transactions on Storage, 2024. https://doi.org/10.1145/3639826
An LSM Tree Augmented with B+ Tree on Nonvolatile Memory
Donguk Kim, Jongsung Lee, Keun Soo Lim, Jun Heo, Tae Jun Ham, Jae W. Lee
Modern log-structured merge (LSM) tree-based key-value stores are widely used to process update-heavy workloads effectively, as the LSM tree sequentializes write requests to a storage device to maximize storage performance. However, this append-only approach leaves many outdated copies of frequently updated key-value pairs, which must be routinely cleaned up through an operation called compaction. When the system load is modest, compaction happens in the background; at high system load, it can quickly become the major performance bottleneck. To address this compaction bottleneck and further improve the write throughput of LSM tree-based key-value stores, we propose LAB-DB, which augments the existing LSM tree with a pair of B+ trees on byte-addressable nonvolatile memory (NVM). The auxiliary B+ trees on NVM reduce both compaction frequency and compaction time, leading to lower compaction overhead for writes and fewer storage accesses for reads. In our evaluation of LAB-DB on RocksDB with the YCSB benchmarks, LAB-DB achieves 94% and 67% speedups on two write-intensive workloads (Workloads A and F) and a 43% geometric-mean speedup on the read-intensive YCSB Workloads B, C, D, and E. This performance gain comes at a low NVM cost, just 0.6% of the entire dataset size, demonstrating that LAB-DB scales with the ever-increasing volume of future datasets.
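The read/write flow the abstract describes might look like the following sketch, where plain dicts stand in for the NVM B+ tree pair and for the on-disk LSM tree; the capacity and merge trigger are illustrative assumptions.

```python
class LabDBSketch:
    NVM_CAPACITY = 4  # assumed per-tree capacity before a swap

    def __init__(self):
        self.active, self.frozen = {}, {}  # stand-ins for the NVM B+ tree pair
        self.lsm = {}                      # stand-in for the on-disk LSM tree

    def put(self, key, value):
        self.active[key] = value           # absorb the update on NVM
        if len(self.active) >= self.NVM_CAPACITY:
            self.lsm.update(self.frozen)   # 'background' merge into the LSM
            self.frozen, self.active = self.active, {}

    def get(self, key):
        for tier in (self.active, self.frozen, self.lsm):  # newest first
            if key in tier:
                return tier[key]
        return None

db = LabDBSketch()
for i in range(10):
    db.put(f"k{i % 6}", i)  # hot keys absorb updates in NVM, not the LSM
print(db.get("k3"), len(db.lsm))  # latest version wins; LSM sees few keys
```

Because repeated updates to a hot key overwrite each other inside the NVM trees before any merge, far fewer obsolete versions ever reach the LSM tree, which is the source of the reduced compaction frequency.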
ACM Transactions on Storage, 2023. https://doi.org/10.1145/3633475
gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores
Hui Sun, Jinfeng Xu, Xiangxiang Jiang, Guanzhong Chen, Yinliang Yue, Xiao Qin
The log-structured merge tree (LSM-tree) is a technological underpinning of key-value (KV) stores that supports a wide range of performance-critical applications. By reorganizing data in the background through compaction operations, such KV stores can swiftly service write requests with sequential batched disk writes and serve read requests from KV items kept sorted by compaction. Compaction demands high I/O bandwidth and CPU speed to provide quality service for user read/write requests. With the emergence of high-speed SSDs, CPUs are increasingly becoming the performance bottleneck. To mitigate this bottleneck, which limits the performance of the KV store and of the applications it supports, we propose gLSM, a system that leverages GPGPUs to substantially accelerate compaction operations. gLSM fully utilizes the parallelism and computational capability of GPGPUs to improve compaction performance. We design a driver framework that parallelizes compaction work between a CPU and a GPGPU. We exploit data independence and a GPGPU-oriented radix-sorting algorithm to conduct compaction concurrently. A key-value separation method is devised to slash the volume of data transferred from CPU-side memory to its GPGPU counterpart. The results reveal that gLSM improves throughput and compaction bandwidth by up to a factor of 2.9 and 26.0, respectively, compared with four state-of-the-art KV stores, and reduces write latency by 73.3%. gLSM also performs up to 45% better than a variant without the KV-separation and collaborative-sort modules.
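A CPU-side sketch of the compaction offload: keys are separated from values so only the compact key array (plus back-pointers) needs to reach the sorter, and Python's built-in sort stands in for the GPGPU radix-sort kernel. The names and structure are assumptions for illustration, not gLSM's driver framework.

```python
def compact(runs):
    """Merge sorted runs into one run, keeping the newest version per key.

    runs: list of runs ordered oldest -> newest; each run is a list of
    (key, value) pairs. Only (key, run_id, index) triples are 'sent to the
    GPGPU'; values stay in place until the merged layout is known.
    """
    keyed = [(key, run_id, idx)
             for run_id, run in enumerate(runs)
             for idx, (key, _) in enumerate(run)]
    keyed.sort()                    # stand-in for the GPGPU radix sort
    merged = {}
    for key, run_id, idx in keyed:  # newer runs overwrite older versions
        merged[key] = runs[run_id][idx][1]
    return list(merged.items())

old = [("a", 1), ("c", 3), ("d", 4)]
new = [("a", 9), ("b", 2)]
print(compact([old, new]))  # [('a', 9), ('b', 2), ('c', 3), ('d', 4)]
```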
ACM Transactions on Storage, 2023. https://doi.org/10.1145/3633782
Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems
Jing Wang, Youyou Lu, Qing Wang, Yuhao Zhang, Jiwu Shu
LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to support secondary indexing efficiently, since a secondary index query usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly suited to efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems that takes into account the characteristics of both PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach that filters out obsolete values with little overhead, and (3) two adapted optimizations for primary-table searches issued from secondary indexes that accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3-7x and achieves roughly two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques, even when those run on PM instead of disks.
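The validation idea can be sketched as follows: each posting records the sequence number of the version that created it, and a table of latest sequence numbers filters out obsolete postings before any primary-table lookup. This is a toy model under assumed names; Perseid's actual validation is a hybrid PM-DRAM, hash-based structure.

```python
class SecondaryIndexSketch:
    def __init__(self):
        self.postings = {}  # secondary key -> [(primary key, seqno), ...]
        self.latest = {}    # primary key -> seqno of its newest version
        self.seq = 0

    def put(self, pk, sec_key):
        self.seq += 1
        self.latest[pk] = self.seq
        self.postings.setdefault(sec_key, []).append((pk, self.seq))

    def query(self, sec_key):
        """Return live primary keys only; stale postings never reach the
        (expensive) primary-table search."""
        return [pk for pk, sn in self.postings.get(sec_key, [])
                if self.latest.get(pk) == sn]

idx = SecondaryIndexSketch()
idx.put("user1", "color=blue")
idx.put("user2", "color=blue")
idx.put("user1", "color=red")   # user1's old posting is now obsolete
print(idx.query("color=blue"))  # ['user2']
```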
ACM Transactions on Storage, 2023. https://doi.org/10.1145/3633285
Introduction to the Special Section on USENIX FAST 2023
Ashvin Goel, Dalit Naor
This special section of ACM Transactions on Storage presents a selection of papers from the 21st USENIX Conference on File and Storage Technologies (FAST '23).
ACM Transactions on Storage, 2023. https://doi.org/10.1145/3612820
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
Saurabh Kadekodi, Shashwat Silas, David Clausen, Arif Merchant
Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work, we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.
We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
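For readers new to LRCs, the sketch below shows the core local-repair property: each local group carries one XOR parity, so any single lost block is rebuilt from its small group rather than from the whole stripe. The group size is illustrative; the Uniform Cauchy construction itself, and the code's global parities, are omitted.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def encode_local_groups(data_blocks, group_size):
    """Split a stripe into local groups, each with one XOR local parity."""
    return [(data_blocks[i:i + group_size],
             xor_blocks(data_blocks[i:i + group_size]))
            for i in range(0, len(data_blocks), group_size)]

def local_repair(group, parity, lost_idx):
    """Rebuild one lost block from only its local group plus its parity."""
    survivors = [b for i, b in enumerate(group) if i != lost_idx]
    return xor_blocks(survivors + [parity])

stripe = [bytes([i] * 4) for i in range(1, 7)]        # 6 data blocks
(g0, p0), (g1, p1) = encode_local_groups(stripe, 3)   # two groups of 3
assert local_repair(g0, p0, lost_idx=1) == stripe[1]  # read 3 blocks, not 6
```

The repair cost scales with the group size rather than the stripe width, which is exactly why reliability analysis of wide stripes must weigh group-size choices carefully.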
ACM Transactions on Storage, 2023. https://doi.org/10.1145/3626198