IS-HBase: An In-Storage Computing Optimized HBase with I/O Offloading and Self-Adaptive Caching in Compute-Storage Disaggregated Infrastructure
Zhichao Cao, Huibing Dong, Yixun Wei, Shiyong Liu, D. Du
ACM Transactions on Storage (TOS), 2022-03-29. DOI: 10.1145/3488368
Active storage devices and in-storage computing have been proposed and developed in recent years to effectively reduce the amount of required data traffic and to improve overall application performance. They are especially attractive in compute-storage disaggregated infrastructure. In both techniques, a simple computing module is added to storage devices/servers so that some stored data can be processed in the storage devices/servers before being transmitted to application servers. This reduces the required network bandwidth and offloads certain computing work from application servers to storage devices/servers. However, several challenges arise when designing an in-storage computing-based architecture for applications: which computing functions to offload, how to design the protocol between in-storage modules and application servers, and how to handle caching in application servers. HBase is an important and widely used distributed key-value store. It stores and indexes key-value pairs in large files in a storage system such as HDFS. However, its performance, especially read performance, suffers from the heavy traffic between HBase RegionServers and storage servers in compute-storage disaggregated infrastructure when the available network bandwidth is limited. We propose an in-storage computing-based HBase architecture, called IS-HBase, to improve overall performance and to address the aforementioned challenges. First, IS-HBase executes a data pre-processing module (In-Storage ScanNer, called ISSN) for some read queries and returns the requested key-value pairs to RegionServers instead of returning HFile data blocks. IS-HBase also carries out compactions in storage servers, which avoids transmitting large amounts of data through the network and thus effectively reduces compaction execution time. Second, a set of new protocols is proposed to handle the communication and coordination between HBase RegionServers at computing nodes and ISSNs at storage nodes. Third, a new self-adaptive caching scheme is proposed to better serve read queries with fewer I/O operations and less network traffic. According to our experiments, IS-HBase can reduce network traffic for read queries by up to 97%, and its throughput (queries per second) is significantly less affected by fluctuations in available network bandwidth. The execution time of compaction in IS-HBase is only about 6.31%–41.84% of that of legacy HBase. In general, IS-HBase demonstrates the potential of adopting in-storage computing for other data-intensive distributed applications to significantly improve performance in compute-storage disaggregated infrastructure.
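To make the I/O-offloading idea concrete, the following minimal Python sketch contrasts an ISSN-style read path, where only the matching key-value pairs leave the storage node, with the legacy path, where whole HFile data blocks are shipped to the RegionServer for local filtering. The class name InStorageScanner, the block layout, and the byte accounting are illustrative assumptions, not the actual IS-HBase implementation.

```python
from bisect import bisect_right

class InStorageScanner:
    """Hypothetical sketch of an ISSN: it scans HFile-like blocks on the
    storage node and ships back only the requested key-value pairs,
    rather than whole data blocks."""

    def __init__(self, blocks):
        # blocks: list of lists of (key, value) pairs, each list sorted by key
        self.blocks = blocks
        # Block index: first key of each block (analogous to an HFile index)
        self.first_keys = [blk[0][0] for blk in blocks]

    def _find_block(self, key):
        idx = bisect_right(self.first_keys, key) - 1
        return self.blocks[max(idx, 0)]

    def get(self, key):
        """In-storage point lookup: only the matching pair crosses the network."""
        for k, v in self._find_block(key):
            if k == key:
                return [(k, v)]
        return []

    def legacy_get_block(self, key):
        """Legacy path: the whole data block is shipped to the RegionServer,
        which then filters it locally."""
        return self._find_block(key)

if __name__ == "__main__":
    blocks = [[(f"row{i:04d}", b"x" * 100) for i in range(start, start + 64)]
              for start in range(0, 1024, 64)]
    issn = InStorageScanner(blocks)
    filtered = issn.get("row0100")
    full_block = issn.legacy_get_block("row0100")
    print("bytes shipped (ISSN): ", sum(len(k) + len(v) for k, v in filtered))
    print("bytes shipped (legacy):", sum(len(k) + len(v) for k, v in full_block))
```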
{"title":"IS-HBase: An In-Storage Computing Optimized HBase with I/O Offloading and Self-Adaptive Caching in Compute-Storage Disaggregated Infrastructure","authors":"Zhichao Cao, Huibing Dong, Yixun Wei, Shiyong Liu, D. Du","doi":"10.1145/3488368","DOIUrl":"https://doi.org/10.1145/3488368","url":null,"abstract":"Active storage devices and in-storage computing are proposed and developed in recent years to effectively reduce the amount of required data traffic and to improve the overall application performance. They are especially preferred in the compute-storage disaggregated infrastructure. In both techniques, a simple computing module is added to storage devices/servers such that some stored data can be processed in the storage devices/servers before being transmitted to application servers. This can reduce the required network bandwidth and offload certain computing requirements from application servers to storage devices/servers. However, several challenges exist when designing an in-storage computing- based architecture for applications. These include what computing functions need to be offloaded, how to design the protocol between in-storage modules and application servers, and how to deal with the caching issue in application servers. HBase is an important and widely used distributed Key-Value Store. It stores and indexes key-value pairs in large files in a storage system like HDFS. However, its performance especially read performance, is impacted by the heavy traffics between HBase RegionServers and storage servers in the compute-storage disaggregated infrastructure when the available network bandwidth is limited. We propose an In- Storage-based HBase architecture, called IS-HBase, to improve the overall performance and to address the aforementioned challenges. First, IS-HBase executes a data pre-processing module (In-Storage ScanNer, called ISSN) for some read queries and returns the requested key-value pairs to RegionServers instead of returning data blocks in HFile. IS-HBase carries out compactions in storage servers to reduce the large amount of data being transmitted through the network and thus the compaction execution time is effectively reduced. Second, a set of new protocols is proposed to address the communication and coordination between HBase RegionServers at computing nodes and ISSNs at storage nodes. Third, a new self-adaptive caching scheme is proposed to better serve the read queries with fewer I/O operations and less network traffic. According to our experiments, the IS-HBase can reduce up to 97% network traffic for read queries and the throughput (queries per second) is significantly less affected by the fluctuation of available network bandwidth. The execution time of compaction in IS-HBase is only about 6.31% – 41.84% of the execution time of legacy HBase. 
In general, IS-HBase demonstrates the potential of adopting in-storage computing for other data-intensive distributed applications to significantly improve performance in compute-storage disaggregated infrastructure.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128000871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, Dong Dai, Yong Chen, J. Cook
ACM Transactions on Storage (TOS), 2022-03-29. DOI: 10.1145/3483447
Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, in this article we study the failure recovery and logging mechanisms of PFSs. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematically examining PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze in depth the root causes of the observed abnormal symptoms, which has led to a new patch set being merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
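As a rough illustration of the black-box fault-injection idea, the sketch below applies two hypothetical fault models (a whole-device failure and a global-inconsistency fault that silently corrupts a fraction of a node's objects) to an emulated cluster, one node at a time, and records what a caller-supplied checker (standing in for an fsck-like tool plus follow-up workloads) reports. The fault-model names, the inject_fault/run_experiment functions, and the object-store representation are placeholders, not PFault's real interface.

```python
import random

# Hypothetical fault models in the spirit of PFault's pre-defined models:
# a whole-device failure loses everything on one node; a global-inconsistency
# fault corrupts a subset of that node's objects while the device stays up.
FAULT_MODELS = ("whole_device_failure", "global_inconsistency")

def inject_fault(node_state, model, corrupt_ratio=0.1, seed=0):
    """Return a copy of a node's object store with the fault model applied."""
    rng = random.Random(seed)
    if model == "whole_device_failure":
        return {}  # the node's backing device disappears entirely
    if model == "global_inconsistency":
        faulty = dict(node_state)
        for obj in rng.sample(sorted(faulty), max(1, int(len(faulty) * corrupt_ratio))):
            faulty[obj] = b"\x00" * len(faulty[obj])  # silently corrupted content
        return faulty
    raise ValueError(f"unknown fault model: {model}")

def run_experiment(cluster, model, checker):
    """Inject the fault on each node in turn and record how the checker
    (e.g., an fsck-like tool plus follow-up workloads) behaves."""
    results = {}
    for name, state in cluster.items():
        degraded = dict(cluster)
        degraded[name] = inject_fault(state, model)
        results[name] = checker(degraded)
    return results

if __name__ == "__main__":
    cluster = {f"storage-node-{i}": {f"obj{j}": b"data" for j in range(100)}
               for i in range(3)}
    verdicts = run_experiment(cluster, "global_inconsistency",
                              checker=lambda c: "needs repair"
                              if any(v == b"\x00" * 4 for node in c.values()
                                     for v in node.values()) else "clean")
    print(verdicts)
```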
{"title":"A Study of Failure Recovery and Logging of High-Performance Parallel File Systems","authors":"Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, Dong Dai, Yong Chen, J. Cook","doi":"10.1145/3483447","DOIUrl":"https://doi.org/10.1145/3483447","url":null,"abstract":"Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables examining the PFS behavior under fault systematically. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the abnormal symptoms observed in depth, which has led to a new patch set to be merged into the coming Lustre release. In addition, we characterize the extensive logs generated in the experiments in detail and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125285041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Stream Identification to Improve Flash Endurance in Data Centers
J. Bhimani, Zhengyu Yang, Jingpei Yang, Adnan Maruf, N. Mi, R. Pandurangan, Changho Choi, V. Balakrishnan
ACM Transactions on Storage (TOS), 2022-03-29. DOI: 10.1145/3470007
The demand for high-performance I/O in Storage-as-a-Service (SaaS) is increasing day by day. To address this demand, NAND flash-based solid-state drives (SSDs) are commonly used in data centers as cache or top tiers in the storage rack owing to their superior performance compared to traditional hard disk drives (HDDs). Meanwhile, with the capital expenditure of SSDs declining and the storage capacity of SSDs increasing, all-flash data centers are evolving to serve cloud services better than SSD-HDD hybrid data centers. During this transition, the biggest challenge is how to reduce the Write Amplification Factor (WAF) and improve the endurance of SSDs, since these devices have a limited number of program/erase cycles. A specific case is that storing data with different lifetimes (i.e., I/O streams with similar temporal fetching patterns such as reaccess frequency) in a single SSD can cause a high WAF, reduce endurance, and degrade SSD performance. Motivated by this, multi-stream SSDs have been developed to enable data with different lifetimes to be stored in different SSD regions. The logic behind this is to reduce the internal movement of data: when garbage collection is triggered, there is a high chance that data blocks contain pages that are either all invalid or all valid. However, the limitation of this technology is that the system needs to manually assign the same streamID to data with a similar lifetime. Unfortunately, when data arrives, it is not known how important this data is and how long it will stay unmodified. Moreover, according to our observation, with different definitions of lifetime (i.e., different calculation formulas based on features previously exhibited by the data, such as sequentiality and frequency), streamID identification may have varying impacts on the final WAF of multi-stream SSDs. Thus, in this article, we first develop a portable and adaptable framework to study the impacts of different workload features and their combinations on write amplification. We then propose a feature-based stream identification approach, which automatically correlates measurable workload attributes (such as I/O size and I/O rate) with high-level workload features (such as frequency and sequentiality) and determines the right combination of workload features for assigning streamIDs. Finally, we develop an adaptable stream assignment technique to assign streamIDs dynamically for changing workloads. Our evaluation results show that our automated approach to stream detection and separation can effectively reduce the WAF by using appropriate features for stream assignment with minimal implementation overhead.
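A minimal sketch of the feature-based idea follows: each write's measurable attributes (LBA and size) are turned into higher-level features (sequentiality, update frequency), combined into an estimated-lifetime score, and bucketed into a streamID. The feature formulas, weights, and bucket edges are illustrative assumptions, not the tuned values or the exact feature set used in the article.

```python
from collections import defaultdict

class StreamAssigner:
    """Hypothetical sketch of feature-based stream assignment: measurable
    attributes of each write (LBA, size) are turned into higher-level
    features (sequentiality, update frequency), which are combined into a
    lifetime score that is bucketed into a streamID."""

    def __init__(self, num_streams=4):
        self.num_streams = num_streams
        self.write_count = defaultdict(int)   # updates seen per LBA
        self.last_lba_end = None              # end of the previous write

    def assign(self, lba, size):
        # Feature 1: sequentiality -- does this write continue the last one?
        sequential = 1.0 if self.last_lba_end == lba else 0.0
        # Feature 2: frequency -- how often has this LBA been rewritten?
        self.write_count[lba] += 1
        frequency = min(self.write_count[lba] / 10.0, 1.0)
        self.last_lba_end = lba + size
        # Combine features into a short-lifetime score: hot, random data is
        # expected to die soon; cold, sequential data to live long.
        score = 0.6 * frequency + 0.4 * (1.0 - sequential)
        # Bucket the score into one of num_streams stream IDs.
        return min(int(score * self.num_streams), self.num_streams - 1)

if __name__ == "__main__":
    assigner = StreamAssigner()
    workload = [(0, 8), (8, 8), (16, 8), (0, 8), (0, 8), (4096, 8)]
    for lba, size in workload:
        print(f"write lba={lba:5d} -> streamID {assigner.assign(lba, size)}")
```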
{"title":"Automatic Stream Identification to Improve Flash Endurance in Data Centers","authors":"J. Bhimani, Zhengyu Yang, Jingpei Yang, Adnan Maruf, N. Mi, R. Pandurangan, Changho Choi, V. Balakrishnan","doi":"10.1145/3470007","DOIUrl":"https://doi.org/10.1145/3470007","url":null,"abstract":"The demand for high performance I/O in Storage-as-a-Service (SaaS) is increasing day by day. To address this demand, NAND Flash-based Solid-state Drives (SSDs) are commonly used in data centers as cache- or top-tiers in the storage rack ascribe to their superior performance compared to traditional hard disk drives (HDDs). Meanwhile, with the capital expenditure of SSDs declining and the storage capacity of SSDs increasing, all-flash data centers are evolving to serve cloud services better than SSD-HDD hybrid data centers. During this transition, the biggest challenge is how to reduce the Write Amplification Factor (WAF) as well as to improve the endurance of SSD since this device has a limited program/erase cycles. A specified case is that storing data with different lifetimes (i.e., I/O streams with similar temporal fetching patterns such as reaccess frequency) in one single SSD can cause high WAF, reduce the endurance, and downgrade the performance of SSDs. Motivated by this, multi-stream SSDs have been developed to enable data with a different lifetime to be stored in different SSD regions. The logic behind this is to reduce the internal movement of data—when garbage collection is triggered, there are high chances of having data blocks with either all the pages being invalid or valid. However, the limitation of this technology is that the system needs to manually assign the same streamID to data with a similar lifetime. Unfortunately, when data arrives, it is not known how important this data is and how long this data will stay unmodified. Moreover, according to our observation, with different definitions of a lifetime (i.e., different calculation formulas based on selected features previously exhibited by data, such as sequentiality, and frequency), streamID identification may have varying impacts on the final WAF of multi-stream SSDs. Thus, in this article, we first develop a portable and adaptable framework to study the impacts of different workload features and their combinations on write amplification. We then propose a feature-based stream identification approach, which automatically co-relates the measurable workload attributes (such as I/O size, I/O rate, and so on.) with high-level workload features (such as frequency, sequentiality, and so on.) and determines a right combination of workload features for assigning streamIDs. Finally, we develop an adaptable stream assignment technique to assign streamID for changing workloads dynamically. 
Our evaluation results show that our automation approach of stream detection and separation can effectively reduce the WAF by using appropriate features for stream assignment with minimal implementation overhead.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"12 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132562513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterization Summary of Performance, Reliability, and Threshold Voltage Distribution of 3D Charge-Trap NAND Flash Memory
Weihua Liu, Fei Wu, Xiang Chen, Meng Zhang, Yu Wang, Xiangfeng Lu, Changsheng Xie
ACM Transactions on Storage (TOS), 2022-03-10. DOI: 10.1145/3491230
Solid-state drives (SSDs) are gradually coming to dominate high-performance storage scenarios. Three-dimensional (3D) NAND flash memory, with its high storage capacity, is becoming a mainstream storage component of SSDs. However, the interferences in the new 3D charge-trap (CT) NAND flash are becoming unprecedentedly complicated, leading to many reliability and performance problems. Alleviating these problems requires a deep understanding of the characteristics of 3D CT NAND flash memory. To facilitate such understanding, in this article, we delve into characterizing the performance, reliability, and threshold voltage (Vth) distribution of 3D CT NAND flash memory. We summarize these characteristics under multiple interferences and variations and give several new insights and a characterization methodology. In particular, we characterize the skewed Vth distribution, Vth shift laws, and the layer variation exclusive to 3D NAND flash memory. This characterization is the backbone of designing more reliable and efficient flash-based storage solutions.
{"title":"Characterization Summary of Performance, Reliability, and Threshold Voltage Distribution of 3D Charge-Trap NAND Flash Memory","authors":"Weihua Liu, Fei Wu, Xiang Chen, Meng Zhang, Yu Wang, Xiangfeng Lu, Changsheng Xie","doi":"10.1145/3491230","DOIUrl":"https://doi.org/10.1145/3491230","url":null,"abstract":"Solid-state drive (SSD) gradually dominates in the high-performance storage scenarios. Three-dimension (3D) NAND flash memory owning high-storage capacity is becoming a mainstream storage component of SSD. However, the interferences of the new 3D charge-trap (CT) NAND flash are getting unprecedentedly complicated, yielding to many problems regarding reliability and performance. Alleviating these problems needs to understand the characteristics of 3D CT NAND flash memory deeply. To facilitate such understanding, in this article, we delve into characterizing the performance, reliability, and threshold voltage (Vth) distribution of 3D CT NAND flash memory. We make a summary of these characteristics with multiple interferences and variations and give several new insights and a characterization methodology. Especially, we characterize the skewed (Vth) distribution, (Vth) shift laws, and the exclusive layer variation in 3D NAND flash memory. The characterization is the backbone of designing more reliable and efficient flash-based storage solutions.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126127621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HintStor: A Framework to Study I/O Hints in Heterogeneous Storage
Xiongzi Ge, Zhichao Cao, D. Du, P. Ganesan, Dennis Hahn
ACM Transactions on Storage (TOS), 2022-03-10. DOI: 10.1145/3489143
To bridge the giant semantic gap between applications and modern storage systems, passing small pieces of useful information, called I/O access hints, from upper layers to the storage layer may greatly improve application performance and ease data management in storage systems. This is especially true for heterogeneous storage systems that consist of multiple types of storage devices. Since ingesting external access hints will likely involve laborious modifications of legacy I/O stacks, it is very hard to evaluate the effect of access hints and take advantage of them. In this article, we design a generic and flexible framework, called HintStor, to quickly experiment with a set of I/O access hints and evaluate their impacts on heterogeneous storage systems. HintStor provides a new application/user-level interface and a file system plugin, and it performs data management with a generic block storage data manager. We demonstrate the flexibility of HintStor by evaluating four types of access hints: file system data classification, stream ID, cloud prefetch, and I/O task scheduling on a Linux platform. The results show that HintStor can execute and evaluate various I/O access hints under different scenarios with minor modifications to the kernel and applications.
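As a rough sketch of how an access hint might flow from an upper layer down to a block-level data manager in a heterogeneous array, the code below maps hypothetical hot/cold/prefetch hints to SSD or HDD placement decisions. The Hint and BlockDataManager names and the hint semantics are illustrative assumptions, not HintStor's actual interface.

```python
from enum import Enum

class Hint(Enum):
    """Illustrative access hints in the spirit of those HintStor evaluates."""
    HOT = "hot"            # frequently accessed; prefer the fast tier
    COLD = "cold"          # rarely accessed; eligible for the slow tier
    PREFETCH = "prefetch"  # will be read soon; stage it proactively

class BlockDataManager:
    """Hypothetical block-level data manager for a heterogeneous array:
    hints passed from upper layers steer extents to an SSD or HDD tier."""

    def __init__(self):
        self.placement = {}  # extent id -> device tier

    def submit_hint(self, extent, hint):
        if hint in (Hint.HOT, Hint.PREFETCH):
            self.placement[extent] = "ssd"
        elif hint == Hint.COLD:
            self.placement[extent] = "hdd"

    def tier_of(self, extent):
        return self.placement.get(extent, "hdd")  # default to the slow tier

if __name__ == "__main__":
    mgr = BlockDataManager()
    mgr.submit_hint(extent=42, hint=Hint.HOT)    # e.g., data classified as hot metadata
    mgr.submit_hint(extent=43, hint=Hint.COLD)   # e.g., a cold backup stream
    print(mgr.tier_of(42), mgr.tier_of(43))
```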
{"title":"HintStor: A Framework to Study I/O Hints in Heterogeneous Storage","authors":"Xiongzi Ge, Zhichao Cao, D. Du, P. Ganesan, Dennis Hahn","doi":"10.1145/3489143","DOIUrl":"https://doi.org/10.1145/3489143","url":null,"abstract":"To bridge the giant semantic gap between applications and modern storage systems, passing a piece of tiny and useful information, called I/O access hints, from upper layers to the storage layer may greatly improve application performance and ease data management in storage systems. This is especially true for heterogeneous storage systems that consist of multiple types of storage devices. Since ingesting external access hints will likely involve laborious modifications of legacy I/O stacks, it is very hard to evaluate the effect and take advantages of access hints. In this article, we design a generic and flexible framework, called HintStor, to quickly play with a set of I/O access hints and evaluate their impacts on heterogeneous storage systems. HintStor provides a new application/user-level interface, a file system plugin, and performs data management with a generic block storage data manager. We demonstrate the flexibility of HintStor by evaluating four types of access hints: file system data classification, stream ID, cloud prefetch, and I/O task scheduling on a Linux platform. The results show that HintStor can execute and evaluate various I/O access hints under different scenarios with minor modifications to the kernel and applications.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132985625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power-optimized Deployment of Key-value Stores Using Storage Class Memory
H. Kassa, Jason B. Akers, Mrinmoy Ghosh, Zhichao Cao, V. Gogte, R. Dreslinski
ACM Transactions on Storage (TOS), 2022-03-10. DOI: 10.1145/3511905
High-performance flash-based key-value stores in data centers utilize large amounts of DRAM to cache hot data. However, motivated by the high cost and power consumption of DRAM, server designs with a lower DRAM-per-compute ratio are becoming popular. These low-cost servers enable scale-out services by reducing server workload densities. This results in improvements to overall service reliability, leading to a decrease in the total cost of ownership (TCO) for scalable workloads. Nevertheless, for key-value stores with large memory footprints, these reduced-DRAM servers degrade performance due to an increase in both I/O utilization and data access latency. In this scenario, a standard practice to improve performance for sharded databases is to reduce the number of shards per machine, which erodes the TCO benefits of reduced-DRAM low-cost servers. In this work, we explore a practical solution to improve performance and reduce the cost and power consumption of key-value stores running on DRAM-constrained servers by using Storage Class Memories (SCM). SCMs in a DIMM form factor, although slower than DRAM, are sufficiently faster than flash when serving as a large extension to DRAM. With new technologies like Compute Express Link, we can expand the memory capacity of servers with high-bandwidth, low-latency connectivity to SCM. In this article, we use Intel Optane PMem 100 Series SCMs (DCPMM) in AppDirect mode to extend the available memory of our existing single-socket platform deployment of RocksDB (one of the largest key-value stores at Meta). We first designed a hybrid cache in RocksDB to harness both DRAM and SCM hierarchically. We then characterized the performance of the hybrid cache for three of the largest RocksDB use cases at Meta (ChatApp, BLOB Metadata, and Hive Cache). Our results demonstrate that we can achieve up to 80% improvement in throughput and 20% improvement in P95 latency over the existing small-DRAM single-socket platform, while maintaining a 43–48% cost improvement over our large-DRAM dual-socket platform. To the best of our knowledge, this is the first study of the DCPMM platform in a commercial data center.
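A minimal sketch of the hierarchical idea behind such a hybrid cache is shown below: a small DRAM tier holds the hottest blocks, blocks evicted from DRAM are demoted into a larger SCM tier rather than dropped, and SCM hits are promoted back to DRAM. This is an illustrative LRU-based sketch, not the hybrid cache implemented in RocksDB; capacities, admission policy, and all persistence details are simplified assumptions.

```python
from collections import OrderedDict

class HybridCache:
    """Minimal sketch of a hierarchical DRAM + SCM block cache: hot blocks
    sit in a small DRAM tier; blocks evicted from DRAM are demoted into a
    larger SCM tier instead of being dropped, and SCM hits are promoted
    back to DRAM.  Capacities are in block counts for simplicity."""

    def __init__(self, dram_blocks, scm_blocks):
        self.dram = OrderedDict()
        self.scm = OrderedDict()
        self.dram_cap = dram_blocks
        self.scm_cap = scm_blocks

    def _insert_dram(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        while len(self.dram) > self.dram_cap:
            victim, vval = self.dram.popitem(last=False)  # evict LRU from DRAM
            self.scm[victim] = vval                       # demote to SCM
            self.scm.move_to_end(victim)
            while len(self.scm) > self.scm_cap:
                self.scm.popitem(last=False)              # SCM is the last level

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.scm:
            value = self.scm.pop(key)                     # promote on SCM hit
            self._insert_dram(key, value)
            return value
        return None                                       # miss: caller reads flash

    def put(self, key, value):
        self.scm.pop(key, None)
        self._insert_dram(key, value)

if __name__ == "__main__":
    cache = HybridCache(dram_blocks=2, scm_blocks=4)
    for blk in ["a", "b", "c", "d"]:
        cache.put(blk, blk.encode())
    print(sorted(cache.dram), sorted(cache.scm))  # hottest two in DRAM, rest in SCM
    cache.get("a")                                # promotes "a" back into DRAM
    print(sorted(cache.dram), sorted(cache.scm))
```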
{"title":"Power-optimized Deployment of Key-value Stores Using Storage Class Memory","authors":"H. Kassa, Jason B. Akers, Mrinmoy Ghosh, Zhichao Cao, V. Gogte, R. Dreslinski","doi":"10.1145/3511905","DOIUrl":"https://doi.org/10.1145/3511905","url":null,"abstract":"High-performance flash-based key-value stores in data-centers utilize large amounts of DRAM to cache hot data. However, motivated by the high cost and power consumption of DRAM, server designs with lower DRAM-per-compute ratio are becoming popular. These low-cost servers enable scale-out services by reducing server workload densities. This results in improvements to overall service reliability, leading to a decrease in the total cost of ownership (TCO) for scalable workloads. Nevertheless, for key-value stores with large memory footprints, these reduced DRAM servers degrade performance due to an increase in both IO utilization and data access latency. In this scenario, a standard practice to improve performance for sharded databases is to reduce the number of shards per machine, which degrades the TCO benefits of reduced DRAM low-cost servers. In this work, we explore a practical solution to improve performance and reduce the costs and power consumption of key-value stores running on DRAM-constrained servers by using Storage Class Memories (SCM). SCMs in a DIMM form factor, although slower than DRAM, are sufficiently faster than flash when serving as a large extension to DRAM. With new technologies like Compute Express Link, we can expand the memory capacity of servers with high bandwidth and low latency connectivity with SCM. In this article, we use Intel Optane PMem 100 Series SCMs (DCPMM) in AppDirect mode to extend the available memory of our existing single-socket platform deployment of RocksDB (one of the largest key-value stores at Meta). We first designed a hybrid cache in RocksDB to harness both DRAM and SCM hierarchically. We then characterized the performance of the hybrid cache for three of the largest RocksDB use cases at Meta (ChatApp, BLOB Metadata, and Hive Cache). Our results demonstrate that we can achieve up to 80% improvement in throughput and 20% improvement in P95 latency over the existing small DRAM single-socket platform, while maintaining a 43–48% cost improvement over our large DRAM dual-socket platform. To the best of our knowledge, this is the first study of the DCPMM platform in a commercial data center.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115624454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of Distributed File System Design Choices
P. Macko, Jason Hennessey
ACM Transactions on Storage (TOS), 2022-02-28. DOI: 10.1145/3465405
Decades of research on distributed file systems and storage systems exist. New researchers and engineers have a lot of literature to study, but only a comparatively small number of high-level design choices are available when creating a distributed file system. And within each aspect of the system, typically several common approaches are used. So, rather than surveying distributed file systems, this article presents a survey of important design decisions and, within those decisions, the most commonly used options. It also presents a qualitative exploration of their tradeoffs. We also include several relatively recent designs and their variations that illustrate other, underexplored tradeoff choices in the design space. In doing so, we provide a primer on distributed file systems, and we also show areas that are overexplored and underexplored, in the hopes of inspiring new research.
{"title":"Survey of Distributed File System Design Choices","authors":"P. Macko, Jason Hennessey","doi":"10.1145/3465405","DOIUrl":"https://doi.org/10.1145/3465405","url":null,"abstract":"Decades of research on distributed file systems and storage systems exists. New researchers and engineers have a lot of literature to study, but only a comparatively small number of high-level design choices are available when creating a distributed file system. And within each aspect of the system, typically several common approaches are used. So, rather than surveying distributed file systems, this article presents a survey of important design decisions and, within those decisions, the most commonly used options. It also presents a qualitative exploration of their tradeoffs. We include several relatively recent designs and their variations that illustrate other tradeoff choices in the design space, despite being underexplored. In doing so, we provide a primer on distributed file systems, and we also show areas that are overexplored and underexplored, in the hopes of inspiring new research.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114517568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Storage Performance with Calibrated Interrupts
Amy Tai, I. Smolyar, M. Wei, Dan Tsafrir
ACM Transactions on Storage (TOS), 2022-02-28. DOI: 10.1145/3505139
After request completion, an I/O device must decide whether to minimize latency by immediately firing an interrupt or to optimize for throughput by delaying the interrupt, anticipating that more requests will complete soon and help amortize the interrupt cost. Devices employ adaptive interrupt coalescing heuristics that try to balance between these opposing goals. Unfortunately, because devices lack the semantic information about which I/O requests are latency-sensitive, these heuristics can sometimes lead to disastrous results. Instead, we propose addressing the root cause of the heuristics problem by allowing software to explicitly specify to the device whether submitted requests are latency-sensitive. The device then “calibrates” its interrupts to completions of latency-sensitive requests. We focus on NVMe storage devices and show that it is natural to express these semantics in the kernel and the application, and that doing so requires only a modest two-bit change to the device interface. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency when interrupts are coalesced.
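The decision the device makes on each completion, fire an interrupt now or keep coalescing, conditioned on whether software marked the request as latency-sensitive, can be sketched as follows. The Completion/CalibratedCoalescer names, the urgent flag, and the batch threshold are illustrative assumptions rather than the paper's NVMe interface bits or real device parameters.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Completion:
    request_id: int
    urgent: bool  # set by software when the request is latency-sensitive

@dataclass
class CalibratedCoalescer:
    """Sketch of interrupt calibration: completions of requests flagged as
    latency-sensitive fire an interrupt immediately; other completions are
    coalesced until a count threshold is reached (a timeout would also
    flush in practice).  Thresholds here are illustrative only."""
    max_batch: int = 8
    pending: List[Completion] = field(default_factory=list)

    def on_completion(self, completion):
        self.pending.append(completion)
        if completion.urgent or len(self.pending) >= self.max_batch:
            return self._fire()
        return []  # keep coalescing

    def _fire(self):
        batch, self.pending = self.pending, []
        return batch  # one interrupt delivers this whole batch to the host

if __name__ == "__main__":
    dev = CalibratedCoalescer(max_batch=4)
    # Background writes coalesce; a latency-sensitive read flushes immediately.
    for i in range(3):
        assert dev.on_completion(Completion(i, urgent=False)) == []
    delivered = dev.on_completion(Completion(99, urgent=True))
    print([c.request_id for c in delivered])  # -> [0, 1, 2, 99]
```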
{"title":"Optimizing Storage Performance with Calibrated Interrupts","authors":"Amy Tai, I. Smolyar, M. Wei, Dan Tsafrir","doi":"10.1145/3505139","DOIUrl":"https://doi.org/10.1145/3505139","url":null,"abstract":"After request completion, an I/O device must decide whether to minimize latency by immediately firing an interrupt or to optimize for throughput by delaying the interrupt, anticipating that more requests will complete soon and help amortize the interrupt cost. Devices employ adaptive interrupt coalescing heuristics that try to balance between these opposing goals. Unfortunately, because devices lack the semantic information about which I/O requests are latency-sensitive, these heuristics can sometimes lead to disastrous results. Instead, we propose addressing the root cause of the heuristics problem by allowing software to explicitly specify to the device if submitted requests are latency-sensitive. The device then “calibrates” its interrupts to completions of latency-sensitive requests. We focus on NVMe storage devices and show that it is natural to express these semantics in the kernel and the application and only requires a modest two-bit change to the device interface. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency when interrupts are coalesced.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"218 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122514319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the Special Section on USENIX OSDI 2021
Angela Demke Brown, Jacob R. Lorch
ACM Transactions on Storage (TOS), 2022-01-29. DOI: 10.1145/3507950
This special section of the ACM Transactions on Storage presents some of the highlights of the storage-related papers published in the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’21). The OSDI Symposium emphasizes innovative research as well as quantified or insightful experiences in systems design and implementation. Despite OSDI’s broad view of the systems area, the design and implementation of storage systems have always been important topics for OSDI. In particular, out of the 165 OSDI ’21 submissions, 26 (16%) addressed various storage-related aspects; out of its 31 accepted papers, 6 (19%) addressed storage-related themes, constituting a significant part of the OSDI ’21 program. From these, for this special section of ACM Transactions on Storage, we selected two high-quality papers. Each includes some additional material, which has been reviewed (in fast-track mode) by a subset of its original OSDI ’21 reviewers. The first article is “Nap: Persistent Memory Indexes for NUMA Architectures” by Qing Wang, Youyou Lu, Junru Li, and Jiwu Shu. This is an expanded version of the OSDI ’21 paper “NAP: A Black-Box Approach to NUMA-Aware Persistent Memory Indexes.” It introduces a NUMA-aware layer above existing persistent memory indexes, consisting of a volatile DRAM component and persistent, crash-consistent per-NUMA-node components. Reads and writes to hot items are handled via the NUMA-aware layer, alleviating the performance issues with cross-node access to persistent memory. The second article is “Optimizing Storage Performance with Calibrated Interrupts” by Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. This paper presents a new interface to help devices make the tradeoff between triggering an interrupt immediately after an I/O request completes, reducing latency, and waiting until multiple interrupts can be coalesced, reducing interrupt-handling overhead. This interface allows applications to inform devices about latency-sensitive requests, allowing interrupt generation to be aligned with application requirements. We hope you enjoy these expanded versions and find both papers interesting and insightful.
{"title":"Introduction to the Special Section on USENIX OSDI 2021","authors":"Angela Demke Brown, Jacob R. Lorch","doi":"10.1145/3507950","DOIUrl":"https://doi.org/10.1145/3507950","url":null,"abstract":"This special section of the ACM Transactions on Storage presents some of the highlights of the storage-related papers published in the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’21). The OSDI Symposium emphasizes innovative research as well as quantified or insightful experiences in systems design and implementation. Despite OSDI ’s broad view of the systems area, the design and implementation of storage systems have always been important topics for OSDI. In particular, out of the 165 OSDI ’21 submissions, 26 of them (=16%) addressed various storage-related aspects; out of its 31 accepted papers, 6 (=19%) addressed storage-related themes, constituting a significant part of the OSDI ’21 program. Out of the above, for this special section of ACM Transactions on Storage, we selected two high-quality papers. Each includes some additional material, which has been reviewed (in fasttrack mode) by a subset of its original OSDI ’21 reviewers. The first article is “Nap: Persistent Memory Indexes for NUMA Architectures” by Qing Wang, Youyou Lu, Junru Li, and Jiwu Shu. This is an expanded version of the OSDI ’21 paper “NAP: A Black-Box Approach to NUMA-Aware Persistent Memory Indexes.” It introduces a NUMA-aware layer above existing persistent memory indexes, consisting of a volatile DRAM component and persistent, crash-consistent per-NUMA-node components. Reads and writes to hot items are handled via the NUMA-aware layer, alleviating the performance issues with cross-node access to persistent memory. The second article is “Optimizing Storage Performance with Calibrated Interrupts” by Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. This paper presents a new interface to help devices make the tradeoff between triggering an interrupt immediately after an I/O request completes, reducing latency, and waiting until multiple interrupts can be coalesced, reducing interrupt-handling overhead. This interface allows applications to inform devices about latency-sensitive requests, allowing interrupt generation to be aligned with application requirements. We hope you enjoy these expanded versions and find both papers interesting and insightful.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124029644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nap: Persistent Memory Indexes for NUMA Architectures
Qing Wang, Youyou Lu, Junru Li, Minhui Xie, J. Shu
ACM Transactions on Storage (TOS), 2022-01-29. DOI: 10.1145/3507922
We present Nap, a black-box approach that converts concurrent persistent memory (PM) indexes into non-uniform memory access (NUMA)-aware counterparts. Based on the observation that real-world workloads always feature skewed access patterns, Nap introduces a NUMA-aware layer (NAL) on top of existing concurrent PM indexes and steers accesses to hot items to this layer. The NAL maintains (1) per-node partial views in PM for serving insert/update/delete operations with failure atomicity and (2) a global view in DRAM for serving lookup operations. The NAL eliminates remote PM accesses to hot items without inducing extra local PM accesses. Moreover, to handle dynamic workloads, Nap adopts a fast NAL switch mechanism. We convert five state-of-the-art PM indexes using Nap. Evaluation on a four-node machine with Optane DC Persistent Memory shows that Nap can improve the throughput by up to 2.3× and 1.56× under write-intensive and read-intensive workloads, respectively.
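A black-box wrapper in the spirit of the NAL can be sketched as below: hot keys are served by the NUMA-aware layer (lookups from a DRAM global view, updates into the calling node's partial view), while cold keys fall through to the wrapped index. The NapWrapper class, the static hot set, and the plain-dict "PM" views are illustrative assumptions; failure atomicity, the NAL switch, and concurrency control are omitted.

```python
class NapWrapper:
    """Black-box sketch of Nap's NUMA-aware layer (NAL) over an existing
    index.  Hot keys are served by the NAL: lookups hit a DRAM global view,
    and updates go to the per-NUMA-node partial view of the calling node.
    Cold keys fall through to the wrapped PM index.  Crash consistency and
    the NAL switch for shifting hot sets are omitted in this sketch."""

    def __init__(self, pm_index, hot_keys, num_nodes):
        self.pm_index = pm_index                       # existing concurrent PM index
        self.hot_keys = set(hot_keys)                  # current hot set
        self.global_view = {}                          # DRAM: serves lookups
        self.node_views = [dict() for _ in range(num_nodes)]  # "PM": per-node writes

    def put(self, key, value, numa_node):
        if key in self.hot_keys:
            self.node_views[numa_node][key] = value    # local-node PM write only
            self.global_view[key] = value              # keep the DRAM view current
        else:
            self.pm_index[key] = value                 # cold path: underlying index

    def get(self, key):
        if key in self.hot_keys:
            return self.global_view.get(key)           # no cross-node PM access
        return self.pm_index.get(key)

if __name__ == "__main__":
    underlying = {}                                    # stand-in for a real PM index
    nap = NapWrapper(underlying, hot_keys={"user:1"}, num_nodes=4)
    nap.put("user:1", "hot-value", numa_node=2)        # absorbed by the NAL
    nap.put("user:9", "cold-value", numa_node=0)       # goes to the wrapped index
    print(nap.get("user:1"), nap.get("user:9"))
```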
{"title":"Nap: Persistent Memory Indexes for NUMA Architectures","authors":"Qing Wang, Youyou Lu, Junru Li, Minhui Xie, J. Shu","doi":"10.1145/3507922","DOIUrl":"https://doi.org/10.1145/3507922","url":null,"abstract":"We present Nap, a black-box approach that converts concurrent persistent memory (PM) indexes into non-uniform memory access (NUMA)-aware counterparts. Based on the observation that real-world workloads always feature skewed access patterns, Nap introduces a NUMA-aware layer (NAL) on the top of existing concurrent PM indexes, and steers accesses to hot items to this layer. The NAL maintains (1) per-node partial views in PM for serving insert/update/delete operations with failure atomicity and (2) a global view in DRAM for serving lookup operations. The NAL eliminates remote PM accesses to hot items without inducing extra local PM accesses. Moreover, to handle dynamic workloads, Nap adopts a fast NAL switch mechanism. We convert five state-of-the-art PM indexes using Nap. Evaluation on a four-node machine with Optane DC Persistent Memory shows that Nap can improve the throughput by up to 2.3× and 1.56× under write-intensive and read-intensive workloads, respectively.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129395638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}