IS-HBase: An In-Storage Computing Optimized HBase with I/O Offloading and Self-Adaptive Caching in Compute-Storage Disaggregated Infrastructure
Zhichao Cao, Huibing Dong, Yixun Wei, Shiyong Liu, D. Du
ACM Transactions on Storage (TOS), 2022-03-29. DOI: 10.1145/3488368
Active storage devices and in-storage computing have been proposed and developed in recent years to effectively reduce the amount of required data traffic and to improve overall application performance. They are especially attractive in compute-storage disaggregated infrastructure. In both techniques, a simple computing module is added to storage devices/servers so that some stored data can be processed in the storage devices/servers before being transmitted to application servers. This reduces the required network bandwidth and offloads certain computing work from application servers to storage devices/servers. However, several challenges arise when designing an in-storage computing-based architecture for applications: which computing functions to offload, how to design the protocol between in-storage modules and application servers, and how to handle caching in application servers. HBase is an important and widely used distributed key-value store. It stores and indexes key-value pairs in large files in a storage system such as HDFS. However, its performance, especially read performance, suffers from the heavy traffic between HBase RegionServers and storage servers in compute-storage disaggregated infrastructure when the available network bandwidth is limited. We propose an in-storage computing-based HBase architecture, called IS-HBase, to improve overall performance and to address the aforementioned challenges. First, IS-HBase executes a data pre-processing module (In-Storage ScanNer, called ISSN) for some read queries and returns the requested key-value pairs to RegionServers instead of returning HFile data blocks. IS-HBase also carries out compactions in storage servers, which avoids transmitting large amounts of data through the network and thus effectively reduces compaction execution time. Second, a set of new protocols is proposed to handle the communication and coordination between HBase RegionServers at computing nodes and ISSNs at storage nodes. Third, a new self-adaptive caching scheme is proposed to better serve read queries with fewer I/O operations and less network traffic. According to our experiments, IS-HBase can reduce network traffic for read queries by up to 97%, and its throughput (queries per second) is significantly less affected by fluctuations in available network bandwidth. The execution time of compaction in IS-HBase is only about 6.31%–41.84% of that of legacy HBase. In general, IS-HBase demonstrates the potential of adopting in-storage computing for other data-intensive distributed applications to significantly improve performance in compute-storage disaggregated infrastructure.
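To make the I/O-offloading idea concrete, the following minimal Python sketch contrasts an ISSN-style read path, where only the matching key-value pairs leave the storage node, with the legacy path, where whole HFile data blocks are shipped to the RegionServer for local filtering. The class name InStorageScanner, the block layout, and the byte accounting are illustrative assumptions, not the actual IS-HBase implementation.

```python
from bisect import bisect_right

class InStorageScanner:
    """Hypothetical sketch of an ISSN: it scans HFile-like blocks on the
    storage node and ships back only the requested key-value pairs,
    rather than whole data blocks."""

    def __init__(self, blocks):
        # blocks: list of lists of (key, value) pairs, each list sorted by key
        self.blocks = blocks
        # Block index: first key of each block (analogous to an HFile index)
        self.first_keys = [blk[0][0] for blk in blocks]

    def _find_block(self, key):
        idx = bisect_right(self.first_keys, key) - 1
        return self.blocks[max(idx, 0)]

    def get(self, key):
        """In-storage point lookup: only the matching pair crosses the network."""
        for k, v in self._find_block(key):
            if k == key:
                return [(k, v)]
        return []

    def legacy_get_block(self, key):
        """Legacy path: the whole data block is shipped to the RegionServer,
        which then filters it locally."""
        return self._find_block(key)

if __name__ == "__main__":
    blocks = [[(f"row{i:04d}", b"x" * 100) for i in range(start, start + 64)]
              for start in range(0, 1024, 64)]
    issn = InStorageScanner(blocks)
    filtered = issn.get("row0100")
    full_block = issn.legacy_get_block("row0100")
    print("bytes shipped (ISSN): ", sum(len(k) + len(v) for k, v in filtered))
    print("bytes shipped (legacy):", sum(len(k) + len(v) for k, v in full_block))
```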
{"title":"IS-HBase: An In-Storage Computing Optimized HBase with I/O Offloading and Self-Adaptive Caching in Compute-Storage Disaggregated Infrastructure","authors":"Zhichao Cao, Huibing Dong, Yixun Wei, Shiyong Liu, D. Du","doi":"10.1145/3488368","DOIUrl":"https://doi.org/10.1145/3488368","url":null,"abstract":"Active storage devices and in-storage computing are proposed and developed in recent years to effectively reduce the amount of required data traffic and to improve the overall application performance. They are especially preferred in the compute-storage disaggregated infrastructure. In both techniques, a simple computing module is added to storage devices/servers such that some stored data can be processed in the storage devices/servers before being transmitted to application servers. This can reduce the required network bandwidth and offload certain computing requirements from application servers to storage devices/servers. However, several challenges exist when designing an in-storage computing- based architecture for applications. These include what computing functions need to be offloaded, how to design the protocol between in-storage modules and application servers, and how to deal with the caching issue in application servers. HBase is an important and widely used distributed Key-Value Store. It stores and indexes key-value pairs in large files in a storage system like HDFS. However, its performance especially read performance, is impacted by the heavy traffics between HBase RegionServers and storage servers in the compute-storage disaggregated infrastructure when the available network bandwidth is limited. We propose an In- Storage-based HBase architecture, called IS-HBase, to improve the overall performance and to address the aforementioned challenges. First, IS-HBase executes a data pre-processing module (In-Storage ScanNer, called ISSN) for some read queries and returns the requested key-value pairs to RegionServers instead of returning data blocks in HFile. IS-HBase carries out compactions in storage servers to reduce the large amount of data being transmitted through the network and thus the compaction execution time is effectively reduced. Second, a set of new protocols is proposed to address the communication and coordination between HBase RegionServers at computing nodes and ISSNs at storage nodes. Third, a new self-adaptive caching scheme is proposed to better serve the read queries with fewer I/O operations and less network traffic. According to our experiments, the IS-HBase can reduce up to 97% network traffic for read queries and the throughput (queries per second) is significantly less affected by the fluctuation of available network bandwidth. The execution time of compaction in IS-HBase is only about 6.31% – 41.84% of the execution time of legacy HBase. 
In general, IS-HBase demonstrates the potential of adopting in-storage computing for other data-intensive distributed applications to significantly improve performance in compute-storage disaggregated infrastructure.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128000871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, Dong Dai, Yong Chen, J. Cook
ACM Transactions on Storage (TOS), 2022-03-29. DOI: 10.1145/3483447
Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, in this article we study the failure recovery and logging mechanisms of PFSs. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematically examining PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze in depth the root causes of the observed abnormal symptoms, which has led to a new patch set being merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
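As a rough illustration of the black-box fault-injection idea, the sketch below applies two hypothetical fault models (a whole-device failure and a global-inconsistency fault that silently corrupts a fraction of a node's objects) to an emulated cluster, one node at a time, and records what a caller-supplied checker (standing in for an fsck-like tool plus follow-up workloads) reports. The fault-model names, the inject_fault/run_experiment functions, and the object-store representation are placeholders, not PFault's real interface.

```python
import random

# Hypothetical fault models in the spirit of PFault's pre-defined models:
# a whole-device failure loses everything on one node; a global-inconsistency
# fault corrupts a subset of that node's objects while the device stays up.
FAULT_MODELS = ("whole_device_failure", "global_inconsistency")

def inject_fault(node_state, model, corrupt_ratio=0.1, seed=0):
    """Return a copy of a node's object store with the fault model applied."""
    rng = random.Random(seed)
    if model == "whole_device_failure":
        return {}  # the node's backing device disappears entirely
    if model == "global_inconsistency":
        faulty = dict(node_state)
        for obj in rng.sample(sorted(faulty), max(1, int(len(faulty) * corrupt_ratio))):
            faulty[obj] = b"\x00" * len(faulty[obj])  # silently corrupted content
        return faulty
    raise ValueError(f"unknown fault model: {model}")

def run_experiment(cluster, model, checker):
    """Inject the fault on each node in turn and record how the checker
    (e.g., an fsck-like tool plus follow-up workloads) behaves."""
    results = {}
    for name, state in cluster.items():
        degraded = dict(cluster)
        degraded[name] = inject_fault(state, model)
        results[name] = checker(degraded)
    return results

if __name__ == "__main__":
    cluster = {f"storage-node-{i}": {f"obj{j}": b"data" for j in range(100)}
               for i in range(3)}
    verdicts = run_experiment(cluster, "global_inconsistency",
                              checker=lambda c: "needs repair"
                              if any(v == b"\x00" * 4 for node in c.values()
                                     for v in node.values()) else "clean")
    print(verdicts)
```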
{"title":"A Study of Failure Recovery and Logging of High-Performance Parallel File Systems","authors":"Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, Dong Dai, Yong Chen, J. Cook","doi":"10.1145/3483447","DOIUrl":"https://doi.org/10.1145/3483447","url":null,"abstract":"Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables examining the PFS behavior under fault systematically. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the abnormal symptoms observed in depth, which has led to a new patch set to be merged into the coming Lustre release. In addition, we characterize the extensive logs generated in the experiments in detail and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125285041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Stream Identification to Improve Flash Endurance in Data Centers
J. Bhimani, Zhengyu Yang, Jingpei Yang, Adnan Maruf, N. Mi, R. Pandurangan, Changho Choi, V. Balakrishnan
ACM Transactions on Storage (TOS), 2022-03-29. DOI: 10.1145/3470007
The demand for high-performance I/O in Storage-as-a-Service (SaaS) is increasing day by day. To address this demand, NAND flash-based solid-state drives (SSDs) are commonly used in data centers as cache or top tiers in the storage rack owing to their superior performance compared to traditional hard disk drives (HDDs). Meanwhile, with the capital expenditure of SSDs declining and the storage capacity of SSDs increasing, all-flash data centers are evolving to serve cloud services better than SSD-HDD hybrid data centers. During this transition, the biggest challenge is how to reduce the Write Amplification Factor (WAF) and improve the endurance of SSDs, since these devices have a limited number of program/erase cycles. A specific case is that storing data with different lifetimes (i.e., I/O streams with similar temporal fetching patterns such as reaccess frequency) in a single SSD can cause a high WAF, reduce endurance, and degrade SSD performance. Motivated by this, multi-stream SSDs have been developed to enable data with different lifetimes to be stored in different SSD regions. The logic behind this is to reduce the internal movement of data: when garbage collection is triggered, there is a high chance that data blocks contain pages that are either all invalid or all valid. However, the limitation of this technology is that the system needs to manually assign the same streamID to data with a similar lifetime. Unfortunately, when data arrives, it is not known how important this data is and how long it will stay unmodified. Moreover, according to our observation, with different definitions of lifetime (i.e., different calculation formulas based on features previously exhibited by the data, such as sequentiality and frequency), streamID identification may have varying impacts on the final WAF of multi-stream SSDs. Thus, in this article, we first develop a portable and adaptable framework to study the impacts of different workload features and their combinations on write amplification. We then propose a feature-based stream identification approach, which automatically correlates measurable workload attributes (such as I/O size and I/O rate) with high-level workload features (such as frequency and sequentiality) and determines the right combination of workload features for assigning streamIDs. Finally, we develop an adaptable stream assignment technique to assign streamIDs dynamically for changing workloads. Our evaluation results show that our automated approach to stream detection and separation can effectively reduce the WAF by using appropriate features for stream assignment with minimal implementation overhead.
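A minimal sketch of the feature-based idea follows: each write's measurable attributes (LBA and size) are turned into higher-level features (sequentiality, update frequency), combined into an estimated-lifetime score, and bucketed into a streamID. The feature formulas, weights, and bucket edges are illustrative assumptions, not the tuned values or the exact feature set used in the article.

```python
from collections import defaultdict

class StreamAssigner:
    """Hypothetical sketch of feature-based stream assignment: measurable
    attributes of each write (LBA, size) are turned into higher-level
    features (sequentiality, update frequency), which are combined into a
    lifetime score that is bucketed into a streamID."""

    def __init__(self, num_streams=4):
        self.num_streams = num_streams
        self.write_count = defaultdict(int)   # updates seen per LBA
        self.last_lba_end = None              # end of the previous write

    def assign(self, lba, size):
        # Feature 1: sequentiality -- does this write continue the last one?
        sequential = 1.0 if self.last_lba_end == lba else 0.0
        # Feature 2: frequency -- how often has this LBA been rewritten?
        self.write_count[lba] += 1
        frequency = min(self.write_count[lba] / 10.0, 1.0)
        self.last_lba_end = lba + size
        # Combine features into a short-lifetime score: hot, random data is
        # expected to die soon; cold, sequential data to live long.
        score = 0.6 * frequency + 0.4 * (1.0 - sequential)
        # Bucket the score into one of num_streams stream IDs.
        return min(int(score * self.num_streams), self.num_streams - 1)

if __name__ == "__main__":
    assigner = StreamAssigner()
    workload = [(0, 8), (8, 8), (16, 8), (0, 8), (0, 8), (4096, 8)]
    for lba, size in workload:
        print(f"write lba={lba:5d} -> streamID {assigner.assign(lba, size)}")
```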
{"title":"Automatic Stream Identification to Improve Flash Endurance in Data Centers","authors":"J. Bhimani, Zhengyu Yang, Jingpei Yang, Adnan Maruf, N. Mi, R. Pandurangan, Changho Choi, V. Balakrishnan","doi":"10.1145/3470007","DOIUrl":"https://doi.org/10.1145/3470007","url":null,"abstract":"The demand for high performance I/O in Storage-as-a-Service (SaaS) is increasing day by day. To address this demand, NAND Flash-based Solid-state Drives (SSDs) are commonly used in data centers as cache- or top-tiers in the storage rack ascribe to their superior performance compared to traditional hard disk drives (HDDs). Meanwhile, with the capital expenditure of SSDs declining and the storage capacity of SSDs increasing, all-flash data centers are evolving to serve cloud services better than SSD-HDD hybrid data centers. During this transition, the biggest challenge is how to reduce the Write Amplification Factor (WAF) as well as to improve the endurance of SSD since this device has a limited program/erase cycles. A specified case is that storing data with different lifetimes (i.e., I/O streams with similar temporal fetching patterns such as reaccess frequency) in one single SSD can cause high WAF, reduce the endurance, and downgrade the performance of SSDs. Motivated by this, multi-stream SSDs have been developed to enable data with a different lifetime to be stored in different SSD regions. The logic behind this is to reduce the internal movement of data—when garbage collection is triggered, there are high chances of having data blocks with either all the pages being invalid or valid. However, the limitation of this technology is that the system needs to manually assign the same streamID to data with a similar lifetime. Unfortunately, when data arrives, it is not known how important this data is and how long this data will stay unmodified. Moreover, according to our observation, with different definitions of a lifetime (i.e., different calculation formulas based on selected features previously exhibited by data, such as sequentiality, and frequency), streamID identification may have varying impacts on the final WAF of multi-stream SSDs. Thus, in this article, we first develop a portable and adaptable framework to study the impacts of different workload features and their combinations on write amplification. We then propose a feature-based stream identification approach, which automatically co-relates the measurable workload attributes (such as I/O size, I/O rate, and so on.) with high-level workload features (such as frequency, sequentiality, and so on.) and determines a right combination of workload features for assigning streamIDs. Finally, we develop an adaptable stream assignment technique to assign streamID for changing workloads dynamically. 
Our evaluation results show that our automation approach of stream detection and separation can effectively reduce the WAF by using appropriate features for stream assignment with minimal implementation overhead.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"12 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132562513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterization Summary of Performance, Reliability, and Threshold Voltage Distribution of 3D Charge-Trap NAND Flash Memory
Weihua Liu, Fei Wu, Xiang Chen, Meng Zhang, Yu Wang, Xiangfeng Lu, Changsheng Xie
ACM Transactions on Storage (TOS), 2022-03-10. DOI: 10.1145/3491230
Solid-state drives (SSDs) are gradually coming to dominate high-performance storage scenarios. Three-dimensional (3D) NAND flash memory, with its high storage capacity, is becoming a mainstream storage component of SSDs. However, the interferences in the new 3D charge-trap (CT) NAND flash are becoming unprecedentedly complicated, leading to many reliability and performance problems. Alleviating these problems requires a deep understanding of the characteristics of 3D CT NAND flash memory. To facilitate such understanding, in this article, we delve into characterizing the performance, reliability, and threshold voltage (Vth) distribution of 3D CT NAND flash memory. We summarize these characteristics under multiple interferences and variations and give several new insights and a characterization methodology. In particular, we characterize the skewed Vth distribution, Vth shift laws, and the layer variation exclusive to 3D NAND flash memory. This characterization is the backbone of designing more reliable and efficient flash-based storage solutions.
{"title":"Characterization Summary of Performance, Reliability, and Threshold Voltage Distribution of 3D Charge-Trap NAND Flash Memory","authors":"Weihua Liu, Fei Wu, Xiang Chen, Meng Zhang, Yu Wang, Xiangfeng Lu, Changsheng Xie","doi":"10.1145/3491230","DOIUrl":"https://doi.org/10.1145/3491230","url":null,"abstract":"Solid-state drive (SSD) gradually dominates in the high-performance storage scenarios. Three-dimension (3D) NAND flash memory owning high-storage capacity is becoming a mainstream storage component of SSD. However, the interferences of the new 3D charge-trap (CT) NAND flash are getting unprecedentedly complicated, yielding to many problems regarding reliability and performance. Alleviating these problems needs to understand the characteristics of 3D CT NAND flash memory deeply. To facilitate such understanding, in this article, we delve into characterizing the performance, reliability, and threshold voltage (Vth) distribution of 3D CT NAND flash memory. We make a summary of these characteristics with multiple interferences and variations and give several new insights and a characterization methodology. Especially, we characterize the skewed (Vth) distribution, (Vth) shift laws, and the exclusive layer variation in 3D NAND flash memory. The characterization is the backbone of designing more reliable and efficient flash-based storage solutions.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126127621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HintStor: A Framework to Study I/O Hints in Heterogeneous Storage
Xiongzi Ge, Zhichao Cao, D. Du, P. Ganesan, Dennis Hahn
ACM Transactions on Storage (TOS), 2022-03-10. DOI: 10.1145/3489143
To bridge the giant semantic gap between applications and modern storage systems, passing small pieces of useful information, called I/O access hints, from upper layers to the storage layer may greatly improve application performance and ease data management in storage systems. This is especially true for heterogeneous storage systems that consist of multiple types of storage devices. Since ingesting external access hints will likely involve laborious modifications of legacy I/O stacks, it is very hard to evaluate the effect of access hints and take advantage of them. In this article, we design a generic and flexible framework, called HintStor, to quickly experiment with a set of I/O access hints and evaluate their impacts on heterogeneous storage systems. HintStor provides a new application/user-level interface and a file system plugin, and it performs data management with a generic block storage data manager. We demonstrate the flexibility of HintStor by evaluating four types of access hints: file system data classification, stream ID, cloud prefetch, and I/O task scheduling on a Linux platform. The results show that HintStor can execute and evaluate various I/O access hints under different scenarios with minor modifications to the kernel and applications.
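As a rough sketch of how an access hint might flow from an upper layer down to a block-level data manager in a heterogeneous array, the code below maps hypothetical hot/cold/prefetch hints to SSD or HDD placement decisions. The Hint and BlockDataManager names and the hint semantics are illustrative assumptions, not HintStor's actual interface.

```python
from enum import Enum

class Hint(Enum):
    """Illustrative access hints in the spirit of those HintStor evaluates."""
    HOT = "hot"            # frequently accessed; prefer the fast tier
    COLD = "cold"          # rarely accessed; eligible for the slow tier
    PREFETCH = "prefetch"  # will be read soon; stage it proactively

class BlockDataManager:
    """Hypothetical block-level data manager for a heterogeneous array:
    hints passed from upper layers steer extents to an SSD or HDD tier."""

    def __init__(self):
        self.placement = {}  # extent id -> device tier

    def submit_hint(self, extent, hint):
        if hint in (Hint.HOT, Hint.PREFETCH):
            self.placement[extent] = "ssd"
        elif hint == Hint.COLD:
            self.placement[extent] = "hdd"

    def tier_of(self, extent):
        return self.placement.get(extent, "hdd")  # default to the slow tier

if __name__ == "__main__":
    mgr = BlockDataManager()
    mgr.submit_hint(extent=42, hint=Hint.HOT)    # e.g., data classified as hot metadata
    mgr.submit_hint(extent=43, hint=Hint.COLD)   # e.g., a cold backup stream
    print(mgr.tier_of(42), mgr.tier_of(43))
```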
{"title":"HintStor: A Framework to Study I/O Hints in Heterogeneous Storage","authors":"Xiongzi Ge, Zhichao Cao, D. Du, P. Ganesan, Dennis Hahn","doi":"10.1145/3489143","DOIUrl":"https://doi.org/10.1145/3489143","url":null,"abstract":"To bridge the giant semantic gap between applications and modern storage systems, passing a piece of tiny and useful information, called I/O access hints, from upper layers to the storage layer may greatly improve application performance and ease data management in storage systems. This is especially true for heterogeneous storage systems that consist of multiple types of storage devices. Since ingesting external access hints will likely involve laborious modifications of legacy I/O stacks, it is very hard to evaluate the effect and take advantages of access hints. In this article, we design a generic and flexible framework, called HintStor, to quickly play with a set of I/O access hints and evaluate their impacts on heterogeneous storage systems. HintStor provides a new application/user-level interface, a file system plugin, and performs data management with a generic block storage data manager. We demonstrate the flexibility of HintStor by evaluating four types of access hints: file system data classification, stream ID, cloud prefetch, and I/O task scheduling on a Linux platform. The results show that HintStor can execute and evaluate various I/O access hints under different scenarios with minor modifications to the kernel and applications.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132985625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power-optimized Deployment of Key-value Stores Using Storage Class Memory
H. Kassa, Jason B. Akers, Mrinmoy Ghosh, Zhichao Cao, V. Gogte, R. Dreslinski
ACM Transactions on Storage (TOS), 2022-03-10. DOI: 10.1145/3511905
High-performance flash-based key-value stores in data centers utilize large amounts of DRAM to cache hot data. However, motivated by the high cost and power consumption of DRAM, server designs with a lower DRAM-per-compute ratio are becoming popular. These low-cost servers enable scale-out services by reducing server workload densities. This results in improvements to overall service reliability, leading to a decrease in the total cost of ownership (TCO) for scalable workloads. Nevertheless, for key-value stores with large memory footprints, these reduced-DRAM servers degrade performance due to an increase in both I/O utilization and data access latency. In this scenario, a standard practice to improve performance for sharded databases is to reduce the number of shards per machine, which erodes the TCO benefits of reduced-DRAM low-cost servers. In this work, we explore a practical solution to improve performance and reduce the cost and power consumption of key-value stores running on DRAM-constrained servers by using Storage Class Memories (SCM). SCMs in a DIMM form factor, although slower than DRAM, are sufficiently faster than flash when serving as a large extension to DRAM. With new technologies like Compute Express Link, we can expand the memory capacity of servers with high-bandwidth, low-latency connectivity to SCM. In this article, we use Intel Optane PMem 100 Series SCMs (DCPMM) in AppDirect mode to extend the available memory of our existing single-socket platform deployment of RocksDB (one of the largest key-value stores at Meta). We first designed a hybrid cache in RocksDB to harness both DRAM and SCM hierarchically. We then characterized the performance of the hybrid cache for three of the largest RocksDB use cases at Meta (ChatApp, BLOB Metadata, and Hive Cache). Our results demonstrate that we can achieve up to 80% improvement in throughput and 20% improvement in P95 latency over the existing small-DRAM single-socket platform, while maintaining a 43–48% cost improvement over our large-DRAM dual-socket platform. To the best of our knowledge, this is the first study of the DCPMM platform in a commercial data center.
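A minimal sketch of the hierarchical idea behind such a hybrid cache is shown below: a small DRAM tier holds the hottest blocks, blocks evicted from DRAM are demoted into a larger SCM tier rather than dropped, and SCM hits are promoted back to DRAM. This is an illustrative LRU-based sketch, not the hybrid cache implemented in RocksDB; capacities, admission policy, and all persistence details are simplified assumptions.

```python
from collections import OrderedDict

class HybridCache:
    """Minimal sketch of a hierarchical DRAM + SCM block cache: hot blocks
    sit in a small DRAM tier; blocks evicted from DRAM are demoted into a
    larger SCM tier instead of being dropped, and SCM hits are promoted
    back to DRAM.  Capacities are in block counts for simplicity."""

    def __init__(self, dram_blocks, scm_blocks):
        self.dram = OrderedDict()
        self.scm = OrderedDict()
        self.dram_cap = dram_blocks
        self.scm_cap = scm_blocks

    def _insert_dram(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        while len(self.dram) > self.dram_cap:
            victim, vval = self.dram.popitem(last=False)  # evict LRU from DRAM
            self.scm[victim] = vval                       # demote to SCM
            self.scm.move_to_end(victim)
            while len(self.scm) > self.scm_cap:
                self.scm.popitem(last=False)              # SCM is the last level

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.scm:
            value = self.scm.pop(key)                     # promote on SCM hit
            self._insert_dram(key, value)
            return value
        return None                                       # miss: caller reads flash

    def put(self, key, value):
        self.scm.pop(key, None)
        self._insert_dram(key, value)

if __name__ == "__main__":
    cache = HybridCache(dram_blocks=2, scm_blocks=4)
    for blk in ["a", "b", "c", "d"]:
        cache.put(blk, blk.encode())
    print(sorted(cache.dram), sorted(cache.scm))  # hottest two in DRAM, rest in SCM
    cache.get("a")                                # promotes "a" back into DRAM
    print(sorted(cache.dram), sorted(cache.scm))
```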
{"title":"Power-optimized Deployment of Key-value Stores Using Storage Class Memory","authors":"H. Kassa, Jason B. Akers, Mrinmoy Ghosh, Zhichao Cao, V. Gogte, R. Dreslinski","doi":"10.1145/3511905","DOIUrl":"https://doi.org/10.1145/3511905","url":null,"abstract":"High-performance flash-based key-value stores in data-centers utilize large amounts of DRAM to cache hot data. However, motivated by the high cost and power consumption of DRAM, server designs with lower DRAM-per-compute ratio are becoming popular. These low-cost servers enable scale-out services by reducing server workload densities. This results in improvements to overall service reliability, leading to a decrease in the total cost of ownership (TCO) for scalable workloads. Nevertheless, for key-value stores with large memory footprints, these reduced DRAM servers degrade performance due to an increase in both IO utilization and data access latency. In this scenario, a standard practice to improve performance for sharded databases is to reduce the number of shards per machine, which degrades the TCO benefits of reduced DRAM low-cost servers. In this work, we explore a practical solution to improve performance and reduce the costs and power consumption of key-value stores running on DRAM-constrained servers by using Storage Class Memories (SCM). SCMs in a DIMM form factor, although slower than DRAM, are sufficiently faster than flash when serving as a large extension to DRAM. With new technologies like Compute Express Link, we can expand the memory capacity of servers with high bandwidth and low latency connectivity with SCM. In this article, we use Intel Optane PMem 100 Series SCMs (DCPMM) in AppDirect mode to extend the available memory of our existing single-socket platform deployment of RocksDB (one of the largest key-value stores at Meta). We first designed a hybrid cache in RocksDB to harness both DRAM and SCM hierarchically. We then characterized the performance of the hybrid cache for three of the largest RocksDB use cases at Meta (ChatApp, BLOB Metadata, and Hive Cache). Our results demonstrate that we can achieve up to 80% improvement in throughput and 20% improvement in P95 latency over the existing small DRAM single-socket platform, while maintaining a 43–48% cost improvement over our large DRAM dual-socket platform. To the best of our knowledge, this is the first study of the DCPMM platform in a commercial data center.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115624454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of Distributed File System Design Choices
P. Macko, Jason Hennessey
ACM Transactions on Storage (TOS), 2022-02-28. DOI: 10.1145/3465405
Decades of research on distributed file systems and storage systems exist. New researchers and engineers have a lot of literature to study, but only a comparatively small number of high-level design choices are available when creating a distributed file system. And within each aspect of the system, typically several common approaches are used. So, rather than surveying distributed file systems, this article presents a survey of important design decisions and, within those decisions, the most commonly used options. It also presents a qualitative exploration of their tradeoffs. We also include several relatively recent designs and their variations that illustrate other, underexplored tradeoff choices in the design space. In doing so, we provide a primer on distributed file systems, and we also show areas that are overexplored and underexplored, in the hopes of inspiring new research.
{"title":"Survey of Distributed File System Design Choices","authors":"P. Macko, Jason Hennessey","doi":"10.1145/3465405","DOIUrl":"https://doi.org/10.1145/3465405","url":null,"abstract":"Decades of research on distributed file systems and storage systems exists. New researchers and engineers have a lot of literature to study, but only a comparatively small number of high-level design choices are available when creating a distributed file system. And within each aspect of the system, typically several common approaches are used. So, rather than surveying distributed file systems, this article presents a survey of important design decisions and, within those decisions, the most commonly used options. It also presents a qualitative exploration of their tradeoffs. We include several relatively recent designs and their variations that illustrate other tradeoff choices in the design space, despite being underexplored. In doing so, we provide a primer on distributed file systems, and we also show areas that are overexplored and underexplored, in the hopes of inspiring new research.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114517568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Storage Performance with Calibrated Interrupts
Amy Tai, I. Smolyar, M. Wei, Dan Tsafrir
ACM Transactions on Storage (TOS), 2022-02-28. DOI: 10.1145/3505139
After request completion, an I/O device must decide whether to minimize latency by immediately firing an interrupt or to optimize for throughput by delaying the interrupt, anticipating that more requests will complete soon and help amortize the interrupt cost. Devices employ adaptive interrupt coalescing heuristics that try to balance between these opposing goals. Unfortunately, because devices lack the semantic information about which I/O requests are latency-sensitive, these heuristics can sometimes lead to disastrous results. Instead, we propose addressing the root cause of the heuristics problem by allowing software to explicitly specify to the device whether submitted requests are latency-sensitive. The device then “calibrates” its interrupts to completions of latency-sensitive requests. We focus on NVMe storage devices and show that it is natural to express these semantics in the kernel and the application, and that doing so requires only a modest two-bit change to the device interface. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency when interrupts are coalesced.
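The decision the device makes on each completion, fire an interrupt now or keep coalescing, conditioned on whether software marked the request as latency-sensitive, can be sketched as follows. The Completion/CalibratedCoalescer names, the urgent flag, and the batch threshold are illustrative assumptions rather than the paper's NVMe interface bits or real device parameters.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Completion:
    request_id: int
    urgent: bool  # set by software when the request is latency-sensitive

@dataclass
class CalibratedCoalescer:
    """Sketch of interrupt calibration: completions of requests flagged as
    latency-sensitive fire an interrupt immediately; other completions are
    coalesced until a count threshold is reached (a timeout would also
    flush in practice).  Thresholds here are illustrative only."""
    max_batch: int = 8
    pending: List[Completion] = field(default_factory=list)

    def on_completion(self, completion):
        self.pending.append(completion)
        if completion.urgent or len(self.pending) >= self.max_batch:
            return self._fire()
        return []  # keep coalescing

    def _fire(self):
        batch, self.pending = self.pending, []
        return batch  # one interrupt delivers this whole batch to the host

if __name__ == "__main__":
    dev = CalibratedCoalescer(max_batch=4)
    # Background writes coalesce; a latency-sensitive read flushes immediately.
    for i in range(3):
        assert dev.on_completion(Completion(i, urgent=False)) == []
    delivered = dev.on_completion(Completion(99, urgent=True))
    print([c.request_id for c in delivered])  # -> [0, 1, 2, 99]
```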
{"title":"Optimizing Storage Performance with Calibrated Interrupts","authors":"Amy Tai, I. Smolyar, M. Wei, Dan Tsafrir","doi":"10.1145/3505139","DOIUrl":"https://doi.org/10.1145/3505139","url":null,"abstract":"After request completion, an I/O device must decide whether to minimize latency by immediately firing an interrupt or to optimize for throughput by delaying the interrupt, anticipating that more requests will complete soon and help amortize the interrupt cost. Devices employ adaptive interrupt coalescing heuristics that try to balance between these opposing goals. Unfortunately, because devices lack the semantic information about which I/O requests are latency-sensitive, these heuristics can sometimes lead to disastrous results. Instead, we propose addressing the root cause of the heuristics problem by allowing software to explicitly specify to the device if submitted requests are latency-sensitive. The device then “calibrates” its interrupts to completions of latency-sensitive requests. We focus on NVMe storage devices and show that it is natural to express these semantics in the kernel and the application and only requires a modest two-bit change to the device interface. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency when interrupts are coalesced.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"218 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122514319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the Special Section on USENIX OSDI 2021
Angela Demke Brown, Jacob R. Lorch
ACM Transactions on Storage (TOS), 2022-01-29. DOI: 10.1145/3507950
This special section of the ACM Transactions on Storage presents some of the highlights of the storage-related papers published in the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’21). The OSDI Symposium emphasizes innovative research as well as quantified or insightful experiences in systems design and implementation. Despite OSDI’s broad view of the systems area, the design and implementation of storage systems have always been important topics for OSDI. In particular, out of the 165 OSDI ’21 submissions, 26 (16%) addressed various storage-related aspects; out of its 31 accepted papers, 6 (19%) addressed storage-related themes, constituting a significant part of the OSDI ’21 program. From these, for this special section of ACM Transactions on Storage, we selected two high-quality papers. Each includes some additional material, which has been reviewed (in fast-track mode) by a subset of its original OSDI ’21 reviewers. The first article is “Nap: Persistent Memory Indexes for NUMA Architectures” by Qing Wang, Youyou Lu, Junru Li, and Jiwu Shu. This is an expanded version of the OSDI ’21 paper “NAP: A Black-Box Approach to NUMA-Aware Persistent Memory Indexes.” It introduces a NUMA-aware layer above existing persistent memory indexes, consisting of a volatile DRAM component and persistent, crash-consistent per-NUMA-node components. Reads and writes to hot items are handled via the NUMA-aware layer, alleviating the performance issues with cross-node access to persistent memory. The second article is “Optimizing Storage Performance with Calibrated Interrupts” by Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. This paper presents a new interface to help devices make the tradeoff between triggering an interrupt immediately after an I/O request completes, reducing latency, and waiting until multiple interrupts can be coalesced, reducing interrupt-handling overhead. This interface allows applications to inform devices about latency-sensitive requests, allowing interrupt generation to be aligned with application requirements. We hope you enjoy these expanded versions and find both papers interesting and insightful.
{"title":"Introduction to the Special Section on USENIX OSDI 2021","authors":"Angela Demke Brown, Jacob R. Lorch","doi":"10.1145/3507950","DOIUrl":"https://doi.org/10.1145/3507950","url":null,"abstract":"This special section of the ACM Transactions on Storage presents some of the highlights of the storage-related papers published in the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’21). The OSDI Symposium emphasizes innovative research as well as quantified or insightful experiences in systems design and implementation. Despite OSDI ’s broad view of the systems area, the design and implementation of storage systems have always been important topics for OSDI. In particular, out of the 165 OSDI ’21 submissions, 26 of them (=16%) addressed various storage-related aspects; out of its 31 accepted papers, 6 (=19%) addressed storage-related themes, constituting a significant part of the OSDI ’21 program. Out of the above, for this special section of ACM Transactions on Storage, we selected two high-quality papers. Each includes some additional material, which has been reviewed (in fasttrack mode) by a subset of its original OSDI ’21 reviewers. The first article is “Nap: Persistent Memory Indexes for NUMA Architectures” by Qing Wang, Youyou Lu, Junru Li, and Jiwu Shu. This is an expanded version of the OSDI ’21 paper “NAP: A Black-Box Approach to NUMA-Aware Persistent Memory Indexes.” It introduces a NUMA-aware layer above existing persistent memory indexes, consisting of a volatile DRAM component and persistent, crash-consistent per-NUMA-node components. Reads and writes to hot items are handled via the NUMA-aware layer, alleviating the performance issues with cross-node access to persistent memory. The second article is “Optimizing Storage Performance with Calibrated Interrupts” by Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. This paper presents a new interface to help devices make the tradeoff between triggering an interrupt immediately after an I/O request completes, reducing latency, and waiting until multiple interrupts can be coalesced, reducing interrupt-handling overhead. This interface allows applications to inform devices about latency-sensitive requests, allowing interrupt generation to be aligned with application requirements. We hope you enjoy these expanded versions and find both papers interesting and insightful.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124029644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nap: Persistent Memory Indexes for NUMA Architectures
Qing Wang, Youyou Lu, Junru Li, Minhui Xie, J. Shu
ACM Transactions on Storage (TOS), 2022-01-29. DOI: 10.1145/3507922
We present Nap, a black-box approach that converts concurrent persistent memory (PM) indexes into non-uniform memory access (NUMA)-aware counterparts. Based on the observation that real-world workloads always feature skewed access patterns, Nap introduces a NUMA-aware layer (NAL) on top of existing concurrent PM indexes and steers accesses to hot items to this layer. The NAL maintains (1) per-node partial views in PM for serving insert/update/delete operations with failure atomicity and (2) a global view in DRAM for serving lookup operations. The NAL eliminates remote PM accesses to hot items without inducing extra local PM accesses. Moreover, to handle dynamic workloads, Nap adopts a fast NAL switch mechanism. We convert five state-of-the-art PM indexes using Nap. Evaluation on a four-node machine with Optane DC Persistent Memory shows that Nap can improve the throughput by up to 2.3× and 1.56× under write-intensive and read-intensive workloads, respectively.
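A black-box wrapper in the spirit of the NAL can be sketched as below: hot keys are served by the NUMA-aware layer (lookups from a DRAM global view, updates into the calling node's partial view), while cold keys fall through to the wrapped index. The NapWrapper class, the static hot set, and the plain-dict "PM" views are illustrative assumptions; failure atomicity, the NAL switch, and concurrency control are omitted.

```python
class NapWrapper:
    """Black-box sketch of Nap's NUMA-aware layer (NAL) over an existing
    index.  Hot keys are served by the NAL: lookups hit a DRAM global view,
    and updates go to the per-NUMA-node partial view of the calling node.
    Cold keys fall through to the wrapped PM index.  Crash consistency and
    the NAL switch for shifting hot sets are omitted in this sketch."""

    def __init__(self, pm_index, hot_keys, num_nodes):
        self.pm_index = pm_index                       # existing concurrent PM index
        self.hot_keys = set(hot_keys)                  # current hot set
        self.global_view = {}                          # DRAM: serves lookups
        self.node_views = [dict() for _ in range(num_nodes)]  # "PM": per-node writes

    def put(self, key, value, numa_node):
        if key in self.hot_keys:
            self.node_views[numa_node][key] = value    # local-node PM write only
            self.global_view[key] = value              # keep the DRAM view current
        else:
            self.pm_index[key] = value                 # cold path: underlying index

    def get(self, key):
        if key in self.hot_keys:
            return self.global_view.get(key)           # no cross-node PM access
        return self.pm_index.get(key)

if __name__ == "__main__":
    underlying = {}                                    # stand-in for a real PM index
    nap = NapWrapper(underlying, hot_keys={"user:1"}, num_nodes=4)
    nap.put("user:1", "hot-value", numa_node=2)        # absorbed by the NAL
    nap.put("user:9", "cold-value", numa_node=0)       # goes to the wrapped index
    print(nap.get("user:1"), nap.get("user:9"))
```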
{"title":"Nap: Persistent Memory Indexes for NUMA Architectures","authors":"Qing Wang, Youyou Lu, Junru Li, Minhui Xie, J. Shu","doi":"10.1145/3507922","DOIUrl":"https://doi.org/10.1145/3507922","url":null,"abstract":"We present Nap, a black-box approach that converts concurrent persistent memory (PM) indexes into non-uniform memory access (NUMA)-aware counterparts. Based on the observation that real-world workloads always feature skewed access patterns, Nap introduces a NUMA-aware layer (NAL) on the top of existing concurrent PM indexes, and steers accesses to hot items to this layer. The NAL maintains (1) per-node partial views in PM for serving insert/update/delete operations with failure atomicity and (2) a global view in DRAM for serving lookup operations. The NAL eliminates remote PM accesses to hot items without inducing extra local PM accesses. Moreover, to handle dynamic workloads, Nap adopts a fast NAL switch mechanism. We convert five state-of-the-art PM indexes using Nap. Evaluation on a four-node machine with Optane DC Persistent Memory shows that Nap can improve the throughput by up to 2.3× and 1.56× under write-intensive and read-intensive workloads, respectively.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129395638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}