"Lightweight Robust Size Aware Cache Management" by Gil Einziger, Ohad Eytan, R. Friedman, and Ben Manes. ACM Transactions on Storage, 2021-05-18. https://doi.org/10.1145/3507920
Modern key-value stores, object stores, Internet proxy caches, and Content Delivery Networks (CDN) often manage objects of diverse sizes, e.g., blobs, video files of different lengths, images with varying resolutions, and small documents. In such workloads, size-aware cache policies outperform size-oblivious algorithms. Unfortunately, existing size-aware algorithms tend to be overly complicated and computationally expensive. Our work follows a more approachable pattern; we extend the prevalent (size-oblivious) TinyLFU cache admission policy to handle variable-sized items. Implementing our approach inside two popular caching libraries only requires minor changes. We show that our algorithms yield competitive or better hit-ratios and byte hit-ratios compared to the state-of-the-art size-aware algorithms such as AdaptSize, LHD, LRB, and GDSF. Further, a runtime comparison indicates that our implementation is faster by up to 3× compared to the best alternative, i.e., it imposes a much lower CPU overhead.
"Copy-on-Abundant-Write for Nimble File System Clones" by Yang Zhan, Alex Conway, Yizheng Jiao, Nirjhar Mukherjee, Ian Groombridge, M. A. Bender, Martín Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. ACM Transactions on Storage, 2021-01-29. https://doi.org/10.1145/3423495
Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem. This article describes nimble clones in the Bε-tree File System (BetrFS), an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, such as a Bε-tree or a log-structured merge (LSM) tree, can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write. We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; BetrFS performance even improves in a few cases. BetrFS cloning is efficient; for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems that have specialized Linux Containers (LXC) backends by 3--4×.
"Thanking the TOS Associated Editors and Reviewers" by S. Noh. ACM Transactions on Storage, 2021-01-29. https://doi.org/10.1145/3442683
To many of us, 2020 may be a year that we would like to forget. Our lives have been immensely altered by the COVID-19 pandemic, with some of us having suffered the loss of close ones. But life moves on, and as we publish our first issue of ACM Transactions on Storage for 2021, we look for hope and encouragement. In this light, I take this opportunity to express my appreciation to all those who have worked to make ACM TOS the premier journal it is today. In particular, I thank the Associate Editors and the reviewers who have voluntarily devoted their time and effort to serve the community. Our entire community is indebted to these volunteers, who have generously shared their expertise to handle and thoroughly review the articles that were submitted. In the past two years, ACM TOS received the help of over 31 Editorial Board members along with 178 invited reviewers to curate nearly 117 submissions and publish more than 48 articles with meaningful and impactful results. While the Associate Editors and all the reviewers over the past two years are listed on our website, https://tos.acm.org/, I take this opportunity to list our distinguished reviewers, who went out of their way to provide careful, thorough, and timely reviews. These names are based on the recommendations of the Associate Editors, through whom the reviews were solicited. Again, we thank all the reviewers for their dedication and support to ACM TOS and the computer system storage community as a whole. Thank you all.
"Introduction to the Special Section on USENIX FAST 2020" by S. Noh and B. Welch. ACM Transactions on Storage, 2021-01-29. https://doi.org/10.1145/3442685
Every year, the storage and file system community convenes at the USENIX Conference on File and Storage Technologies (FAST) to present and discuss the best of the exciting research activities that are shaping the area. In February of 2020, luckily just before the rampant spread of COVID-19, we were able to do the same for the 18th USENIX Conference on File and Storage Technologies (FAST'20) at Santa Clara, CA. This year, we received 138 exciting papers, out of which 23 papers were selected for publication. As in previous years, the program covered a wide range of topics, from cloud and HPC storage, key-value stores, and flash and non-volatile memory to long-standing traditional topics such as file systems, consistency, and reliability. In this Special Section of the ACM Transactions on Storage, we highlight three high-quality articles that were selected by the program chairs. These select articles are expanded versions of the FAST publications (and were re-reviewed by the original reviewers of the submission) that include material that had to be excluded due to the space limitation of conference papers, allowing for a more comprehensive discussion of the topic. We are confident that you will enjoy these articles even more. The first article is "Reliability of SSDs in Enterprise Storage Systems: A Large-scale Field Study" (titled "A Study of SSD Reliability in Large-scale Enterprise Storage Deployments" in the FAST'20 Proceedings) by Stathis Maneas, Kaveh Mahdaviani, Tim Emami, and Bianca Schroeder. This article was submitted as a Deployed systems paper, and it presents a large-scale field study of 1.6 million NAND-based SSDs deployed at NetApp. This article is the first study of an enterprise storage system, covering a diverse set of SSDs from three different manufacturers, 18 different models, 12 different capacities, and all major flash technologies from SLC to 3D-TLC. The second article is "Strong and Efficient Consistency with Consistency-aware Durability" by Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. This article introduces consistency-aware durability, or CAD, a new approach to durability in distributed storage, and a novel and strong consistency property called cross-client monotonic reads. The authors show that this new consistency property can be satisfied with CAD by shifting the point of durability from writes to reads. Through an implementation study, the authors show that the two notions combined can bring about performance significantly higher than immediately durable and strongly consistent ZooKeeper, even while providing stronger consistency than that adopted by many systems today. The final article is "Copy-on-Abundant-Write for Nimble File System Clones" (originally titled "How to Copy Files") by Yang Zhan, Alex Conway, Yizheng Jiao, Nirjhar Mukherjee, Ian Groombridge, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. This article describes how to clone files and directories in the write-optimized BetrFS file system, an important operation in many real-world applications and workflows. Its key observation is that a write-optimized key-value store, such as a Bε-tree or an LSM-tree, can decouple the logical application of updates from the granularity at which data is physically copied.
"Kreon" by Anastasios Papagiannis, Giorgos Saloustros, Giorgos Xanthakis, Giorgos Kalaentzis, Pilar González-Férez, and A. Bilas. ACM Transactions on Storage, 2021-01-18. https://doi.org/10.1145/3418414
Persistent key-value stores have emerged as a main component in the data access path of modern data processing systems. However, they exhibit high CPU and I/O overhead. Nowadays, due to power limitations, it is important to reduce CPU overheads for data processing. In this article, we propose Kreon, a key-value store that targets servers with flash-based storage, where CPU overhead and I/O amplification are more significant bottlenecks than I/O randomness. We first observe that two significant sources of overhead in key-value stores are: (a) the use of compaction in log-structured merge-trees (LSM-trees), which constantly merges and sorts large data segments, and (b) the use of an I/O cache to access devices, which incurs overhead even for data that resides in memory. To avoid these, Kreon moves data from level to level using partial instead of full data reorganization, enabled by a full index per level. Kreon uses memory-mapped I/O via a custom kernel path to avoid a user-space cache. For a large dataset, Kreon reduces CPU cycles/op by up to 5.8×, reduces I/O amplification for inserts by up to 4.61×, and increases insert ops/s by up to 5.3× compared to RocksDB.
"NVMM-Oriented Hierarchical Persistent Client Caching for Lustre" by Wen Cheng, Chunyan Li, Lingfang Zeng, Y. Qian, Xi Li, and A. Brinkmann. ACM Transactions on Storage, 2021-01-18. https://doi.org/10.1145/3404190
In high-performance computing (HPC), data and metadata are stored on dedicated server nodes, and client applications access the servers' data and metadata through a network, which induces network latencies and resource contention. These server nodes are typically equipped with (slow) magnetic disks, while the client nodes store temporary data on fast SSDs or even on non-volatile main memory (NVMM). Therefore, the full potential of parallel file systems can only be reached if fast client-side storage devices are included in the overall storage architecture. In this article, we propose an NVMM-based hierarchical persistent client cache for the Lustre file system (NVMM-LPCC for short). NVMM-LPCC implements two caching modes: a read-write mode (RW-NVMM-LPCC for short) and a read-only mode (RO-NVMM-LPCC for short). NVMM-LPCC integrates with the Lustre Hierarchical Storage Management (HSM) solution and the Lustre layout lock mechanism to provide consistent persistent caching services for I/O applications running on client nodes, while maintaining a global unified namespace across the entire Lustre file system. The evaluation results presented in this article show that NVMM-LPCC can increase the average read throughput by up to 35.80 times and the average write throughput by up to 9.83 times compared with the native Lustre system, while providing excellent scalability.
"Strong and Efficient Consistency with Consistency-aware Durability" by Aishwarya Ganesan, R. Alagappan, A. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. ACM Transactions on Storage, 2021-01-17. https://doi.org/10.1145/3423138
We introduce consistency-aware durability or Cad, a new approach to durability in distributed storage that enables strong consistency while delivering high performance. We demonstrate the efficacy of this approach by designing cross-client monotonic reads, a novel and strong consistency property that provides monotonic reads across failures and sessions in leader-based systems; such a property can be particularly beneficial in geo-distributed and edge-computing scenarios. We build Orca, a modified version of ZooKeeper that implements Cad and cross-client monotonic reads. We experimentally show that Orca provides strong consistency while closely matching the performance of weakly consistent ZooKeeper. Compared to strongly consistent ZooKeeper, Orca provides significantly higher throughput (1.8--3.3×) and notably reduces latency, sometimes by an order of magnitude in geo-distributed settings. We also implement Cad in Redis and show that the performance benefits are similar to that of Cad’s implementation in ZooKeeper.
"Reliability of SSDs in Enterprise Storage Systems" by Stathis Maneas, K. Mahdaviani, Tim Emami, and Bianca Schroeder. ACM Transactions on Storage, 2021-01-13. https://doi.org/10.1145/3423088
This article presents the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in distributed data center storage systems). The study is based on a very comprehensive set of field data, covering 1.6 million SSDs of a major storage vendor (NetApp). The drives come from three different manufacturers and span 18 different models, 12 different capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allows us to study a large number of factors that were not examined in prior work, including the effect of firmware versions, the reliability of TLC NAND, and the correlations between drives within a RAID system. This article presents our analysis, along with a number of practical implications derived from it.
"SSD-based Workload Characteristics and Their Performance Implications" by G. Yadgar, Moshe Gabel, Shehbaz Jaffer, and Bianca Schroeder. ACM Transactions on Storage, 2021-01-08. https://doi.org/10.1145/3423137
Storage systems are designed and optimized relying on wisdom derived from analysis studies of file-system and block-level workloads. However, while SSDs are becoming a dominant building block in many storage systems, their design continues to build on knowledge derived from analysis targeted at hard disk optimization. Though still valuable, it does not cover important aspects relevant for SSD performance. In a sense, we are “searching under the streetlight,” possibly missing important opportunities for optimizing storage system design. We present the first I/O workload analysis designed with SSDs in mind. We characterize traces from four repositories and examine their “temperature” ranges, sensitivity to page size, and “logical locality.” We then take the first step towards correlating these characteristics with three standard performance metrics: write amplification, read amplification, and flash read costs. Our results show that SSD-specific characteristics strongly affect performance, often in surprising ways.
"TH-DPMS" by J. Shu, Youmin Chen, Qing Wang, Bohong Zhu, Junru Li, and Youyou Lu. ACM Transactions on Storage, 2020-10-01. https://doi.org/10.1145/3412852
The rapid growth of data in recent years requires datacenter infrastructure to store and process data with extremely high throughput and low latency. Fortunately, persistent memory (PM) and RDMA technologies bring new opportunities towards this goal. Both of them are capable of delivering more than 10 GB/s of bandwidth and sub-microsecond latency. However, our past experiences and recent studies show that it is non-trivial to build an efficient distributed storage system with such new hardware. In this article, we design and implement TH-DPMS (TsingHua Distributed Persistent Memory System) based on persistent memory and RDMA, which unifies the memory, file system, and key-value interfaces in a single system. TH-DPMS is designed around a unified distributed persistent memory abstraction, pDSM. pDSM acts as a generic layer that connects the PMs of different storage nodes via a high-speed RDMA network and organizes them into a global shared address space. It provides the fundamental functionalities, including global address management, space management, fault tolerance, and crash consistency guarantees. Applications access pDSM through a set of flexible and easy-to-use APIs, using either the raw read/write interfaces or the transactional ones with ACID guarantees. Based on pDSM, we implement a distributed file system and a key-value store, named pDFS and pDKVS, respectively. Together, they provide TH-DPMS with high-performance, low-latency, and fault-tolerant data storage. We evaluate TH-DPMS with both micro-benchmarks and real-world memory-intensive workloads. Experimental results show that TH-DPMS is capable of delivering an aggregated bandwidth of 120 GB/s with 6 nodes. When processing memory-intensive workloads such as YCSB and Graph500, TH-DPMS improves performance by one order of magnitude compared to existing systems and maintains consistently high efficiency as the workload size grows to multiple terabytes.