"Lightweight Robust Size Aware Cache Management" by Gil Einziger, Ohad Eytan, R. Friedman, and Ben Manes. ACM Transactions on Storage, 2021-05-18. https://doi.org/10.1145/3507920
Modern key-value stores, object stores, Internet proxy caches, and Content Delivery Networks (CDN) often manage objects of diverse sizes, e.g., blobs, video files of different lengths, images with varying resolutions, and small documents. In such workloads, size-aware cache policies outperform size-oblivious algorithms. Unfortunately, existing size-aware algorithms tend to be overly complicated and computationally expensive. Our work follows a more approachable pattern; we extend the prevalent (size-oblivious) TinyLFU cache admission policy to handle variable-sized items. Implementing our approach inside two popular caching libraries only requires minor changes. We show that our algorithms yield competitive or better hit-ratios and byte hit-ratios compared to the state-of-the-art size-aware algorithms such as AdaptSize, LHD, LRB, and GDSF. Further, a runtime comparison indicates that our implementation is faster by up to 3× compared to the best alternative, i.e., it imposes a much lower CPU overhead.
"Copy-on-Abundant-Write for Nimble File System Clones" by Yang Zhan, Alex Conway, Yizheng Jiao, Nirjhar Mukherjee, Ian Groombridge, M. A. Bender, Martín Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. ACM Transactions on Storage, 2021-01-29. https://doi.org/10.1145/3423495
Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem. This article describes nimble clones in the Bε-tree File System (BetrFS), an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, such as a Bε-tree or a log-structured merge (LSM) tree, can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write. We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; BetrFS performance even improves in a few cases. BetrFS cloning is efficient; for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems that have specialized Linux Containers (LXC) backends by 3--4×.
"Thanking the TOS Associated Editors and Reviewers" by S. Noh. ACM Transactions on Storage, 2021-01-29. https://doi.org/10.1145/3442683
To many of us, 2020 may be a year that we would like to forget. Our lives have been immensely altered by the COVID-19 pandemic, with some of us having suffered the loss of close ones. But life moves on, and as we publish our first issue of ACM Transactions on Storage for 2021, we look for hope and encouragement. In this light, I take this opportunity to express my appreciation to all those who have worked to make ACM TOS the premier journal it is today. In particular, I thank the Associate Editors and the reviewers who have voluntarily devoted their time and effort to serve the community. Our entire community is indebted to these volunteers, who have generously shared their expertise to handle and thoroughly review the articles that were submitted. In the past two years, ACM TOS received the help of over 31 Editorial Board members along with 178 invited reviewers to curate nearly 117 submissions and publish more than 48 articles with meaningful and impactful results. While the Associate Editors and all the reviewers over the past two years are listed on our website, https://tos.acm.org/, I take this opportunity to list our distinguished reviewers, who went out of their way to provide careful, thorough, and timely reviews. These names are based on the recommendations of the Associate Editors, through whom the reviews were solicited. Again, we thank all the reviewers for their dedication and support to ACM TOS and the computer system storage community as a whole. Thank you all.
"Introduction to the Special Section on USENIX FAST 2020" by S. Noh and B. Welch. ACM Transactions on Storage, 2021-01-29. https://doi.org/10.1145/3442685
Every year, the storage and file system community convenes at the USENIX Conference on File and Storage Technologies (FAST) to present and discuss the best of the exciting research activities that are shaping the area. In February of 2020, luckily just before the rampant spread of COVID-19, we were able to do the same for the 18th USENIX Conference on File and Storage Technologies (FAST'20) at Santa Clara, CA. This year, we received 138 exciting papers, out of which 23 papers were selected for publication. As in previous years, the program covered a wide range of topics, from cloud and HPC storage, key-value stores, and flash and non-volatile memory to long-standing traditional topics such as file systems, consistency, and reliability. In this Special Section of the ACM Transactions on Storage, we highlight three high-quality articles that were selected by the program chairs. These select articles are expanded versions of the FAST publications (and were re-reviewed by the original reviewers of the submission) that include material that had to be excluded due to the space limitation of conference papers, allowing for a more comprehensive discussion of the topic. We are confident that you will enjoy these articles even more. The first article is "Reliability of SSDs in Enterprise Storage Systems: A Large-scale Field Study" (titled "A Study of SSD Reliability in Large-scale Enterprise Storage Deployments" in the FAST'20 Proceedings) by Stathis Maneas, Kaveh Mahdaviani, Tim Emami, and Bianca Schroeder. This article was submitted as a Deployed systems paper, and it presents a large-scale field study of 1.6 million NAND-based SSDs deployed at NetApp. This article is the first study of an enterprise storage system, covering a diverse set of SSDs from three different manufacturers, 18 different models, 12 different capacities, and all major flash technologies from SLC to 3D-TLC. The second article is "Strong and Efficient Consistency with Consistency-aware Durability" by Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. This article introduces consistency-aware durability, or CAD, a new approach to durability in distributed storage, and a novel and strong consistency property called cross-client monotonic reads. The authors show that this new consistency property can be satisfied with CAD by shifting the point of durability from writes to reads. Through an implementation study, the authors show that the two notions combined can bring about performance significantly higher than immediately durable and strongly consistent ZooKeeper, even while providing stronger consistency than that adopted by many systems today. The final article is "Copy-on-Abundant-Write for Nimble File System Clones" (originally titled "How to Copy Files") by Yang Zhan, Alex Conway, Yizheng Jiao, Nirjhar Mukherjee, Ian Groombridge, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. This article describes how to clone files and directories in the write-optimized BetrFS file system, an important operation in many real-world applications and workflows. Its key observation is that a write-optimized key-value store, such as a Bε-tree or an LSM-tree, can decouple the logical application of updates from the granularity at which data is physically copied.
"Kreon" by Anastasios Papagiannis, Giorgos Saloustros, Giorgos Xanthakis, Giorgos Kalaentzis, Pilar González-Férez, and A. Bilas. ACM Transactions on Storage, 2021-01-18. https://doi.org/10.1145/3418414
Persistent key-value stores have emerged as a main component in the data access path of modern data processing systems. However, they exhibit high CPU and I/O overhead. Nowadays, due to power limitations, it is important to reduce CPU overheads for data processing. In this article, we propose Kreon, a key-value store that targets servers with flash-based storage, where CPU overhead and I/O amplification are more significant bottlenecks than I/O randomness. We first observe that two significant sources of overhead in key-value stores are: (a) the use of compaction in log-structured merge-trees (LSM-trees), which constantly merges and sorts large data segments, and (b) the use of an I/O cache to access devices, which incurs overhead even for data that resides in memory. To avoid these, Kreon moves data from level to level using partial instead of full data reorganization, enabled by a full index per level. Kreon uses memory-mapped I/O via a custom kernel path to avoid a user-space cache. For a large dataset, Kreon reduces CPU cycles/op by up to 5.8×, reduces I/O amplification for inserts by up to 4.61×, and increases insert ops/s by up to 5.3× compared to RocksDB.
"NVMM-Oriented Hierarchical Persistent Client Caching for Lustre" by Wen Cheng, Chunyan Li, Lingfang Zeng, Y. Qian, Xi Li, and A. Brinkmann. ACM Transactions on Storage, 2021-01-18. https://doi.org/10.1145/3404190
In high-performance computing (HPC), data and metadata are stored on dedicated server nodes, and client applications access the servers' data and metadata through a network, which induces network latencies and resource contention. These server nodes are typically equipped with (slow) magnetic disks, while the client nodes store temporary data on fast SSDs or even on non-volatile main memory (NVMM). Therefore, the full potential of parallel file systems can only be reached if fast client-side storage devices are included in the overall storage architecture. In this article, we propose an NVMM-based hierarchical persistent client cache for the Lustre file system (NVMM-LPCC for short). NVMM-LPCC implements two caching modes: a read-write mode (RW-NVMM-LPCC for short) and a read-only mode (RO-NVMM-LPCC for short). NVMM-LPCC integrates with the Lustre Hierarchical Storage Management (HSM) solution and the Lustre layout lock mechanism to provide consistent persistent caching services for I/O applications running on client nodes, while maintaining a global unified namespace across the entire Lustre file system. The evaluation results presented in this article show that NVMM-LPCC can increase the average read throughput by up to 35.80 times and the average write throughput by up to 9.83 times compared with the native Lustre system, while providing excellent scalability.
"Strong and Efficient Consistency with Consistency-aware Durability" by Aishwarya Ganesan, R. Alagappan, A. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. ACM Transactions on Storage, 2021-01-17. https://doi.org/10.1145/3423138
We introduce consistency-aware durability or Cad, a new approach to durability in distributed storage that enables strong consistency while delivering high performance. We demonstrate the efficacy of this approach by designing cross-client monotonic reads, a novel and strong consistency property that provides monotonic reads across failures and sessions in leader-based systems; such a property can be particularly beneficial in geo-distributed and edge-computing scenarios. We build Orca, a modified version of ZooKeeper that implements Cad and cross-client monotonic reads. We experimentally show that Orca provides strong consistency while closely matching the performance of weakly consistent ZooKeeper. Compared to strongly consistent ZooKeeper, Orca provides significantly higher throughput (1.8--3.3×) and notably reduces latency, sometimes by an order of magnitude in geo-distributed settings. We also implement Cad in Redis and show that the performance benefits are similar to that of Cad’s implementation in ZooKeeper.
"Reliability of SSDs in Enterprise Storage Systems" by Stathis Maneas, K. Mahdaviani, Tim Emami, and Bianca Schroeder. ACM Transactions on Storage, 2021-01-13. https://doi.org/10.1145/3423088
This article presents the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in distributed data center storage systems). The study is based on a very comprehensive set of field data, covering 1.6 million SSDs of a major storage vendor (NetApp). The drives come from three different manufacturers and span 18 different models, 12 different capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allows us to study a large number of factors that were not examined in prior work, including the effect of firmware versions, the reliability of TLC NAND, and the correlations between drives within a RAID system. This article presents our analysis, along with a number of practical implications derived from it.
"SSD-based Workload Characteristics and Their Performance Implications" by G. Yadgar, Moshe Gabel, Shehbaz Jaffer, and Bianca Schroeder. ACM Transactions on Storage, 2021-01-08. https://doi.org/10.1145/3423137
Storage systems are designed and optimized relying on wisdom derived from analysis studies of file-system and block-level workloads. However, while SSDs are becoming a dominant building block in many storage systems, their design continues to build on knowledge derived from analysis targeted at hard disk optimization. Though still valuable, it does not cover important aspects relevant for SSD performance. In a sense, we are “searching under the streetlight,” possibly missing important opportunities for optimizing storage system design. We present the first I/O workload analysis designed with SSDs in mind. We characterize traces from four repositories and examine their “temperature” ranges, sensitivity to page size, and “logical locality.” We then take the first step towards correlating these characteristics with three standard performance metrics: write amplification, read amplification, and flash read costs. Our results show that SSD-specific characteristics strongly affect performance, often in surprising ways.
"TH-DPMS" by J. Shu, Youmin Chen, Qing Wang, Bohong Zhu, Junru Li, and Youyou Lu. ACM Transactions on Storage, 2020-10-01. https://doi.org/10.1145/3412852
The rapid growth of data in recent years requires datacenter infrastructure to store and process data with extremely high throughput and low latency. Fortunately, persistent memory (PM) and RDMA technologies bring new opportunities towards this goal. Both of them are capable of delivering more than 10 GB/s of bandwidth and sub-microsecond latency. However, our past experiences and recent studies show that it is non-trivial to build an efficient distributed storage system with such new hardware. In this article, we design and implement TH-DPMS (TsingHua Distributed Persistent Memory System) based on persistent memory and RDMA, which unifies the memory, file system, and key-value interfaces in a single system. TH-DPMS is designed around a unified distributed persistent memory abstraction, pDSM. pDSM acts as a generic layer that connects the PMs of different storage nodes via a high-speed RDMA network and organizes them into a global shared address space. It provides the fundamental functionalities, including global address management, space management, fault tolerance, and crash consistency guarantees. Applications access pDSM through a set of flexible and easy-to-use APIs, using either the raw read/write interfaces or the transactional ones with ACID guarantees. Based on pDSM, we implement a distributed file system and a key-value store, named pDFS and pDKVS, respectively. Together, they provide TH-DPMS with high-performance, low-latency, and fault-tolerant data storage. We evaluate TH-DPMS with both micro-benchmarks and real-world memory-intensive workloads. Experimental results show that TH-DPMS is capable of delivering an aggregated bandwidth of 120 GB/s with 6 nodes. When processing memory-intensive workloads such as YCSB and Graph500, TH-DPMS improves performance by one order of magnitude compared to existing systems and maintains consistently high efficiency as the workload size grows to multiple terabytes.