Jin-yong Choi, E. Nam, Yoon-Jae Seong, Jinhyuk Yoon, Sookwan Lee, Hongseok Kim, Jeongsu Park, Yeong-Jae Woo, Sheayun Lee, S. Min
We present a framework called Hierarchically Interacting Logs (HIL) for constructing Flash Translation Layers (FTLs). The main goal of the HIL framework is to heal the Achilles' heel of FTLs, namely crash recovery (hence its name). Nonetheless, the framework itself is general enough to encompass not only block-mapped and page-mapped FTLs but also many of their variants, including hybrid ones, because of its compositional nature. Crash recovery within the HIL framework proceeds in two phases: structural recovery and functional recovery. During structural recovery, residual effects due to program operations ongoing at the time of the crash are eliminated in an atomic manner using shadow paging. During functional recovery, operations that would have been performed if there had been no crash are replayed in a redo-only fashion. Both phases operate in an idempotent manner, preventing repeated crashes during recovery from causing any additional problems. We demonstrate the practicality of the proposed HIL framework by implementing a prototype and showing that its performance during both normal execution and crash recovery is at least as good as that of state-of-the-art SSDs.
{"title":"HIL","authors":"Jin-yong Choi, E. Nam, Yoon-Jae Seong, Jinhyuk Yoon, Sookwan Lee, Hongseok Kim, Jeongsu Park, Yeong-Jae Woo, Sheayun Lee, S. Min","doi":"10.1145/3281030","DOIUrl":"https://doi.org/10.1145/3281030","url":null,"abstract":"We present a framework called Hierarchically Interacting Logs (HIL) for constructing Flash Translation Layers (FTLs). The main goal of the HIL framework is to heal the Achilles heel —the crash recovery—of FTLs (hence, its name). Nonetheless, the framework itself is general enough to encompass not only block-mapped and page-mapped FTLs but also many of their variants, including hybrid ones, because of its compositional nature. Crash recovery within the HIL framework proceeds in two phases: structural recovery and functional recovery. During the structural recovery, residual effects due to program operations ongoing at the time of the crash are eliminated in an atomic manner using shadow paging. During the functional recovery, operations that would have been performed if there had been no crash are replayed in a redo-only fashion. Both phases operate in an idempotent manner, preventing repeated crashes during recovery from causing any additional problems. We demonstrate the practicality of the proposed HIL framework by implementing a prototype and showing that its performance during normal execution and also during crash recovery is at least as good as those of state-of-the-art SSDs.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128584964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Om Rameshwar Gatla, Muhammad Hameed, Mai Zheng, Viacheslav Dubeyko, A. Manzanares, F. Blagojevic, Cyril Guyot, R. Mateescu
File systems may become corrupted for many reasons despite various protection techniques. Therefore, most file systems come with a checker to recover the file system to a consistent state. However, existing checkers are commonly assumed to be able to complete the repair without interruption, which may not be true in practice. In this work, we demonstrate via fault injection experiments that checkers of widely used file systems (EXT4, XFS, BtrFS, and F2FS) may leave the file system in an uncorrectable state if the repair procedure is interrupted unexpectedly. To address the problem, we first fix the ordering issue in the undo logging of e2fsck and then build a general logging library (i.e., rfsck-lib) for strengthening checkers. To demonstrate the practicality, we integrate rfsck-lib with existing checkers and create two new checkers: rfsck-ext, a robust checker for Ext-family file systems, and rfsck-xfs, a robust checker for XFS file systems, both of which require only tens of lines of modification to the original versions. Both rfsck-ext and rfsck-xfs are resilient to faults in our experiments. Also, both checkers incur reasonable performance overhead (i.e., up to 12%) compared to the original unreliable versions. Moreover, rfsck-ext outperforms the patched e2fsck by up to nine times while achieving the same level of robustness.
{"title":"Towards Robust File System Checkers","authors":"Om Rameshwar Gatla, Muhammad Hameed, Mai Zheng, Viacheslav Dubeyko, A. Manzanares, F. Blagojevic, Cyril Guyot, R. Mateescu","doi":"10.1145/3281031","DOIUrl":"https://doi.org/10.1145/3281031","url":null,"abstract":"File systems may become corrupted for many reasons despite various protection techniques. Therefore, most file systems come with a checker to recover the file system to a consistent state. However, existing checkers are commonly assumed to be able to complete the repair without interruption, which may not be true in practice. In this work, we demonstrate via fault injection experiments that checkers of widely used file systems (EXT4, XFS, BtrFS, and F2FS) may leave the file system in an uncorrectable state if the repair procedure is interrupted unexpectedly. To address the problem, we first fix the ordering issue in the undo logging of e2fsck and then build a general logging library (i.e., rfsck-lib) for strengthening checkers. To demonstrate the practicality, we integrate rfsck-lib with existing checkers and create two new checkers: rfsck-ext, a robust checker for Ext-family file systems, and rfsck-xfs, a robust checker for XFS file systems, both of which require only tens of lines of modification to the original versions. Both rfsck-ext and rfsck-xfs are resilient to faults in our experiments. Also, both checkers incur reasonable performance overhead (i.e., up to 12%) compared to the original unreliable versions. Moreover, rfsck-ext outperforms the patched e2fsck by up to nine times while achieving the same level of robustness.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"530 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124148803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Pletka, Ioannis Koltsidas, Nikolas Ioannou, Sasa Tomic, N. Papandreou, Thomas Parnell, H. Pozidis, Aaron Fry, T. Fisher
Despite its widespread use in consumer devices and enterprise storage systems, NAND flash faces a growing number of challenges. While technology advances have helped to increase storage density and reduce costs, they have also led to reduced endurance and larger block variations, which cannot be compensated for solely by stronger ECC or read-retry schemes but have to be addressed holistically. Our goal is to enable the use of low-cost NAND flash in enterprise storage while still meeting enterprise requirements. We present novel flash-management approaches that reduce write amplification, achieve better wear leveling, and enhance endurance without sacrificing performance. We introduce block calibration, a technique to determine the read-threshold voltage levels that minimize error rates, as well as novel garbage-collection and data-placement schemes that alleviate the effects of block health variability, and we show how these techniques complement one another to meet enterprise storage requirements. By combining the proposed schemes, we improve endurance by up to 15× compared to the baseline endurance of NAND flash without using a stronger ECC scheme. The flash-management algorithms presented herein were designed and implemented in simulators, hardware test platforms, and eventually in the flash controllers of production enterprise all-flash arrays. Their effectiveness has been validated across thousands of customer deployments since 2015.
{"title":"Management of Next-Generation NAND Flash to Achieve Enterprise-Level Endurance and Latency Targets","authors":"R. Pletka, Ioannis Koltsidas, Nikolas Ioannou, Sasa Tomic, N. Papandreou, Thomas Parnell, H. Pozidis, Aaron Fry, T. Fisher","doi":"10.1145/3241060","DOIUrl":"https://doi.org/10.1145/3241060","url":null,"abstract":"Despite its widespread use in consumer devices and enterprise storage systems, NAND flash faces a growing number of challenges. While technology advances have helped to increase the storage density and reduce costs, they have also led to reduced endurance and larger block variations, which cannot be compensated solely by stronger ECC or read-retry schemes but have to be addressed holistically. Our goal is to enable low-cost NAND flash in enterprise storage for cost efficiency. We present novel flash-management approaches that reduce write amplification, achieve better wear leveling, and enhance endurance without sacrificing performance. We introduce block calibration, a technique to determine optimal read-threshold voltage levels that minimize error rates, and novel garbage-collection as well as data-placement schemes that alleviate the effects of block health variability and show how these techniques complement one another and thereby achieve enterprise storage requirements. By combining the proposed schemes, we improve endurance by up to 15× compared to the baseline endurance of NAND flash without using a stronger ECC scheme. The flash-management algorithms presented herein were designed and implemented in simulators, hardware test platforms, and eventually in the flash controllers of production enterprise all-flash arrays. Their effectiveness has been validated across thousands of customer deployments since 2015.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115700192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Persistent Memory (PM) devices present properties that are uniquely different from those of prior technologies for which applications have been built. Unfortunately, the conventional approach to building applications fails to either efficiently utilize these new devices or provide programmers with a seamless development experience. We have built LibPM, a Persistent Memory Library that implements an easy-to-use container abstraction for consuming PM. LibPM's containers are data hosting units that can store arbitrarily complex data types while preserving their integrity and consistency. Consequently, LibPM's containers provide a generic interface to applications, allowing applications to store and manipulate arbitrarily structured data with strong durability and consistency properties, all without having to navigate the myriad pitfalls of programming PM directly. By providing a simple and high-performing transactional update mechanism, LibPM allows applications to manipulate persistent data at the speed of memory. The container abstraction and automatic persistent data discovery mechanisms within LibPM also simplify porting legacy applications to PM. From a performance perspective, LibPM closely matches and often exceeds the performance of state-of-the-art application libraries for PM. For instance, LibPM's performance is 195× better for write-intensive workloads and 2.6× better for read-intensive workloads when compared with the state-of-the-art Pmem.IO persistent memory library.
{"title":"LibPM","authors":"L. Mármol, M. Chowdhury, R. Rangaswami","doi":"10.1145/3278141","DOIUrl":"https://doi.org/10.1145/3278141","url":null,"abstract":"Persistent Memory devices present properties that are uniquely different from prior technologies for which applications have been built. Unfortunately, the conventional approach to building applications fail to either efficiently utilize these new devices or provide programmers a seamless development experience. We have built LibPM, a Persistent Memory Library that implements an easy-to-use container abstraction for consuming PM. LibPM’s containers are data hosting units that can store arbitrarily complex data types while preserving their integrity and consistency. Consequently, LibPM’s containers provide a generic interface to applications, allowing applications to store and manipulate arbitrarily structured data with strong durability and consistency properties, all without having to navigate all the myriad pitfalls of programming PM directly. By providing a simple and high-performing transactional update mechanism, LibPM allows applications to manipulate persistent data at the speed of memory. The container abstraction and automatic persistent data discovery mechanisms within LibPM also simplify porting legacy applications to PM. From a performance perspective, LibPM closely matches and often exceeds the performance of state-of-the-art application libraries for PM. For instance, LibPM ’s performance is 195× better for write intensive workloads and 2.6× better for read intensive workloads when compared with the state-of-the-art Pmem.IO persistent memory library.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"87 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126295612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Zhan, Yizheng Jiao, Donald E. Porter, Alex Conway, Eric Knorr, Martín Farach-Colton, M. A. Bender, Jun Yuan, William Jannen, Rob Johnson
Full-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local file system with full-path indexing are prohibitively expensive. This article shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. The article introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value Application Programming Interface (API) and simplifies the overall file system design. We implemented this mechanism in the Bε-tree File System (BetrFS), an in-kernel, local file system for Linux. This new version, BetrFS 0.4, performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS 0.3, but renames are competitive with indirection-based file systems for a range of sizes. BetrFS 0.4 outperforms BetrFS 0.3, as well as traditional file systems, such as ext4, Extents File System (XFS), and Z File System (ZFS), across a variety of workloads.
{"title":"Efficient Directory Mutations in a Full-Path-Indexed File System","authors":"Yang Zhan, Yizheng Jiao, Donald E. Porter, Alex Conway, Eric Knorr, Martín Farach-Colton, M. A. Bender, Jun Yuan, William Jannen, Rob Johnson","doi":"10.1145/3241061","DOIUrl":"https://doi.org/10.1145/3241061","url":null,"abstract":"Full-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local file system with full-path indexing are prohibitively expensive. This article shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. The article introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value Application Programming Interface (API) and simplifies the overall file system design. We implemented this mechanism in B&egr;-trees File System (BetrFS), an in-kernel, local file system for Linux. This new version, BetrFS 0.4, performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS 0.3, but renames are competitive with indirection-based file systems for a range of sizes. BetrFS 0.4 outperforms BetrFS 0.3, as well as traditional file systems, such as ext4, Extents File System (XFS), and Z File System (ZFS), across a variety of workloads.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123042450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huizhang Luo, Qing Liu, J. Hu, Qiao Li, Liang Shi, Qingfeng Zhuge, E. Sha
The emerging Phase Change Memory (PCM) is considered a promising candidate to replace DRAM as the next-generation main memory due to its higher scalability and lower leakage power. However, high write power consumption has become a major challenge in adopting PCM as main memory. In addition to the fact that writing to PCM cells requires high write current and voltage, current loss in the charge pumps also accounts for a large share of the power consumption. The pumping efficiency of a PCM chip is a concave function of the write current. Leveraging this concavity, the overall pumping efficiency can be improved if the write current is uniform. In this article, we propose a peak-to-average (PTA) write scheme, which smooths write-current fluctuation by regrouping write units. In particular, we calculate the current requirement of each write unit from its data value when it is evicted from the last-level cache (LLC). While write units are waiting in the memory controller, we regroup them using LLC-assisted PTA to make the write current uniform. Experimental results show that LLC-assisted PTA achieves an overall energy saving of 13.4% compared to the baseline.
{"title":"Write Energy Reduction for PCM via Pumping Efficiency Improvement","authors":"Huizhang Luo, Qing Liu, J. Hu, Qiao Li, Liang Shi, Qingfeng Zhuge, E. Sha","doi":"10.1145/3200139","DOIUrl":"https://doi.org/10.1145/3200139","url":null,"abstract":"The emerging Phase Change Memory (PCM) is considered to be a promising candidate to replace DRAM as the next generation main memory due to its higher scalability and lower leakage power. However, the high write power consumption has become a major challenge in adopting PCM as main memory. In addition to the fact that writing to PCM cells requires high write current and voltage, current loss in the charge pumps also contributes a large percentage of high power consumption. The pumping efficiency of a PCM chip is a concave function of the write current. Leveraging the characteristics of the concave function, the overall pumping efficiency can be improved if the write current is uniform. In this article, we propose a peak-to-average (PTA) write scheme, which smooths the write current fluctuation by regrouping write units. In particular, we calculate the current requirements for each write unit by their values when they are evicted from the last level cache (LLC). When the write units are waiting in the memory controller, we regroup the write units by LLC-assisted PTA to reach the current-uniform goal. Experimental results show that LLC-assisted PTA achieved 13.4% of overall energy saving compared to the baseline.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131807670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Persistent key-value (KV) stores mostly build on the Log-Structured Merge (LSM) tree for high write performance, yet the LSM-tree suffers from inherently high I/O amplification. KV separation mitigates I/O amplification by storing only keys in the LSM-tree and values in separate storage. However, the current KV separation design remains inefficient under update-intensive workloads due to its high garbage collection (GC) overhead in value storage. We propose HashKV, which aims for high update performance atop KV separation under update-intensive workloads. HashKV uses hash-based data grouping, which deterministically maps values to storage space to make both updates and GC efficient. We further relax the restriction of such deterministic mappings via simple but useful design extensions. We extensively evaluate various design aspects of HashKV. We show that HashKV achieves 4.6× the update throughput of the current KV separation design with 53.4% less write traffic. In addition, we demonstrate that we can integrate the design of HashKV with state-of-the-art KV stores and improve their respective performance.
{"title":"Enabling Efficient Updates in KV Storage via Hashing","authors":"Yongkun Li, H. Chan, P. Lee, Yinlong Xu","doi":"10.1145/3340287","DOIUrl":"https://doi.org/10.1145/3340287","url":null,"abstract":"Persistent key-value (KV) stores mostly build on the Log-Structured Merge (LSM) tree for high write performance, yet the LSM-tree suffers from the inherently high I/O amplification. KV separation mitigates I/O amplification by storing only keys in the LSM-tree and values in separate storage. However, the current KV separation design remains inefficient under update-intensive workloads due to its high garbage collection (GC) overhead in value storage. We propose HashKV, which aims for high update performance atop KV separation under update-intensive workloads. HashKV uses hash-based data grouping, which deterministically maps values to storage space to make both updates and GC efficient. We further relax the restriction of such deterministic mappings via simple but useful design extensions. We extensively evaluate various design aspects of HashKV. We show that HashKV achieves 4.6× update throughput and 53.4% less write traffic compared to the current KV separation design. In addition, we demonstrate that we can integrate the design of HashKV with state-of-the-art KV stores and improve their respective performance.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"19 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125911951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This special issue of the ACM Transactions on Storage (TOS) presents some of the highlights of the 16th USENIX Conference on File and Storage Technologies (FAST’18). Over the years, FAST has evolved into a community of researchers and practitioners working on a diverse and expanding set of research topics; the conference represents some of the latest and best work being done, and this year was no different. FAST’18 received a record 139 submissions on topics ranging from non-volatile memory; distributed, cloud, and data center storage; and performance and scalability, to experiences with deployed systems. Of these, we selected five high-quality articles for publication in this special issue of ACM TOS. The first article, which was also selected as one of the best papers at the conference, is “Protocol-Aware Recovery for Consensus-based Storage” by Ramnatthan Alagappan, Aishwarya Ganesan, Eric Lee, Aws Albarghouthi, Vijay Chidambaram, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. The authors demonstrate how storage faults can significantly affect recovery in distributed storage systems that are based on replicated state machines, including ones in widespread use today. They then propose corruption-tolerant replication as a solution that can ensure safe recovery. The second article is “Efficient Directory Mutations in a Full-Path Indexed File System” by Yang Zhan, Alex Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. BetrFS is a file system that offers dramatically faster execution times for common modern-day file-system operations. In this significant update to the design of BetrFS, the authors tackle the last stronghold of performance challenges, rename, with a new “range-rename” mechanism. The third article is “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems” by Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. Mysterious storage faults are legends within the computer industry, and increasingly more so as the scale of deployed systems grows rapidly; this article presents a lively discussion of one such class of faults, namely fail-slow, that has significant impact. The authors draw from a large-scale study based on significant documented and anecdotal evidence obtained from 101 reports of such incidents sourced from 12 different institutions. The fourth article, which was also selected as one of the best papers at the conference, is “Bringing Order to Chaos: Barrier-Enabled I/O Stack for Flash Storage” by Youjip Won, Joontaek Oh, Jaemin Jung, Gyeongyeol Choi, Seongbae Son, Jooyoung Hwang, and Sangyeun Cho. The modern storage I/O stack is extremely complex; a major contributor to this complexity is layering and the impedance mismatch across layers. The authors revisit this long-mature area and make a surprisingly original contribution that is not only powerful but fundamentally simple, allowing applications to safely extract the most from high-performance storage.
{"title":"Introduction to the Special Issue on USENIX FAST 2018","authors":"Nitin Agrawal, R. Rangaswami","doi":"10.1145/3242152","DOIUrl":"https://doi.org/10.1145/3242152","url":null,"abstract":"This special issue of the ACM Transactions on Storage (TOS) presents some of the highlights of the 16th USENIX Conference on File and Storage Technologies (FAST’18). Over the years, FAST has evolved into a community of researchers and practitioners working on a diverse and expanding set of research topics; the conference represents some of the latest and best work being done, and this year was no different. FAST’18 received a record number of 139 submissions on topics ranging from non-volatile memory; distributed, cloud, and data center storage; and performance and scalability to experiences with deployed systems. Of these, we selected five high-quality articles for publication in this special issue of ACM TOS. The first article, which was also selected as one of the best papers at the conference, is “Protocol-Aware Recovery for Consensus-based Storage” by Ramnatthan Alagappan, Aishwarya Ganesan, Eric Lee, Aws Albarghouthi, Vijay Chidambaram, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. Distributed storage systems are in widespread use today. The authors demonstrate how storage faults can significantly affect recovery in distributed storage systems that are based on replicated state machines, including ones in widespread use today. They then propose corruption-tolerant replication as a solution that can ensure safe recovery. The second article is “Efficient Directory Mutations in a Full-Path Indexed File System” by Yang Zhan, Alex Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. BetrFS is a file system that offers dramatically faster execution times for common modern-day file-system operations. In this significant update to the design of BetrFS, the authors tackle the last stronghold of performance challenges, rename, with a new “range-rename” mechanism. The third article is “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems” by Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. Mysterious storage faults are legends within the computer industry and increasingly more so as the scale of deployed systems grows rapidly; this article presents a lively discussion of one such class of faults, namely fail-slow, that has significant impact. The authors draw from a large-scale study based on significant documented and anecdotal evidence obtained from 101 reports of such incidents sourced from 12 different institutions. 
The fourth article, which was also selected as one of the best papers at the conference, is “Bringing Order to Chaos: Barrier-Enabled I/O Stack for Flash Storage” by Youjip Won, Joontaek Oh, Jaemin Jung, Gyeongyeol Choi, Seongbae Son, Jo","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133162850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The combination of the explosive growth in digital data and the demand to preserve much of this data in the long term has made it imperative to find a more cost-effective way than HDD arrays and a more easily accessible way than tape libraries to store massive amounts of data. While modern optical discs are capable of guaranteeing more than 50 years of data preservation without media replacement, individual optical discs' lack of performance and capacity relative to HDDs or tapes has significantly limited their use in datacenters. This article presents a Rack-scale Optical disc library System, or ROS in short, which provides a PB-level total capacity and inline accessibility on thousands of optical discs built within a 42U rack. A rotatable roller and a robotic arm that separates and fetches discs are designed to improve disc placement density and simplify the mechanical structure. A hierarchical storage system based on SSDs, hard disks, and optical discs is proposed to effectively hide the delay of mechanical operations. Moreover, an optical library file system (OLFS) based on FUSE is proposed to schedule mechanical operations and organize data on the tiered storage behind a POSIX user interface, providing an illusion of inline data accessibility. We further optimize OLFS by reducing unnecessary user/kernel context switches inherited from the legacy FUSE framework. We evaluate ROS on a few key performance metrics, including operation delays of the mechanical structure and software overhead, in a prototype PB-level ROS system. The results show that ROS, stacked on Samba and FUSE in network-attached storage (NAS) mode, nearly saturates the throughput provided by the underlying Samba over a 10GbE network for external users, while providing about 53ms file-write and 15ms read latency in this scenario, exhibiting its inline accessibility. Besides, ROS is able to effectively hide and virtualize internal complex operational behaviors and can be easily deployed in datacenters.
{"title":"ROS","authors":"Wenrui Yan, Jie Yao, Q. Cao, C. Xie, Hong Jiang","doi":"10.1145/3231599","DOIUrl":"https://doi.org/10.1145/3231599","url":null,"abstract":"The combination of the explosive growth in digital data and the demand to preserve much of these data in the long term has made it imperative to find a more cost-effective way than HDD arrays and a more easily accessible way than tape libraries to store massive amounts of data. While modern optical discs are capable of guaranteeing more than 50-year data preservation without media replacement, individual optical discs’ lack of the performance and capacity relative to HDDs or tapes has significantly limited their use in datacenters. This article presents a Rack-scale Optical disc library System, or ROS in short, which provides a PB-level total capacity and inline accessibility on thousands of optical discs built within a 42U Rack. A rotatable roller and robotic arm separating and fetching discs are designed to improve disc placement density and simplify the mechanical structure. A hierarchical storage system based on SSDs, hard disks, and optical discs is proposed to effectively hide the delay of mechanical operation. However, an optical library file system (OLFS) based on FUSE is proposed to schedule mechanical operation and organize data on the tiered storage with a POSIX user interface to provide an illusion of inline data accessibility. We further optimize OLFS by reducing unnecessary user/kernel context switches inheriting from legacy FUSE framework. We evaluate ROS on a few key performance metrics, including operation delays of the mechanical structure and software overhead in a prototype PB-level ROS system. The results show that ROS stacked on Samba and FUSE as network-attached storage (NAS) mode almost saturates the throughput provided by underlying samba via 10GbE network for external users, as well as in this scenario provides about 53ms file write and 15ms read latency, exhibiting its inline accessibility. Besides, ROS is able to effectively hide and virtualize internal complex operational behaviors and be easily deployable in datacenters.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122895605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Key-value caching is crucial to today's low-latency Internet services. Conventional key-value cache systems, such as Memcached, heavily rely on expensive DRAM memory. To lower the Total Cost of Ownership, the industry has recently been moving toward more cost-efficient flash-based solutions, such as Facebook's McDipper [14] and Twitter's Fatcache [56]. These cache systems typically take commercial SSDs and adopt a Memcached-like scheme to store and manage key-value cache data in flash. Such a practice, though simple, is inefficient due to the huge semantic gap between the key-value cache manager and the underlying flash devices. In this article, we advocate reconsidering the cache system design and directly opening device-level details of the underlying flash storage for key-value caching. We propose an enhanced flash-aware key-value cache manager, which consists of a novel unified address mapping module, an integrated garbage collection policy, dynamic over-provisioning space management, and a customized wear-leveling policy, to directly drive the flash management. A thin intermediate library layer provides a slab-based abstraction of the low-level flash memory space and an API for directly and easily operating flash devices. A special flash SSD that exposes physical flash details is adopted to store key-value items. This co-design approach bridges the semantic gap and tightly connects the two layers, allowing us to leverage both the domain knowledge of key-value caches and the unique device properties. In this way, we can maximize the efficiency of key-value caching on flash devices while minimizing its weaknesses. We implemented a prototype, called DIDACache, based on the Open-Channel SSD platform. Our experiments on real hardware show that we can significantly increase the throughput by 35.5%, reduce the latency by 23.6%, and remove unnecessary erase operations by 28%.
{"title":"DIDACache","authors":"Zhaoyan Shen, Feng Chen, Yichen Jia, Z. Shao","doi":"10.1145/3203410","DOIUrl":"https://doi.org/10.1145/3203410","url":null,"abstract":"Key-value caching is crucial to today’s low-latency Internet services. Conventional key-value cache systems, such as Memcached, heavily rely on expensive DRAM memory. To lower Total Cost of Ownership, the industry recently is moving toward more cost-efficient flash-based solutions, such as Facebook’s McDipper [14] and Twitter’s Fatcache [56]. These cache systems typically take commercial SSDs and adopt a Memcached-like scheme to store and manage key-value cache data in flash. Such a practice, though simple, is inefficient due to the huge semantic gap between the key-value cache manager and the underlying flash devices. In this article, we advocate to reconsider the cache system design and directly open device-level details of the underlying flash storage for key-value caching. We propose an enhanced flash-aware key-value cache manager, which consists of a novel unified address mapping module, an integrated garbage collection policy, a dynamic over-provisioning space management, and a customized wear-leveling policy, to directly drive the flash management. A thin intermediate library layer provides a slab-based abstraction of low-level flash memory space and an API interface for directly and easily operating flash devices. A special flash memory SSD hardware that exposes flash physical details is adopted to store key-value items. This co-design approach bridges the semantic gap and well connects the two layers together, which allows us to leverage both the domain knowledge of key-value caches and the unique device properties. In this way, we can maximize the efficiency of key-value caching on flash devices while minimizing its weakness. We implemented a prototype, called DIDACache, based on the Open-Channel SSD platform. Our experiments on real hardware show that we can significantly increase the throughput by 35.5%, reduce the latency by 23.6%, and remove unnecessary erase operations by 28%.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124218494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}