SolarDB
Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, Huiqi Hu
DOI: 10.1145/3318158 (ACM Transactions on Storage, published 2019-06-25)
Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Although modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This article presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include (1) a shared-everything architecture based on a two-layer log-structured merge-tree; (2) a new concurrency control algorithm that works with the log-structured storage and ensures efficient, non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; and (3) fine-grained data access that minimizes and balances network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank, and a real-world workload, SolarDB outperforms existing shared-nothing systems by up to 50x when close to or more than 5% of the transactions are distributed.
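To make the two-layer design concrete, the sketch below shows, in Python and as a hypothetical illustration rather than SolarDB's actual code, a read/write path in which recent writes live in an in-memory delta owned by the transaction layer while older data lives in an immutable baseline in the shared storage layer. The class and method names, and the use of plain dictionaries, are our own simplifications.

```python
# Illustrative sketch only (not SolarDB's code): a two-layer log-structured
# read/write path. Recent writes go to a small in-memory delta; older data
# lives in an immutable baseline snapshot (assumed here to be a plain dict).

class TwoLayerStore:
    def __init__(self, baseline):
        self.delta = {}            # recent committed writes (memory layer)
        self.baseline = baseline   # immutable storage-layer snapshot

    def put(self, key, value):
        self.delta[key] = value    # writes never touch the baseline directly

    def get(self, key):
        # Newest data first: check the delta, then fall back to the baseline.
        if key in self.delta:
            return self.delta[key]
        return self.baseline.get(key)

    def compact(self):
        # In a real system this merge runs in the background across storage
        # nodes; here it is shown as a simple synchronous fold of the delta
        # into a new baseline.
        merged = dict(self.baseline)
        merged.update(self.delta)
        self.baseline, self.delta = merged, {}
```

In SolarDB the equivalent merge happens across storage nodes and must not block transaction processing, which is the situation the article's concurrency control algorithm is designed to handle.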
{"title":"SolarDB","authors":"Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong-Ye Xie, Ryan Stutsman, HaiNing Li, Huiqi Hu","doi":"10.1145/3318158","DOIUrl":"https://doi.org/10.1145/3318158","url":null,"abstract":"Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Although modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This article presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include (1) a shared-everything architecture based on a two-layer log-structured merge-tree; (2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; and (3) find-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank, and a real-world workload, SolarDB outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132837467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Level Hashing
Pengfei Zuo, Yu Hua, Jie Wu
DOI: 10.1145/3322096 (ACM Transactions on Storage, published 2019-06-21)
Non-volatile memory (NVM) technologies as persistent memory are promising candidates to complement or replace DRAM in future memory systems, thanks to their high density, low power, and non-volatility. In main memory systems, hashing index structures are fundamental building blocks that provide fast query responses. However, hashing index structures originally designed for dynamic random access memory (DRAM) become inefficient for persistent memory due to new challenges, including the hardware limitations of NVM and the requirement of data consistency. To address these challenges, this article proposes level hashing, a write-optimized and high-performance hashing index scheme with a low-overhead consistency guarantee and cost-efficient resizing. Level hashing provides a sharing-based two-level hash table, which achieves constant-scale worst-case time complexity for search, insertion, deletion, and update operations, and rarely incurs extra NVM writes. To guarantee consistency with low overhead, level hashing leverages log-free consistency schemes for deletion, insertion, and resizing operations, and an opportunistic log-free scheme for the update operation. To cost-efficiently resize the hash table, level hashing leverages an in-place resizing scheme that only needs to rehash 1/3 of the buckets to expand the table (and 2/3 of the buckets to shrink it) instead of the entire table, thus significantly improving resizing performance and reducing the number of rehashed buckets. Extensive experimental results show that level hashing speeds up insertions by 1.4×–3.0×, updates by 1.2×–2.1×, expanding by over 4.3×, and shrinking by over 1.4×, while maintaining high search and deletion performance compared with state-of-the-art hashing schemes.
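As a rough intuition for the sharing-based two-level layout, here is a toy Python sketch (our simplification, not the authors' implementation): every key has two candidate buckets in the top level, and each bottom-level bucket is shared as a standby by two top-level buckets. The bucket size, hash functions, and the absence of item movement are arbitrary choices for illustration.

```python
# Toy two-level hash table: a top level of N buckets and a bottom level of
# N/2 buckets; each key hashes to two top-level candidates, and each
# bottom-level bucket serves as shared standby space for two top buckets.

SLOTS_PER_BUCKET = 4

def _h1(key, n): return hash(("h1", key)) % n
def _h2(key, n): return hash(("h2", key)) % n

class TwoLevelHash:
    def __init__(self, top_buckets=8):
        self.n = top_buckets
        self.top = [[] for _ in range(self.n)]
        self.bottom = [[] for _ in range(self.n // 2)]

    def _candidates(self, key):
        i, j = _h1(key, self.n), _h2(key, self.n)
        # Top bucket i shares bottom bucket i // 2 as its standby.
        return [self.top[i], self.top[j],
                self.bottom[i // 2], self.bottom[j // 2]]

    def insert(self, key, value):
        for bucket in self._candidates(key):
            if len(bucket) < SLOTS_PER_BUCKET:
                bucket.append((key, value))
                return True
        return False   # a real implementation would move items or resize

    def search(self, key):
        for bucket in self._candidates(key):
            for k, v in bucket:
                if k == key:
                    return v
        return None
```

In this toy layout the bottom level holds N/2 of the 3N/2 total buckets, i.e., one third of them, which matches the abstract's point that an in-place expansion rehashing only the bottom-level buckets touches about 1/3 of the table.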
{"title":"Level Hashing","authors":"Pengfei Zuo, Yu Hua, Jie Wu","doi":"10.1145/3322096","DOIUrl":"https://doi.org/10.1145/3322096","url":null,"abstract":"Non-volatile memory (NVM) technologies as persistent memory are promising candidates to complement or replace DRAM for building future memory systems, due to having the advantages of high density, low power, and non-volatility. In main memory systems, hashing index structures are fundamental building blocks to provide fast query responses. However, hashing index structures originally designed for dynamic random access memory (DRAM) become inefficient for persistent memory due to new challenges including hardware limitations of NVM and the requirement of data consistency. To address these challenges, this article proposes level hashing, a write-optimized and high-performance hashing index scheme with low-overhead consistency guarantee and cost-efficient resizing. Level hashing provides a sharing-based two-level hash table, which achieves constant-scale worst-case time complexity for search, insertion, deletion, and update operations, and rarely incurs extra NVM writes. To guarantee the consistency with low overhead, level hashing leverages log-free consistency schemes for deletion, insertion, and resizing operations, and an opportunistic log-free scheme for update operation. To cost-efficiently resize this hash table, level hashing leverages an in-place resizing scheme that only needs to rehash 1/3 of buckets instead of the entire table to expand a hash table and rehash 2/3 of buckets to shrink a hash table, thus significantly improving the resizing performance and reducing the number of rehashed buckets. Extensive experimental results show that the level hashing speeds up insertions by 1.4×−3.0×, updates by 1.2×−2.1×, expanding by over 4.3×, and shrinking by over 1.4× while maintaining high search and deletion performance compared with start-of-the-art hashing schemes.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130498617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Introduction to the Special Section on OSDI’18
A. Arpaci-Dusseau, G. Voelker
DOI: 10.1145/3322101 (ACM Transactions on Storage, published 2019-05-31)
This special section of the ACM Transactions on Storage presents two articles from the 13th USENIX Symposium on Operating System Design and Implementation (OSDI’18). OSDI’18 contained 47 exceptionally strong articles across a range of topics: file and storage systems, networking, scheduling, security, formal verification, graph processing, machine learning, programming languages, fault-tolerance and reliability, debugging, and, of course, operating systems design and implementation. The two high-quality articles we have selected for TOS focus, not surprisingly, on file and storage systems. The first article is “Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory” by Pengfei Zuo, Yu Hua, and Jie Wu. This article introduces an elegant hashing data structure for non-volatile memory (NVM), called level hashing. Level hashing optimizes for NVM with low-overhead consistency mechanisms and by reducing the number of write operations. Level hashing has particularly interesting algorithms for performing in-place resizing of the hash table. The second article is “Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing” by Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. CrashMonkey and Ace are a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey tests a target file system by simulating power-loss crashes and then checks if the file system recovers to a correct state. Ace automatically generates interesting workloads to stress the target file system; Ace is particularly innovative in how it explores the infinite space of possible workloads. With these new tools, the authors have found many difficult crash-consistency bugs, including 10 previously unknown bugs in widely used, mature Linux file systems and one in FSCQ, a verified file system. We hope you will find these articles interesting and inspiring!
{"title":"Introduction to the Special Section on OSDI’18","authors":"A. Arpaci-Dusseau, G. Voelker","doi":"10.1145/3322101","DOIUrl":"https://doi.org/10.1145/3322101","url":null,"abstract":"This special section of the ACM Transactions on Storage presents two articles from the 13th USENIX Symposium on Operating System Design and Implementation (OSDI’18). OSDI’18 contained 47 exceptionally strong articles across a range of topics: file and storage systems, networking, scheduling, security, formal verification, graph processing, machine learning, programming languages, fault-tolerance and reliability, debugging, and, of course, operating systems design and implementation. The two high-quality articles we have selected for TOS focus, not surprisingly, on file and storage systems. The first article is “Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory” by Pengfei Zuo, Yu Hua, and Jie Wu. This article introduces an elegant hashing data structure for non-volatile memory (NVM), called level hashing. Level hashing optimizes for NVM with low-overhead consistency mechanisms and by reducing the number of write operations. Level hashing has particularly interesting algorithms for performing in-place resizing of the hash table. The second article is “Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing” by Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. CrashMonkey and Ace are a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey tests a target file system by simulating power-loss crashes and then checks if the file system recovers to a correct state. Ace automatically generates interesting workloads to stress the target file system; Ace is particularly innovative in how it explores the infinite space of possible workloads. With these new tools, the authors have found many difficult crash-consistency bugs, including 10 previously unknown bugs in widely used, mature Linux file systems and one in FSCQ, a verified file system. We hope you will find these articles interesting and inspiring!","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134072306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mitigating Synchronous I/O Overhead in File Systems on Open-Channel SSDs
Youyou Lu, J. Shu, Jiacheng Zhang
DOI: 10.1145/3319369 (ACM Transactions on Storage, published 2019-05-31)
Synchronous I/O has long been a design challenge in file systems. Although open-channel solid state drives (SSDs) provide better performance and endurance to file systems, they still suffer from synchronous I/Os due to amplified writes and poor hot/cold data grouping. The reason lies in the conflicting design choices between flash write and read/erase operations: while fine-grained logging improves performance and endurance for writes, it hurts indexing and data-grouping efficiency for read and erase operations. In this article, we propose a flash-friendly data layout that introduces a built-in persistent staging layer to provide balanced read, write, and garbage collection performance. Based on this, we design a new flash file system named StageFS, which decouples content updates from structure updates. Content updates are logically logged to the staging layer in a persistence-efficient way, which achieves better write performance and lower write amplification. The updated contents are then reorganized into the normal data area for structure updates, with improved hot/cold grouping and page-level indexing, which is friendlier to read and garbage collection operations. Evaluation results show that, compared to the recent flash-friendly file system F2FS, StageFS improves performance by up to 211.4% and achieves low garbage collection overhead for workloads with frequent synchronization.
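The decoupling of content updates from structure updates can be pictured with a small sketch. The Python below is our illustration only (StageFS is a kernel file system; names like StagedStore are invented): synchronous writes are appended to a persistent staging log and indexed there, and a later reorganization pass moves them into the normal data area, where grouping and page-level indexing would apply.

```python
# Our illustration of decoupled content/structure updates (not StageFS code).

class StagedStore:
    def __init__(self):
        self.staging_log = []      # append-only, persistence-efficient log
        self.data_area = {}        # "normal" data area keyed by page id
        self.index = {}            # page id -> ("staging", pos) or ("data", None)

    def sync_write(self, page_id, data):
        # Content update: append to the staging log and return quickly.
        self.staging_log.append((page_id, data))
        self.index[page_id] = ("staging", len(self.staging_log) - 1)

    def read(self, page_id):
        where, pos = self.index[page_id]
        if where == "staging":
            return self.staging_log[pos][1]
        return self.data_area[page_id]

    def reorganize(self):
        # Structure update: move staged pages into the data area in bulk;
        # this is the step where hot/cold grouping decisions would be made.
        for page_id, data in self.staging_log:
            self.data_area[page_id] = data
            self.index[page_id] = ("data", None)
        self.staging_log.clear()
```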
{"title":"Mitigating Synchronous I/O Overhead in File Systems on Open-Channel SSDs","authors":"Youyou Lu, J. Shu, Jiacheng Zhang","doi":"10.1145/3319369","DOIUrl":"https://doi.org/10.1145/3319369","url":null,"abstract":"Synchronous I/O has long been a design challenge in file systems. Although open-channel solid state drives (SSDs) provide better performance and endurance to file systems, they still suffer from synchronous I/Os due to the amplified writes and worse hot/cold data grouping. The reason lies in the controversy design choices between flash write and read/erase operations. While fine-grained logging improves performance and endurance in writes, it hurts indexing and data grouping efficiency in read and erase operations. In this article, we propose a flash-friendly data layout by introducing a built-in persistent staging layer to provide balanced read, write, and garbage collection performance. Based on this, we design a new flash file system (FS) named StageFS, which decouples the content and structure updates. Content updates are logically logged to the staging layer in a persistence-efficient way, which achieves better write performance and lower write amplification. The updated contents are reorganized into the normal data area for structure updates, with improved hot/cold grouping and in a page-level indexing way, which is more friendly to read and garbage collection operations. Evaluation results show that, compared to recent flash-friendly file system (F2FS), StageFS effectively improves performance by up to 211.4% and achieves low garbage collection overhead for workloads with frequent synchronization.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128622446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

An Exploratory Study on Software-Defined Data Center Hard Disk Drives
Yin Li, Xubin Chen, Ning Zheng, Jingpeng Hao, T. Zhang
DOI: 10.1145/3319405 (ACM Transactions on Storage, published 2019-05-21)
This article presents a design framework that aims to reduce mass data storage cost in data centers. Its underlying principle is simple: one may noticeably reduce HDD manufacturing cost by significantly (i.e., by at least several orders of magnitude) relaxing raw HDD reliability, while ensuring eventual data storage integrity through low-cost system-level redundancy. We call this system-assisted HDD bit cost reduction. To better utilize both the capacity and the random IOPS of HDDs, it is desirable to mix data with complementary requirements on capacity and random IOPS in each HDD. Nevertheless, different capacity and random IOPS requirements may demand different raw-HDD-reliability vs. bit-cost trade-offs and hence different forms of system-assisted bit cost reduction. This article presents a software-centric design framework that realizes data-adaptive, system-assisted bit cost reduction for data center HDDs. The implementation is handled solely by the filesystem and demands only a minor change to the error correction coding (ECC) module inside HDDs. Hence, it is completely transparent to all the other components in the software stack (e.g., applications, OS kernel, and drivers) and keeps fundamental HDD design practice (e.g., firmware, media, head, and servo) intact. We carried out analysis and experiments to evaluate its implementation feasibility and effectiveness, and integrated the design techniques into ext4 to quantitatively measure their impact on system speed performance.
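As a hedged illustration of system-level redundancy compensating for a less reliable raw medium (not the paper's actual ECC or filesystem design), the sketch below stores a checksum per block plus one XOR parity block per group, so a block that fails its checksum on read can be rebuilt from the rest of its group. The group size and hash choice are arbitrary.

```python
# Our toy model of system-assisted redundancy: per-block checksums detect a
# bad read from the (unreliable) medium, and a per-group XOR parity block
# lets the system rebuild the lost block from its siblings.

import hashlib
from functools import reduce

GROUP = 4  # data blocks per parity group (arbitrary for illustration)

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def write_group(blocks):
    assert len(blocks) == GROUP and len({len(b) for b in blocks}) == 1
    return {"data": list(blocks),
            "sums": [checksum(b) for b in blocks],
            "parity": xor_blocks(blocks)}

def read_block(group, i):
    block = group["data"][i]
    if checksum(block) == group["sums"][i]:
        return block
    # Checksum mismatch: rebuild the block from its siblings plus parity.
    siblings = [b for j, b in enumerate(group["data"]) if j != i]
    return xor_blocks(siblings + [group["parity"]])
```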
{"title":"An Exploratory Study on Software-Defined Data Center Hard Disk Drives","authors":"Yin Li, Xubin Chen, Ning Zheng, Jingpeng Hao, T. Zhang","doi":"10.1145/3319405","DOIUrl":"https://doi.org/10.1145/3319405","url":null,"abstract":"This article presents a design framework aiming to reduce mass data storage cost in data centers. Its underlying principle is simple: Assume one may noticeably reduce the HDD manufacturing cost by significantly (i.e., at least several orders of magnitude) relaxing raw HDD reliability, which ensures the eventual data storage integrity via low-cost system-level redundancy. This is called system-assisted HDD bit cost reduction. To better utilize both capacity and random IOPS of HDDs, it is desirable to mix data with complementary requirements on capacity and random IOPS in each HDD. Nevertheless, different capacity and random IOPS requirements may demand different raw HDD reliability vs. bit cost trade-offs and hence different forms of system-assisted bit cost reduction. This article presents a software-centric design framework to realize data-adaptive system-assisted bit cost reduction for data center HDDs. Implementation is solely handled by the filesystem and demands only minor change of the error correction coding (ECC) module inside HDDs. Hence, it is completely transparent to all the other components in the software stack (e.g., applications, OS kernel, and drivers) and keeps fundamental HDD design practice (e.g., firmware, media, head, and servo) intact. We carried out analysis and experiments to evaluate its implementation feasibility and effectiveness. We integrated the design techniques into ext4 to further quantitatively measure its impact on system speed performance.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127447699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Introduction to the Special Section on the 2018 USENIX Annual Technical Conference (ATC’18)
Haryadi S. Gunawi, B. Reed
DOI: 10.1145/3322100 (ACM Transactions on Storage, published 2019-05-13)
This special section of the ACM Transactions on Storage presents some of the highlights of the 2018 USENIX Annual Technical Conference (ATC’18). Over the years, USENIX ATC has evolved into a community of researchers and practitioners working on a diverse and expanding set of research topics; the conference represents some of the latest and best work being done, and this year was no different. ATC’18 received a record 377 submissions. Of these, we selected three high-quality storage-related articles for publication in this special section of ACM Transactions on Storage. The first article, which was also selected as one of the best papers of ATC’18, is “TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions” by Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, Tianyu Cheng, Vijay Chidambaram, and Emmett Witchel. Building transactional systems is complex and error-prone. The authors introduce a novel approach to building a transactional file system by taking advantage of the mature, well-tested filesystem journal feature. Compared to earlier transactional file systems, it is easy to develop and use, and it demonstrates performance boosts for a number of different workloads. The second article is “CGraph: A Distributed Storage and Processing System for Concurrent Iterative Graph Analysis Jobs” by Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, and Ligang He. Distributed graph processing platforms, which handle massive numbers of Concurrent iterative Graph Processing (CGP) jobs, are now widely used. However, existing distributed systems face a high ratio of data access cost to computation for CGP jobs, which results in low throughput. The authors observed that this happens because CGP jobs need to repeatedly traverse the shared graph structure. They therefore propose exploiting the spatial and temporal correlations between the data accesses of these jobs so that multiple concurrent iterative graph processing jobs can efficiently share the graph data, providing higher throughput. The final article is “SolarDB: Towards a Shared-Everything Database on Distributed Log-Structured Storage” by Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, and Huiqi Hu. Supporting transactions with outstanding performance in a distributed database has been a complex issue to tackle. In this work, the authors describe how to build a shared-everything distributed relational database that achieves dramatically faster, fine-grained, and non-blocking transactions based on log-structured storage and new concurrency control mechanisms.
{"title":"Introduction to the Special Section on the 2018 USENIX Annual Technical Conference (ATC’18)","authors":"Haryadi S. Gunawi, B. Reed","doi":"10.1145/3322100","DOIUrl":"https://doi.org/10.1145/3322100","url":null,"abstract":"This special section of the ACM Transactions on Storage presents some of the highlights of the 2018 USENIX Annual Technical Conference (ATC’18). Over the years, USENIX ATC has evolved into a community of researchers and practitioners working on a diverse and expanding set of research topics; the conference represents some of the latest and best work being done, and this year was no different. ATC’18 received a record number of 377 submissions. Of these, we selected three high-quality storage-related articles for publication in this special section of ACM Transactions on Storage. The first article, which was also selected as one of the best papers in ATC’18, is “TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions” by Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, Tianyu Cheng, Vijay Chidambaram, and Emmett Witchel. Building transactional systems is complex and error-prone. The authors of this article introduce a novel approach to build a transactional file system by taking advantage of the mature, well-tested filesystem journal feature. Compared to earlier transactional file systems, it is easy to develop and use. It also demonstrates performance boosts for a number of different workloads. The second article is “CGraph: A Distributed Storage and Processing System for Concurrent Iterative Graph Analysis Jobs” by Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, and Ligang He. Nowadays, distributed graph processing platform, which handles massive Concurrent iterative Graph Processing (CGP) jobs, is widely used. However, existing distributed systems face a high ratio of data access cost to computation for the CGP jobs, which incurs low throughput. The authors observed that this phenomenon happened because these CGP jobs need to repeatedly traverse the shared graph structure. They then propose exploiting the observed spatial and temporal correlations between the data accesses of these jobs to enable multiple concurrent iterative graph processing jobs. Hence, it efficiently shares the graph data and provides higher throughput. The final article is “SolarDB: Towards a Shared-Everything Database on Distributed Log-Structured Storage” by Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, and Huiqi Hu. Supporting transactions in distributed database with outstanding performance has been a complex issue to tackle. In this work, the authors describe how to build a shared-everything distributed relational database that achieves dramatically faster, fine-grained, and non-blocking transactions, based on log-structured trees and new control mechanisms.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"60 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114003710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

TxFS
Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, T. Cheng, Vijay Chidambaram, E. Witchel
DOI: 10.1145/3318159 (ACM Transactions on Storage, published 2019-05-08)
We introduce TxFS, a transactional file system that builds upon a file system’s atomic-update mechanism such as journaling. Though prior work has explored a number of transactional file systems, TxFS has a unique set of properties: a simple API, portability across different hardware, high performance, low complexity (by building on the file-system journal), and full ACID transactions. We port SQLite, OpenLDAP, and Git to use TxFS and experimentally show that TxFS provides strong crash consistency while providing equal or better performance.
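The programming model is the easiest part to picture: an application brackets a set of file-system operations so that they become visible and durable atomically. The Python sketch below is purely conceptual; TxFS itself is a kernel file system, and the txfs_* functions here are stand-in stubs, not its real API or bindings.

```python
# Conceptual sketch of a transactional-file-system programming model.
# The txfs_* functions are stubs for illustration only.

import os

def txfs_begin():  pass   # stub: would start a file-system transaction
def txfs_commit(): pass   # stub: would atomically persist all updates since begin
def txfs_abort():  pass   # stub: would roll back all updates since begin

def replace_file(dirpath, old_name, new_name, data: bytes):
    """Create the new file and remove the old one as one atomic unit."""
    txfs_begin()
    try:
        with open(os.path.join(dirpath, new_name), "wb") as f:
            f.write(data)
        os.unlink(os.path.join(dirpath, old_name))
        txfs_commit()   # either both updates survive a crash, or neither does
    except Exception:
        txfs_abort()
        raise
```

The appeal described in the abstract is that this atomicity is provided by the file system's existing journal, so applications like SQLite, OpenLDAP, and Git no longer have to build their own rollback or write-ahead logic.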
{"title":"TxFS","authors":"Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, T. Cheng, Vijay Chidambaram, E. Witchel","doi":"10.1145/3318159","DOIUrl":"https://doi.org/10.1145/3318159","url":null,"abstract":"We introduce TxFS, a transactional file system that builds upon a file system’s atomic-update mechanism such as journaling. Though prior work has explored a number of transactional file systems, TxFS has a unique set of properties: a simple API, portability across different hardware, high performance, low complexity (by building on the file-system journal), and full ACID transactions. We port SQLite, OpenLDAP, and Git to use TxFS and experimentally show that TxFS provides strong crash consistency while providing equal or better performance.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121652287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance and Resource Utilization of FUSE User-Space File Systems
Bharath Kumar Reddy Vangoor, Prafful Agarwal, Manu Mathew, Arun Ramachandran, Swaminathan Sivaraman, Vasily Tarasov, E. Zadok
DOI: 10.1145/3310148 (ACM Transactions on Storage, published 2019-05-08)
Traditionally, file systems were implemented as part of operating system kernels, which provide a limited set of tools and facilities to a programmer. As the complexity of file systems grew, many new file systems began to be developed in user space. Low performance is considered the main disadvantage of user-space file systems, but the extent of this problem has never been explored systematically. As a result, the topic of user-space file systems remains rather controversial: while some consider user-space file systems a “toy” not to be used in production, others develop full-fledged production file systems in user space. In this article, we analyze the design and implementation of FUSE, a well-known user-space file system framework for Linux, and characterize its performance and resource utilization for a wide range of workloads. We measure FUSE's performance and resource utilization under various mount and configuration options, using 45 different workloads generated with Filebench on two different hardware configurations. We instrumented FUSE to extract useful statistics and traces, which helped us analyze its performance bottlenecks, and we present our analysis results. Our experiments indicate that, depending on the workload and hardware used, the throughput degradation caused by FUSE can range from completely imperceptible to as much as 83%, even when optimized, and that the latency of FUSE file system operations can be anywhere from unchanged to 4× higher than Ext4. On the resource utilization side, FUSE can increase relative CPU utilization by up to 31% and underutilize disk bandwidth by as much as 80% compared to Ext4, though for many data-intensive workloads the impact was statistically indistinguishable. Our conclusion is that user-space file systems can indeed be used in production (non-“toy”) settings, but their applicability depends on the expected workloads.
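A back-of-the-envelope way to reproduce the kind of throughput comparison reported here (the paper itself uses Filebench workloads, not this harness) is to time the same fsync-heavy write loop on a native Ext4 directory and on a FUSE-backed directory. The mount paths below are placeholders, not paths from the paper.

```python
# Simple write-throughput harness for comparing two mount points.
# Paths are placeholders; adjust to an Ext4 directory and a FUSE mount.

import os, time

def write_throughput(dirpath, n_files=200, size=1 << 20):
    payload = os.urandom(size)
    start = time.perf_counter()
    for i in range(n_files):
        path = os.path.join(dirpath, f"bench_{i}.dat")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())      # push the write all the way to the device
    elapsed = time.perf_counter() - start
    return n_files * size / elapsed / 1e6   # MB/s

if __name__ == "__main__":
    native = write_throughput("/mnt/ext4/bench")   # placeholder path
    fused = write_throughput("/mnt/fuse/bench")    # placeholder path
    print(f"relative degradation: {100 * (1 - fused / native):.1f}%")
```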
{"title":"Performance and Resource Utilization of FUSE User-Space File Systems","authors":"Bharath Kumar Reddy Vangoor, Prafful Agarwal, Manu Mathew, Arun Ramachandran, Swaminathan Sivaraman, Vasily Tarasov, E. Zadok","doi":"10.1145/3310148","DOIUrl":"https://doi.org/10.1145/3310148","url":null,"abstract":"Traditionally, file systems were implemented as part of operating systems kernels, which provide a limited set of tools and facilities to a programmer. As the complexity of file systems grew, many new file systems began being developed in user space. Low performance is considered the main disadvantage of user-space file systems but the extent of this problem has never been explored systematically. As a result, the topic of user-space file systems remains rather controversial: while some consider user-space file systems a “toy” not to be used in production, others develop full-fledged production file systems in user space. In this article, we analyze the design and implementation of a well-known user-space file system framework, FUSE, for Linux. We characterize its performance and resource utilization for a wide range of workloads. We present FUSE performance and also resource utilization with various mount and configuration options, using 45 different workloads that were generated using Filebench on two different hardware configurations. We instrumented FUSE to extract useful statistics and traces, which helped us analyze its performance bottlenecks and present our analysis results. Our experiments indicate that depending on the workload and hardware used, performance degradation (throughput) caused by FUSE can be completely imperceptible or as high as −83%, even when optimized; and latencies of FUSE file system operations can be increased from none to 4× when compared to Ext4. On the resource utilization side, FUSE can increase relative CPU utilization by up to 31% and underutilize disk bandwidth by as much as −80% compared to Ext4, though for many data-intensive workloads the impact was statistically indistinguishable. Our conclusion is that user-space file systems can indeed be used in production (non-“toy”) settings, but their applicability depends on the expected workloads.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128665544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

CrashMonkey and ACE
Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, P. Raju, Vijay Chidambaram
DOI: 10.1145/3320275 (ACM Transactions on Storage, published 2019-04-20)
We present CrashMonkey and Ace, a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey is a record-and-replay framework which tests a given workload on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. Ace automatically generates all the workloads to be run on the target file system. We build CrashMonkey and Ace based on a new approach to test file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last 5 years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly created file system, and that all reported bugs result from crashes after fsync()-related system calls. CrashMonkey and Ace are able to find 24 out of the 26 crash-consistency bugs reported in the last 5 years. Our tools also revealed 10 new crash-consistency bugs in widely used, mature Linux file systems, 7 of which existed in the kernel since 2014. Additionally, our tools found a crash-consistency bug in a verified file system, FSCQ. The new bugs result in severe consequences like broken rename atomicity, loss of persisted files and directories, and data loss.
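The "bounded" part of bounded black-box crash testing is easy to illustrate: fix a small set of operations and arguments and enumerate every sequence up to a length bound. The toy Python generator below is our illustration, not Ace's actual workload generator; the operation set and argument pool are deliberately tiny.

```python
# Toy enumeration of a bounded workload space: every sequence of at most
# three (operation, file) pairs drawn from small, fixed pools.

from itertools import product

OPS = ["creat", "write", "rename", "unlink", "fsync"]
FILES = ["A", "B"]

def bounded_workloads(max_ops=3):
    """Yield every sequence of up to max_ops (operation, file) pairs."""
    for length in range(1, max_ops + 1):
        for combo in product(product(OPS, FILES), repeat=length):
            yield list(combo)

if __name__ == "__main__":
    workloads = list(bounded_workloads())
    print(len(workloads), "workloads in the bounded space")   # 10 + 100 + 1000
    print(workloads[0], "...", workloads[-1])
```

Each generated workload would then be run on the target file system, crashes simulated at persistence points, and the recovered state checked for consistency, which is the role the abstract describes for CrashMonkey.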
{"title":"CrashMonkey and ACE","authors":"Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, P. Raju, Vijay Chidambaram","doi":"10.1145/3320275","DOIUrl":"https://doi.org/10.1145/3320275","url":null,"abstract":"We present CrashMonkey and Ace, a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey is a record-and-replay framework which tests a given workload on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. Ace automatically generates all the workloads to be run on the target file system. We build CrashMonkey and Ace based on a new approach to test file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last 5 years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly created file system, and that all reported bugs result from crashes after fsync()-related system calls. CrashMonkey and Ace are able to find 24 out of the 26 crash-consistency bugs reported in the last 5 years. Our tools also revealed 10 new crash-consistency bugs in widely used, mature Linux file systems, 7 of which existed in the kernel since 2014. Additionally, our tools found a crash-consistency bug in a verified file system, FSCQ. The new bugs result in severe consequences like broken rename atomicity, loss of persisted files and directories, and data loss.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127747867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

CGraph
Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, Ligang He
DOI: 10.1145/3319406 (ACM Transactions on Storage, published 2019-04-20)
Distributed graph processing platforms usually need to handle massive numbers of Concurrent iterative Graph Processing (CGP) jobs issued for different purposes. However, existing distributed systems face a high ratio of data access cost to computation for the CGP jobs, which incurs low throughput. We observed that there are strong spatial and temporal correlations among the data accesses issued by different CGP jobs, because these concurrently running jobs usually need to repeatedly traverse the shared graph structure for the iterative processing of each vertex. Based on this observation, this article proposes CGraph, a distributed storage and processing system that efficiently handles the underlying static or evolving graph for CGP jobs to achieve high throughput. It uses a data-centric load-trigger-pushing model, together with several optimizations, to let the CGP jobs efficiently share the graph structure data in the cache/memory, and their accesses to it, by fully exploiting such correlations; the graph structure data is decoupled from the vertex state associated with each job. CGraph can deliver much higher throughput for CGP jobs by effectively reducing their average ratio of data access cost to computation. Experimental results show that CGraph improves the throughput of CGP jobs by up to 3.47× in comparison with existing solutions on distributed platforms.
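The data-centric idea can be sketched in a few lines (our simplification, not CGraph's implementation): the shared graph structure is streamed in once per iteration and every concurrent job's update function is driven over each loaded chunk, while per-job vertex state stays private.

```python
# Our sketch of sharing one structure scan across concurrent graph jobs.

def run_iteration(graph_chunks, jobs):
    """graph_chunks: iterable of lists of (src, dst) edges.
    jobs: objects with a per-job `state` dict and an `update(src, dst, state)`."""
    for chunk in graph_chunks:          # the structure is streamed in once...
        for src, dst in chunk:
            for job in jobs:            # ...and reused by every concurrent job
                job.update(src, dst, job.state)

class ReachabilityJob:
    """Example job: propagate reachability from a source vertex."""
    def __init__(self, source):
        self.state = {source: True}

    def update(self, src, dst, state):
        if state.get(src):
            state[dst] = True
```

For instance, run_iteration([[(1, 2), (2, 3)]], [ReachabilityJob(1)]) marks vertices 2 and 3 reachable; the point is that adding more concurrent jobs reuses the same structure scan instead of multiplying it, which is the ratio the abstract says CGraph reduces.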
{"title":"CGraph","authors":"Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, Ligang He","doi":"10.1145/3319406","DOIUrl":"https://doi.org/10.1145/3319406","url":null,"abstract":"Distributed graph processing platforms usually need to handle massive Concurrent iterative Graph Processing (CGP) jobs for different purposes. However, existing distributed systems face high ratio of data access cost to computation for the CGP jobs, which incurs low throughput. We observed that there are strong spatial and temporal correlations among the data accesses issued by different CGP jobs, because these concurrently running jobs usually need to repeatedly traverse the shared graph structure for the iterative processing of each vertex. Based on this observation, this article proposes a distributed storage and processing system CGraph for the CGP jobs to efficiently handle the underlying static/evolving graph for high throughput. It uses a data-centric load-trigger-pushing model, together with several optimizations, to enable the CGP jobs to efficiently share the graph structure data in the cache/memory and their accesses by fully exploiting such correlations, where the graph structure data is decoupled from the vertex state associated with each job. It can deliver much higher throughput for the CGP jobs by effectively reducing their average ratio of data access cost to computation. Experimental results show that CGraph improves the throughput of the CGP jobs by up to 3.47× in comparison with existing solutions on distributed platforms.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131245406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}