Everyone Loves File
Bradley C. Kuszmaul, Matteo Frigo, Justin Mazzola Paluska, Alexander Sandler
Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed using a nonblocking lock-free programming style. Presentation servers maintain no persistent local state, making them scalable and easy to fail over. A non-scalable Paxos-replicated hash table holds the configuration information required to bootstrap the system. An additional B-tree provides conversational multi-key minitransactions for control-plane information. System throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about four times that of a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.
{"title":"Everyone Loves File","authors":"Bradley C. Kuszmaul, Matteo Frigo, Justin Mazzola Paluska, Alexander Sandler","doi":"10.1145/3377877","DOIUrl":"https://doi.org/10.1145/3377877","url":null,"abstract":"Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed using a nonblocking lock-free programming style. Presentation servers maintain no persistent local state making them scalable and easy to failover. A non-scalable Paxos-replicated hash table holds configuration information required to bootstrap the system. An additional B-tree provides conversational multi-key minitransactions for control-plane information. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123754454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EIC Message
As I start my second three-year term as Editor-in-Chief (EiC) of ACM TOS, I would like to take this opportunity to announce some shuffling of Associate Editors. Those leaving are (in alphabetical order) Nitin Agrawal, Sangyeon Cho, Cheng Huang, Onur Mutlu, Michael Swift, Nisha Talagala, Andy Wang, and Tong Zhang. I thank them for their devoted service over the last three years. Without their sacrifice, it would have been impossible to run this journal. I am also appointing a new batch of Associate Editors, namely (again, alphabetically): Yuan-Hao Chang, Jooyoung Hwang, Geoff Kuenning, Philip Shilane, Devesh Tiwari, Swami Sundararaman, and Ming Zhao. As their short bios that follow show, they are all respected experts in the field of storage. I am sure they will contribute immensely to the continued success of our journal.
{"title":"EIC Message","authors":"K. Wagner, Y. Zorian","doi":"10.1145/3372345","DOIUrl":"https://doi.org/10.1145/3372345","url":null,"abstract":"As I start my second three-year term as Editor-in-Chief (EiC) of ACM TOS, I would like to take this opportunity to announce some shuffling of Associate Editors. Those leaving are (in alphabetical order) Nitin Agrawal, Sangyeon Cho, Cheng Huang, Onur Mutlu, Michael Swift, Nisha Talagala, Andy Wang, and Tong Zhang. I thank them for their devoted services the last three years. Without their sacrifice, it would have been impossible to run this journal. I am also appointing a new batch of Associate Editors. Namely (again, alphabetically), Yuan Hao Chang, Jooyoung Hwang, Geoff Kuenning, Philip Shilane, Devesh Tiwali, Swami Sundararaman, and Ming Zhao. As their short bios that follow shows, they are all respected experts in the field of storage. I am sure they will contribute immensely to the continued success of our journal.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"47 16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122410700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
INSTalytics
Muthian Sivathanu, Midhul Vuppalapati, Bhargav S. Gulavani, K. Rajan, Jyoti Leeka, Jayashree Mohan, Piyus Kedia
We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems: instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves the performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via coordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that recovery performance and availability are similar to those of physical replication while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.
{"title":"INSTalytics","authors":"Muthian Sivathanu, Midhul Vuppalapati, Bhargav S. Gulavani, K. Rajan, Jyoti Leeka, Jayashree Mohan, Piyus Kedia","doi":"10.1145/3369738","DOIUrl":"https://doi.org/10.1145/3369738","url":null,"abstract":"We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that recovery performance and availability is similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121822298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Countering Fragmentation in an Enterprise Storage System
R. Kesavan, Matthew Curtis-Maury, V. Devadas, K. Mishra
As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations, as well as internal operations such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can accumulate intra-block fragmentation, which leads to wasted free space. Similarly, wasted space can also occur when a file system writes a collection of blocks out to object storage as a single large object, because the constituent blocks can become free at different times. The impact of fragmentation also depends on the underlying storage media. This article studies each form of fragmentation in the NetApp® WAFL® file system and explains how the file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The article analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.
{"title":"Countering Fragmentation in an Enterprise Storage System","authors":"R. Kesavan, Matthew Curtis-Maury, V. Devadas, K. Mishra","doi":"10.1145/3366173","DOIUrl":"https://doi.org/10.1145/3366173","url":null,"abstract":"As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. Similarly, wasted space can also occur when a file system writes a collection of blocks out to object storage as a single large object, because the constituent blocks can become free at different times. The impact of fragmentation also depends on the underlying storage media. This article studies each form of fragmentation in the NetApp® WAFL®file system, and explains how the file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The article analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121958323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphOne
P. Kumar, H. H. Huang
There is a growing need to perform a diverse set of real-time analytics (batch and stream analytics) on evolving graphs to deliver the value of big data to users. The key requirement from such applications is a data store that supports their diverse data access efficiently while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, whether graph databases or analytics engines, are not designed to achieve high performance for both operations; rather, each excels in one area by keeping a private data store organized to favor its own operations. To address this challenge, we have designed and developed GraphOne, a graph data store that abstracts the data store away from the specialized systems and addresses the fundamental research problems associated with its design. It combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestion (called data visibility) for concurrent execution of diverse classes of real-time graph analytics with only a small amount of data duplication. Experimental results show that GraphOne delivers 11.40× and 5.36× average speedups in ingestion rate against LLAMA and Stinger, the two state-of-the-art dynamic graph systems, respectively. Further, it achieves average speedups of 8.75× and 4.14× against LLAMA and 12.80× and 3.18× against Stinger for BFS and PageRank analytics (batch versions), respectively. GraphOne also gains over a 2,000× speedup against Kickstarter, a state-of-the-art stream analytics engine, when ingesting streaming edges and performing streaming BFS on a synthetic graph, treating the first half of the edges as a base snapshot and the rest as a stream. GraphOne also achieves an ingestion rate two to three orders of magnitude higher than graph databases. Finally, we demonstrate that it is possible to run concurrent stream analytics from the same data store.
{"title":"GraphOne","authors":"P. Kumar, H. H. Huang","doi":"10.1145/3364180","DOIUrl":"https://doi.org/10.1145/3364180","url":null,"abstract":"There is a growing need to perform a diverse set of real-time analytics (batch and stream analytics) on evolving graphs to deliver the values of big data to users. The key requirement from such applications is to have a data store to support their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, either graph databases or analytics engines, are not designed to achieve high performance for both operations; rather, they excel in one area that keeps a private data store in a specialized way to favor their operations only. To address this challenge, we have designed and developed GraphOne, a graph data store that abstracts the graph data store away from the specialized systems to solve the fundamental research problems associated with the data store design. It combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestions (called data visibility) for concurrent execution of diverse classes of real-time graph analytics with only a small data duplication. Experimental results show that GraphOne is able to deliver 11.40× and 5.36× average speedup in ingestion rate against LLAMA and Stinger, the two state-of-the-art dynamic graph systems, respectively. Further, they achieve an average speedup of 8.75× and 4.14× against LLAMA and 12.80× and 3.18× against Stinger for BFS and PageRank analytics (batch version), respectively. GraphOne also gains over 2,000× speedup against Kickstarter, a state-of-the-art stream analytics engine in ingesting the streaming edges and performing streaming BFS when treating first half as a base snapshot and rest as streaming edge in a synthetic graph. GraphOne also achieves an ingestion rate of two to three orders of magnitude higher than graph databases. Finally, we demonstrate that it is possible to run concurrent stream analytics from the same data store.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"483 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114280338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LDJ
Donghyun Kang, Sang-Won Lee, Y. Eom
In this article, we propose a simple but practical and efficient optimization scheme for journaling in ext4, called lightweight data journaling (LDJ). By compressing journaled data prior to writing, LDJ can perform comparably to or even faster than the default ordered journaling (OJ) mode in ext4 on both HDDs and flash storage devices, while still guaranteeing the version consistency of the data journaling (DJ) mode. This surprising result has three main explanations. First, on modern storage devices, the sequential write pattern that dominates in DJ mode increasingly outperforms the random writes of OJ mode. Second, compression significantly reduces the amount of journal writes, which in turn completes writes faster and prolongs the lifespan of storage devices. Third, compression also makes each journal write atomic without issuing an intervening FLUSH command between the journal data blocks and the commit block, halving the number of costly FLUSH calls in LDJ. We have prototyped LDJ by slightly modifying the existing ext4 with jbd2 for journaling and e2fsck for recovery; fewer than 300 lines of source code were changed. We also carried out a comprehensive evaluation using four standard benchmarks and three real applications. Our evaluation results clearly show that LDJ outperforms the OJ mode by up to 9.6× on the real applications.
{"title":"LDJ","authors":"Donghyun Kang, Sang-Won Lee, Y. Eom","doi":"10.1145/3365918","DOIUrl":"https://doi.org/10.1145/3365918","url":null,"abstract":"In this article, we propose a simple but practical and efficient optimization scheme for journaling in ext4, called lightweight data journaling (LDJ). By compressing journaled data prior to writing, LDJ can perform comparable to or even faster than the default ordered journaling (OJ) mode in ext4 on top of both HDDs and flash storage devices, while still guaranteeing the version consistency of the data journaling (DJ) mode. This surprising result can be explained with three main reasons. First, on modern storage devices, the sequential write pattern dominating in DJ mode is more and more high-performant than the random one in OJ mode. Second, the compression significantly reduces the amount of journal writes, which will in turn make the write completion faster and prolong the lifespan of storage devices. Third, the compression also enables the atomicity of each journal write without issuing an intervening FLUSH command between journal data blocks and commit block, thus halving the number of costly FLUSH calls in LDJ. We have prototyped our LDJ by slightly modifying the existing ext4 with jbd2 for journaling and also e2fsck for recovery; less than 300 lines of source code were changed. Also, we carried out a comprehensive evaluation using four standard benchmarks and three real applications. Our evaluation results clearly show that LDJ outperforms the OJ mode by up to 9.6× on the real applications.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130082045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems
Ji Wang, Weidong Bao, Lei Zheng, Xiaomin Zhu, Philip S. Yu
Data centers equipped with large-scale storage systems are critical infrastructures in the era of big data. The enormous number of hard drives in storage systems magnifies the failure probability, which may cause tremendous loss for both data service users and providers. Despite a set of reactive fault-tolerant measures such as RAID, it is still a tough problem to enhance the reliability of large-scale storage systems. Proactive prediction is an effective method to avoid possible hard-drive failures in advance. A series of models based on SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, there remain serious yet unsolved challenges, such as the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, can not only monitor the status of hard drives but also assist in diagnosing the causes of failures. Experiments based on real-world datasets show that the proposed deep architecture is able to assess hard-drive status and predict impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can reveal the degradation progression of hard drives automatically and assist administrators in tracing the causes of hard-drive failures.
{"title":"An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems","authors":"Ji Wang, Weidong Bao, Lei Zheng, Xiaomin Zhu, Philip S. Yu","doi":"10.1145/3340290","DOIUrl":"https://doi.org/10.1145/3340290","url":null,"abstract":"Data centers equipped with large-scale storage systems are critical infrastructures in the era of big data. The enormous amount of hard drives in storage systems magnify the failure probability, which may cause tremendous loss for both data service users and providers. Despite a set of reactive fault-tolerant measures such as RAID, it is still a tough issue to enhance the reliability of large-scale storage systems. Proactive prediction is an effective method to avoid possible hard-drive failures in advance. A series of models based on the SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, there remain some serious yet unsolved challenges like the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, cannot only monitor the status of hard drives but also assist in failure cause diagnoses. The experiments based on real-world datasets show that the proposed deep architecture is able to assess the hard-drive status and predict the impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can reveal the degradation progression of hard drives automatically and assist administrators in tracing the cause of hard drive failures.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116262087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Repair Pipelining for Erasure-coded Storage: Algorithms and Evaluation
Xiaolu Li, Zuoru Yang, Jinhong Li, Runhui Li, P. Lee, Qun Huang, Yuchong Hu
We propose repair pipelining, a technique that speeds up the repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the same as the normal read time for a single block in homogeneous environments. We further design different extensions of repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called ECPipe, and integrate it as a middleware system into two versions of Hadoop Distributed File System (HDFS) (namely, HDFS-RAID and HDFS-3) as well as Quantcast File System. Experiments on a local testbed and Amazon EC2 show that repair pipelining significantly improves the performance of degraded reads and full-node recovery over existing repair techniques.
{"title":"Repair Pipelining for Erasure-coded Storage: Algorithms and Evaluation","authors":"Xiaolu Li, Zuoru Yang, Jinhong Li, Runhui Li, P. Lee, Qun Huang, Yuchong Hu","doi":"10.1145/3436890","DOIUrl":"https://doi.org/10.1145/3436890","url":null,"abstract":"We propose repair pipelining, a technique that speeds up the repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the same as the normal read time for a single block in homogeneous environments. We further design different extensions of repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called ECPipe, and integrate it as a middleware system into two versions of Hadoop Distributed File System (HDFS) (namely, HDFS-RAID and HDFS-3) as well as Quantcast File System. Experiments on a local testbed and Amazon EC2 show that repair pipelining significantly improves the performance of degraded reads and full-node recovery over existing repair techniques.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121651446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZoneTier
Xuchao Xie, Liquan Xiao, D. H. Du
Integrating solid-state drives (SSDs) and host-aware shingled magnetic recording (HA-SMR) drives can potentially build a cost-effective, high-performance storage system. However, existing SSD tiering and caching designs in such a hybrid system are not fully matched to the intrinsic properties of HA-SMR drives because they do not consider how to handle non-sequential writes (NSWs). We propose ZoneTier, a zone-based storage tiering and caching co-design that effectively controls all NSWs by leveraging the host-aware property of HA-SMR drives. ZoneTier exploits the real-time data layout of SMR zones to optimize zone placement, reshapes NSWs generated by zone demotions into SMR-preferred sequential writes, and transforms the unavoidable NSWs into cleaning-friendly write traffic for SMR zones. ZoneTier can be easily extended to host-managed SMR drives using a proactive cleaning policy. We implemented a prototype of ZoneTier with user-space data management algorithms and real SSD and HA-SMR drives, manipulated through the functions provided by libzbc and libaio. Our experiments show that ZoneTier can reduce zone relocation overhead by 29.41% on average, shorten the performance recovery time of HA-SMR drives after cleaning by up to 33.37%, and improve performance by up to 32.31% compared to existing hybrid storage designs.
{"title":"ZoneTier","authors":"Xuchao Xie, Liquan Xiao, D. H. Du","doi":"10.1145/3335548","DOIUrl":"https://doi.org/10.1145/3335548","url":null,"abstract":"Integrating solid-state drives (SSDs) and host-aware shingled magnetic recording (HA-SMR) drives can potentially build a cost-effective high-performance storage system. However, existing SSD tiering and caching designs in such a hybrid system are not fully matched with the intrinsic properties of HA-SMR drives due to their lacking consideration of how to handle non-sequential writes (NSWs). We propose ZoneTier, a zone-based storage tiering and caching co-design, to effectively control all the NSWs by leveraging the host-aware property of HA-SMR drives. ZoneTier exploits real-time data layout of SMR zones to optimize zone placement, reshapes NSWs generated from zone demotions to SMR preferred sequential writes, and transforms the inevitable NSWs to cleaning-friendly write traffics for SMR zones. ZoneTier can be easily extended to match host-managed SMR drives using proactive cleaning policy. We implemented a prototype of ZoneTier with user space data management algorithms and real SSD and HA-SMR drives, which are manipulated by the functions provided by libzbc and libaio. Our experiments show that ZoneTier can reduce zone relocation overhead by 29.41% on average, shorten performance recovery time of HA-SMR drives from cleaning by up to 33.37%, and improve performance by up to 32.31% than existing hybrid storage designs.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125571672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CORES
Weidong Wen, Yang Li, Wenhai Li, Lingfeng Deng, Yanxiang He
The relatively high cost of record deserialization is increasingly becoming the bottleneck of column-based storage systems in tree-structured applications [58]. Due to record transformation in the storage layer, unnecessary processing costs derived from fields and rows irrelevant to queries may be very heavy in nested schemas, significantly wasting computational resources in large-scale analytical workloads. This leads to the question of how to reduce both the deserialization and IO costs of queries with highly selective filters following arbitrary paths in a nested schema. We present CORES (Column-Oriented Regeneration Embedding Scheme) to push highly selective filters down into column-based storage engines, where each filter consists of several filtering conditions on a field. By applying highly selective filters in the storage layer, we demonstrate that both the deserialization and IO costs can be significantly reduced. We show how to introduce fine-grained composition on filtering results. We generalize this technique by two pair-wise operations, rollup and drilldown, such that a series of conjunctive filters can effectively deliver their payloads in a nested schema. The proposed methods are implemented on an open-source platform. For practical purposes, we highlight how to build a column storage engine and how to drive a query efficiently based on a cost model. We apply this design to the nested relational model, especially when hierarchical entities are frequently required by ad hoc queries. The experiments, including a real workload and the modified TPC-H benchmark, demonstrate that CORES improves performance by 0.7×–26.9× compared to state-of-the-art platforms in scan-intensive workloads.
{"title":"CORES","authors":"Weidong Wen, Yang Li, Wenhai Li, Lingfeng Deng, Yanxiang He","doi":"10.1145/3321704","DOIUrl":"https://doi.org/10.1145/3321704","url":null,"abstract":"The relatively high cost of record deserialization is increasingly becoming the bottleneck of column-based storage systems in tree-structured applications [58]. Due to record transformation in the storage layer, unnecessary processing costs derived from fields and rows irrelevant to queries may be very heavy in nested schemas, significantly wasting the computational resources in large-scale analytical workloads. This leads to the question of how to reduce both the deserialization and IO costs of queries with highly selective filters following arbitrary paths in a nested schema. We present CORES (Column-Oriented Regeneration Embedding Scheme) to push highly selective filters down into column-based storage engines, where each filter consists of several filtering conditions on a field. By applying highly selective filters in the storage layer, we demonstrate that both the deserialization and IO costs could be significantly reduced. We show how to introduce fine-grained composition on filtering results. We generalize this technique by two pair-wise operations, rollup and drilldown, such that a series of conjunctive filters can effectively deliver their payloads in nested schema. The proposed methods are implemented on an open-source platform. For practical purposes, we highlight how to build a column storage engine and how to drive a query efficiently based on a cost model. We apply this design to the nested relational model especially when hierarchical entities are frequently required by ad hoc queries. The experiments, including a real workload and the modified TPCH benchmark, demonstrate that CORES improves the performance by 0.7×--26.9× compared to state-of-the-art platforms in scan-intensive workloads.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114112799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}