DEFUSE: An Interface for Fast and Correct User Space File System Access
James Lembke, Pierre-Louis Roman, P. Eugster
Traditionally, the only option for developers was to implement file systems (FSs) via drivers within the operating system kernel. However, there is a growing number of FSs, notably distributed FSs for the cloud, whose interfaces are implemented solely in user space to (i) isolate FS logic, (ii) take advantage of user space libraries, and/or (iii) enable rapid FS prototyping. Common interfaces for implementing FSs in user space exist, but they either do not guarantee POSIX compliance in all cases or suffer from considerable performance penalties due to the high number of wait context switches between kernel and user space processes. We propose DEFUSE, an interface for user space FSs that provides fast accesses while ensuring access correctness and requiring no modifications to applications. DEFUSE achieves significant performance improvements over existing user space FS interfaces thanks to a novel design that drastically reduces the number of wait context switches for FS accesses. Additionally, to ensure access correctness, DEFUSE maintains POSIX compliance for FS accesses thanks to three novel concepts: bypassed file descriptor (FD) lookup, FD stashing, and user space paging. Our evaluation spanning a variety of workloads shows that by reducing the number of wait context switches per workload from as many as 16,000 or 41,000 with FUSE (Filesystem in Userspace) down to 9 on average, DEFUSE doubles performance over existing interfaces for typical workloads and improves it by as much as 10× in certain instances.
ACM Transactions on Storage (TOS), 2022. DOI: 10.1145/3494556
WebAssembly-based Delta Sync for Cloud Storage Services
Jianwei Zheng, Zhenhua Li, Yuanhui Qiu, Hao Lin, HE Xiao, Yang Li, Yun-Fei Liu
Delta synchronization (sync) is crucial to the network-level efficiency of cloud storage services, especially when handling large files with small increments. Practical delta sync techniques are, however, only available for PC clients and mobile apps, but not web browsers—the most pervasive and OS-independent access method. To bridge this gap, prior work concentrates on either reversing the delta sync protocol or utilizing the native client, each struggling with the tradeoffs among efficiency, applicability, and usability and thus forming an “impossible triangle.” Recently, we note the advent of WebAssembly (WASM), a portable binary instruction format that is efficient in both encoding size and load time. In principle, the unique advantages of WASM can let web-based applications enjoy near-native runtime speed without significant cloud-side or client-side changes. Thus, we implement a straightforward WASM-based delta sync solution, WASMrsync, and find that its quasi-asynchronous working manner and conventional In-situ Separate Memory Allocation greatly increase sync time and memory usage. To address these issues, we strategically devise sync-async code decoupling and streaming compilation, together with Informed In-place File Construction. The resulting solution, WASMrsync+, achieves sync time comparable to that of the state-of-the-art (most efficient) solution while using only about half the memory, reconciling the “impossible triangle.”
{"title":"WebAssembly-based Delta Sync for Cloud Storage Services","authors":"Jianwei Zheng, Zhenhua Li, Yuanhui Qiu, Hao Lin, HE Xiao, Yang Li, Yun-Fei Liu","doi":"10.1145/3502847","DOIUrl":"https://doi.org/10.1145/3502847","url":null,"abstract":"Delta synchronization (sync) is crucial to the network-level efficiency of cloud storage services, especially when handling large files with small increments. Practical delta sync techniques are, however, only available for PC clients and mobile apps, but not web browsers—the most pervasive and OS-independent access method. To bridge this gap, prior work concentrates on either reversing the delta sync protocol or utilizing the native client, all striving around the tradeoffs among efficiency, applicability, and usability and thus forming an “impossible triangle.” Recently, we note the advent of WebAssembly (WASM), a portable binary instruction format that is efficient in both encoding size and load time. In principle, the unique advantages of WASM can make web-based applications enjoy near-native runtime speed without significant cloud-side or client-side changes. Thus, we implement a straightforward WASM-based delta sync solution, WASMrsync, finding its quasi-asynchronous working manner and conventional In-situ Separate Memory Allocation greatly increase sync time and memory usage. To address them, we strategically devise sync-async code decoupling and streaming compilation, together with Informed In-place File Construction. The resulting solution, WASMrsync+, achieves comparable sync time as the state-of-the-art (most efficient) solution with nearly only half of memory usage, letting the “impossible triangle” reach a reconciliation.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121145288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Donag: Generating Efficient Patches and Diffs for Compressed Archives
Michael J. May
Differencing between compressed archives is a common task in file management and synchronization. Applications include source code distribution, application updates, and document synchronization. General-purpose binary differencing tools can create and apply patches to compressed archives, but they do not consider the archive’s internal structure or the file lifecycle. Therefore, they miss opportunities to save space based on the archive’s internal structure and metadata. To address this gap, we develop a content-aware, format-independent theory for differencing on compressed archives and propose a canonical form and digest for compressed archives. Building on these, we present Donag, a content-aware differencing and patching algorithm that produces smaller patches than general-purpose binary differencing tools on versioned archives by exploiting the compressed archives’ internal structure. Donag uses the VCDiff and BSDiff engines internally. We compare Donag’s patches to ones produced by bsdiff, xdelta3, and Delta++ on three classes of compressed archives: open-source code repositories, large and small applications, and office productivity documents (DOCX, XLSX, PPTX). Donag’s patches are typically 10% to 89% smaller than those produced by bsdiff, xdelta3, and Delta++, with reasonable memory overhead and throughput on commodity hardware. In the worst case, Donag’s patches are negligibly larger.
ACM Transactions on Storage (TOS), 2022. DOI: 10.1145/3507919
Building GC-free Key-value Store on HM-SMR Drives with ZoneFS
Yiwen Zhang, Ting Yao, Ji-guang Wan, Changsheng Xie
Host-managed shingled magnetic recording (HM-SMR) drives offer a capacity advantage for harnessing the explosive growth of data. For key-value (KV) stores based on log-structured merge trees (LSM-trees), the HM-SMR drive is an ideal solution owing to its capacity, predictable performance, and economical cost. However, building an LSM-tree-based KV store on HM-SMR drives presents severe challenges in maintaining performance and space utilization efficiency due to the redundant cleaning processes for applications and storage devices (i.e., compaction and garbage collection). To eliminate the overhead of on-disk garbage collection (GC) and improve compaction efficiency, this article presents GearDB, a GC-free KV store tailored for HM-SMR drives. GearDB improves write performance and space efficiency through three new techniques: a new on-disk data layout, compaction windows, and a novel gear compaction algorithm. We further augment the read performance of GearDB with a new SSTable layout and a read-ahead mechanism. We implement GearDB with LevelDB and use zonefs to access a real HM-SMR drive. Our extensive experiments confirm that GearDB achieves both high performance and space efficiency, i.e., on average 1.7× and 1.5× better than LevelDB in random write and read, respectively, with up to 86.9% space efficiency.
ACM Transactions on Storage (TOS), 2022. DOI: 10.1145/3502846
Kangaroo: Theory and Practice of Caching Billions of Tiny Objects on Flash
Sara McAllister, Benjamin Berg, Julian Tutuncu-Macias, Juncheng Yang, S. Gunasekar, Jimmy Lu, Daniel S. Berger, Nathan Beckmann, G. Ganger
Many social-media and IoT services have very large working sets consisting of billions of tiny (≈100 B) objects. Large, flash-based caches are important to serving these working sets at acceptable monetary cost. However, caching tiny objects on flash is challenging for two reasons: (i) SSDs can read/write data only in multi-KB “pages” that are much larger than a single object, stressing the limited number of times flash can be written; and (ii) very few bits per cached object can be kept in DRAM without losing flash’s cost advantage. Unfortunately, existing flash-cache designs fall short of addressing these challenges: write-optimized designs require too much DRAM, and DRAM-optimized designs require too many flash writes. We present Kangaroo, a new flash-cache design that optimizes both DRAM usage and flash writes to maximize cache performance while minimizing cost. Kangaroo combines a large, set-associative cache with a small, log-structured cache. The set-associative cache requires minimal DRAM, while the log-structured cache minimizes Kangaroo’s flash writes. Experiments using traces from Meta and Twitter show that Kangaroo achieves DRAM usage close to the best prior DRAM-optimized design, flash writes close to the best prior write-optimized design, and miss ratios better than both. Kangaroo’s design is Pareto-optimal across a range of allowed write rates, DRAM sizes, and flash sizes, reducing misses by 29% over the state of the art. These results are corroborated by analytical models presented herein and with a test deployment of Kangaroo in a production flash cache at Meta.
{"title":"Kangaroo: Theory and Practice of Caching Billions of Tiny Objects on Flash","authors":"Sara McAllister, Benjamin Berg, Julian Tutuncu-Macias, Juncheng Yang, S. Gunasekar, Jimmy Lu, Daniel S. Berger, Nathan Beckmann, G. Ganger","doi":"10.1145/3542928","DOIUrl":"https://doi.org/10.1145/3542928","url":null,"abstract":"Many social-media and IoT services have very large working sets consisting of billions of tiny (≈100 B) objects. Large, flash-based caches are important to serving these working sets at acceptable monetary cost. However, caching tiny objects on flash is challenging for two reasons: (i) SSDs can read/write data only in multi-KB “pages” that are much larger than a single object, stressing the limited number of times flash can be written; and (ii) very few bits per cached object can be kept in DRAM without losing flash’s cost advantage. Unfortunately, existing flash-cache designs fall short of addressing these challenges: write-optimized designs require too much DRAM, and DRAM-optimized designs require too many flash writes. We present Kangaroo, a new flash-cache design that optimizes both DRAM usage and flash writes to maximize cache performance while minimizing cost. Kangaroo combines a large, set-associative cache with a small, log-structured cache. The set-associative cache requires minimal DRAM, while the log-structured cache minimizes Kangaroo’s flash writes. Experiments using traces from Meta and Twitter show that Kangaroo achieves DRAM usage close to the best prior DRAM-optimized design, flash writes close to the best prior write-optimized design, and miss ratios better than both. Kangaroo’s design is Pareto-optimal across a range of allowed write rates, DRAM sizes, and flash sizes, reducing misses by 29% over the state of the art. These results are corroborated by analytical models presented herein and with a test deployment of Kangaroo in a production flash cache at Meta.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133521398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Nil-external Interfaces for Fast Replicated Storage
Aishwarya Ganesan, R. Alagappan, Anthony Rebello, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This article answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance. In this article, we take advantage of nilext interfaces to build high-performance replicated storage. We implement Skyros, a nilext-aware replication protocol that offers high performance by deferring ordering and executing operations until their effects are externalized. We show that exploiting nil-externality offers significant benefit: For many workloads, Skyros provides higher performance than standard consensus-based replication. For example, Skyros offers 3× lower latency while providing the same high throughput offered by throughput-optimized Paxos.
{"title":"Exploiting Nil-external Interfaces for Fast Replicated Storage","authors":"Aishwarya Ganesan, R. Alagappan, Anthony Rebello, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau","doi":"10.1145/3542821","DOIUrl":"https://doi.org/10.1145/3542821","url":null,"abstract":"Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This article answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance. In this article, we take advantage of nilext interfaces to build high-performance replicated storage. We implement Skyros, a nilext-aware replication protocol that offers high performance by deferring ordering and executing operations until their effects are externalized. We show that exploiting nil-externality offers significant benefit: For many workloads, Skyros provides higher performance than standard consensus-based replication. For example, Skyros offers 3× lower latency while providing the same high throughput offered by throughput-optimized Paxos.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130017594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s Locality
Xiangyu Zou, Jingsong Yuan, Philip Shilane, Wen Xia, Haijun Zhang, Xuan Wang
Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. This results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and this leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or SSD, but fragmentation continues to lower restore and GC performance. Investigating the locality issue, we design a method to flatten the hyper-dimensional structured deduplicated data to a one-dimensional format, which is based on classification of each chunk’s lifecycle, and this creates our proposed data layout. Furthermore, we present a novel management-friendly deduplication framework, called MFDedup, that applies our data layout and maintains locality as much as possible. Specifically, we use two key techniques in MFDedup: Neighbor-duplicate-focus indexing (NDF) and Across-version-aware Reorganization scheme (AVAR). NDF performs duplicate detection against a previous backup, then AVAR rearranges chunks with an offline and iterative algorithm into a compact, sequential layout, which nearly eliminates random I/O during file restores after deduplication. Evaluation results with five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster due to the improved data layout. While the rearranging stage introduces overheads, they are more than offset by a nearly zero-overhead GC process. Moreover, the NDF index only requires indices for two backup versions, whereas the traditional index grows with the number of versions retained.
{"title":"From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s Locality","authors":"Xiangyu Zou, Jingsong Yuan, Philip Shilane, Wen Xia, Haijun Zhang, Xuan Wang","doi":"10.1145/3507921","DOIUrl":"https://doi.org/10.1145/3507921","url":null,"abstract":"Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. This results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and this leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or SSD, but fragmentation continues to lower restore and GC performance. Investigating the locality issue, we design a method to flatten the hyper-dimensional structured deduplicated data to a one-dimensional format, which is based on classification of each chunk’s lifecycle, and this creates our proposed data layout. Furthermore, we present a novel management-friendly deduplication framework, called MFDedup, that applies our data layout and maintains locality as much as possible. Specifically, we use two key techniques in MFDedup: Neighbor-duplicate-focus indexing (NDF) and Across-version-aware Reorganization scheme (AVAR). NDF performs duplicate detection against a previous backup, then AVAR rearranges chunks with an offline and iterative algorithm into a compact, sequential layout, which nearly eliminates random I/O during file restores after deduplication. Evaluation results with five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster due to the improved data layout. While the rearranging stage introduces overheads, it is more than offset by a nearly-zero overhead GC process. Moreover, the NDF index only requires indices for two backup versions, while the traditional index grows with the number of versions retained.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125046993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RACE: One-sided RDMA-conscious Extendible Hashing
Pengfei Zuo, Qihui Zhou, Jiazhao Sun, Liu Yang, Shuangwu Zhang, Yu Hua, James Cheng, Rongfeng He, Huabing Yan
Memory disaggregation is a promising technique in datacenters with the benefit of improving resource utilization, failure isolation, and elasticity. Hashing indexes have been widely used to provide fast lookup services in distributed memory systems. However, traditional hashing indexes become inefficient for disaggregated memory, since the computing power in the memory pool is too weak to execute complex index requests. To provide efficient indexing services in disaggregated memory scenarios, this article proposes RACE hashing, a one-sided RDMA-Conscious Extendible hashing index with lock-free remote concurrency control and efficient remote resizing. RACE hashing enables all index operations to be efficiently executed by using only one-sided RDMA verbs without involving any compute resource in the memory pool. To support remote concurrent access with high performance, RACE hashing leverages a lock-free remote concurrency control scheme to enable different clients to concurrently operate the same hashing index in the memory pool in a lock-free manner. To resize the hash table with low overheads, RACE hashing leverages an extendible remote resizing scheme to reduce extra RDMA accesses caused by extendible resizing and allow concurrent request execution during resizing. Extensive experimental results demonstrate that RACE hashing outperforms state-of-the-art distributed in-memory hashing indexes by 1.4–13.7× in YCSB hybrid workloads.
{"title":"RACE: One-sided RDMA-conscious Extendible Hashing","authors":"Pengfei Zuo, Qihui Zhou, Jiazhao Sun, Liu Yang, Shuangwu Zhang, Yu Hua, James Cheng, Rongfeng He, Huabing Yan","doi":"10.1145/3511895","DOIUrl":"https://doi.org/10.1145/3511895","url":null,"abstract":"Memory disaggregation is a promising technique in datacenters with the benefit of improving resource utilization, failure isolation, and elasticity. Hashing indexes have been widely used to provide fast lookup services in distributed memory systems. However, traditional hashing indexes become inefficient for disaggregated memory, since the computing power in the memory pool is too weak to execute complex index requests. To provide efficient indexing services in disaggregated memory scenarios, this article proposes RACE hashing, a one-sided RDMA-Conscious Extendible hashing index with lock-free remote concurrency control and efficient remote resizing. RACE hashing enables all index operations to be efficiently executed by using only one-sided RDMA verbs without involving any compute resource in the memory pool. To support remote concurrent access with high performance, RACE hashing leverages a lock-free remote concurrency control scheme to enable different clients to concurrently operate the same hashing index in the memory pool in a lock-free manner. To resize the hash table with low overheads, RACE hashing leverages an extendible remote resizing scheme to reduce extra RDMA accesses caused by extendible resizing and allow concurrent request execution during resizing. Extensive experimental results demonstrate that RACE hashing outperforms state-of-the-art distributed in-memory hashing indexes by 1.4–13.7× in YCSB hybrid workloads.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117189816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the Special Section on USENIX ATC 2021
I. Calciu, G. Kuenning
This special section of the ACM Transactions on Storage presents some highlights from the storage-related papers published in the USENIX Annual Technical Conference (ATC’21). Although ATC is a broad conference that covers all practical aspects of systems software, a large proportion of its papers have traditionally been related to storage in some way. ATC’21 has continued this trend. Out of 341 submissions, the authors tagged 121 (35%) with one or more topic labels of “Storage,” “File Systems,” or “Databases and Transactions.” The conference accepted 64 papers (19%), of which 21 (33%) were storage-related. As conference co-chairs, we selected three storage papers to be highlighted in this special section. All three were expanded and retitled by their authors and re-reviewed in fast-track mode by several of their original ATC’21 reviewers. We summarize them here in no particular order.
The first article is “RACE: One-sided RDMA-conscious Extendible Hashing” by Pengfei Zuo, Qihui Zhou, Jiazhao Sun, Liu Yang, Shuangwu Zhang, Yu Hua, James Cheng, Rongfeng He, and Huabing Yan (titled “One-sided RDMA-conscious Extendible Hashing for Disaggregated Memory” in ATC’21). RACE is a client-centric RDMA hash table designed for disaggregated memory running on low-power CPUs. RACE completely bypasses the remote CPU for all key-value store operations and allows the hash table to be resized without impacting concurrent foreground traffic.
The second article, “SmartFVM: A Fast, Flexible, and Scalable Hardware-based Virtualization for Commodity Storage Devices” (originally “A Fast and Flexible Hardware-based Virtualization Mechanism for Computational Storage Devices”), is by Dongup Kwon, Wonsik Lee, Dongryeong Kim, Junehyuk Boo, and Jangwoo Kim. This article introduces a practical and low-overhead solution for virtualizing computational storage devices that uses an FPGA with direct access to an SSD through NVMe. SmartFVM uses hardware-assisted virtualization to remove software-stack overheads while still maintaining isolation, together with a hardware-level orchestration mechanism between the FPGA and the SSD.
The final article is “Power Optimized Deployment of Key-value Stores Using Storage Class Memory” by Hiwot Tadese Kassa, Jason Akers, Mrinmoy Ghosh, Zhichao Cao, Vaibhav Gogte, and Ronald Dreslinski (previously “Improving Performance of Flash-based Key-value Stores Using Storage Class Memory as a Volatile Memory Extension”). It optimizes RocksDB by introducing a second layer of block cache using storage-class memory. The article shows that adding storage-class memory to a smaller, single-socket server results in significant performance improvements for RocksDB in production deployments at Facebook, while improving the cost compared to large two-socket servers with DRAM only.
ACM Transactions on Storage (TOS), 2022. DOI: 10.1145/3519550
SmartFVM: A Fast, Flexible, and Scalable Hardware-based Virtualization for Commodity Storage Devices
Dongup Kwon, Wonsik Lee, Dongryeong Kim, Junehyuk Boo, Jangwoo Kim
A computational storage device incorporating a computation unit inside or near its storage unit is a highly promising technology for maximizing a storage server’s performance. However, to deploy such computational storage devices and realize their full potential in virtualized environments, server architects must resolve a fundamental challenge: cost-effective virtualization. This challenge boils down to three questions: (1) how to virtualize two different hardware units (i.e., computation and storage), (2) how to integrate them to construct virtual computational storage devices, and (3) how to provide them to users. However, the existing methods for computational storage virtualization suffer severely from low performance and high costs due to the lack of hardware-assisted virtualization support. In this work, we propose SmartFVM-Engine, an FPGA card designed to maximize the performance and cost-effectiveness of computational storage virtualization. SmartFVM-Engine introduces three key ideas to achieve these design goals. First, it achieves high virtualization performance by applying hardware-assisted virtualization to both the computation and storage units. Second, it further improves performance by applying hardware-assisted resource orchestration to the virtualized units. Third, it achieves high cost-effectiveness by dynamically constructing and scheduling virtual computational storage devices. To the best of our knowledge, this is the first work to implement a hardware-assisted virtualization mechanism for modern computational storage devices.
{"title":"SmartFVM: A Fast, Flexible, and Scalable Hardware-based Virtualization for Commodity Storage Devices","authors":"Dongup Kwon, Wonsik Lee, Dongryeong Kim, Junehyuk Boo, Jangwoo Kim","doi":"10.1145/3511213","DOIUrl":"https://doi.org/10.1145/3511213","url":null,"abstract":"A computational storage device incorporating a computation unit inside or near its storage unit is a highly promising technology to maximize a storage server’s performance. However, to apply such computational storage devices and take their full potential in virtualized environments, server architects must resolve a fundamental challenge: cost-effective virtualization. This critical challenge can be directly addressed by the following questions: (1) how to virtualize two different hardware units (i.e., computation and storage), and (2) how to integrate them to construct virtual computational storage devices, and (3) how to provide them to users. However, the existing methods for computational storage virtualization severely suffer from their low performance and high costs due to the lack of hardware-assisted virtualization support. In this work, we propose SmartFVM-Engine, an FPGA card designed to maximize the performance and cost-effectiveness of computational storage virtualization. SmartFVM-Engine introduces three key ideas to achieve the design goals. First, it achieves high virtualization performance by applying hardware-assisted virtualization to both computation and storage units. Second, it further improves the performance by applying hardware-assisted resource orchestration for the virtualized units. Third, it achieves high cost-effectiveness by dynamically constructing and scheduling virtual computational storage devices. To the best of our knowledge, this is the first work to implement a hardware-assisted virtualization mechanism for modern computational storage devices.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129305193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}