On Latency Awareness with Delayed Hits. Gil Einziger, Nadav Keren, Gabriel Scalosub. Proceedings of the 16th ACM International Conference on Systems and Storage, June 5, 2023. doi:10.1145/3579370.3594752

We consider a new locality pattern in the form of burstiness to improve cache effectiveness in workflows where items are requested in possibly infrequent yet costly batches. Adding a cache that handles only bursty items to existing state-of-the-art algorithms shows a significant improvement in overall average time per query.
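To make the idea in this abstract concrete, here is a minimal sketch of an auxiliary cache that admits only bursty items, sitting in front of an unchanged main cache. All names, thresholds, and the burst test itself are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch (not the paper's algorithm): a small side cache that
# admits an item only once it has been requested `burst_k` times within a
# short time window, leaving everything else to the main cache policy.
import time
from collections import OrderedDict, defaultdict

class BurstCache:
    def __init__(self, capacity, burst_k=3, window=1.0):
        self.capacity = capacity
        self.burst_k = burst_k          # requests within `window` that count as a burst
        self.window = window            # seconds
        self.store = OrderedDict()      # LRU order among admitted bursty items
        self.recent = defaultdict(list) # recent request timestamps per key

    def _is_bursty(self, key, now):
        ts = [t for t in self.recent[key] if now - t <= self.window]
        ts.append(now)
        self.recent[key] = ts
        return len(ts) >= self.burst_k

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # refresh LRU position
            return self.store[key]
        return None                      # miss: fall through to the main cache

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if not self._is_bursty(key, now):
            return False                 # not bursty: leave it to the main cache
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least-recently-used
        return True
```

A burst of three requests within the window admits the item; isolated requests never pollute the side cache.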
Predicting GPU Failures With High Precision Under Deep Learning Workloads. Heting Liu, Zhichao Li, Cheng Tan, Rongqiu Yang, Guohong Cao, Zherui Liu, Chuanxiong Guo. doi:10.1145/3579370.3594777

Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. In large-scale GPU clusters, GPU failures are inevitable and may cause severe consequences: they disrupt distributed training, crash inference services, and lead to service-level-agreement violations. In this paper, we study the problem of predicting GPU failures using machine learning (ML) models in order to mitigate their impact. We train prediction models on a four-month production dataset with 350 million entries at ByteDance. We observe that classic prediction models (GBDT, MLP, LSTM, and 1D-CNN) do not perform well: their predictions are inaccurate and unstable over time. We propose several techniques to improve the precision and stability of predictions, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the proposed techniques on production workloads, where they improve prediction precision from 46.3% to 85.4%.
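The "sliding training" technique mentioned in this abstract can be sketched as periodically retraining on only the most recent window of samples, so the model tracks drift in failure patterns instead of averaging over stale history. The class and toy "model" below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of sliding training: keep a bounded window of recent samples and
# retrain from scratch on that window whenever new data arrives, so the
# model follows distribution drift. Names here are hypothetical.
from collections import deque

class SlidingTrainer:
    def __init__(self, window_size, train_fn):
        self.window = deque(maxlen=window_size)  # old samples fall out automatically
        self.train_fn = train_fn                 # e.g. fits a GBDT/MLP in practice
        self.model = None

    def add_batch(self, samples):
        self.window.extend(samples)
        # Retrain on the current window only, not on all history.
        self.model = self.train_fn(list(self.window))

# Toy usage: the "model" is just the mean label of the window.
trainer = SlidingTrainer(window_size=4, train_fn=lambda s: sum(s) / len(s))
trainer.add_batch([0, 0, 1, 1])
trainer.add_batch([1, 1])   # the two oldest samples are evicted from the window
print(trainer.model)        # mean over the window [1, 1, 1, 1] -> 1.0
```

The key design choice is that eviction is implicit: `deque(maxlen=...)` discards the oldest samples as new ones arrive.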
Efficient Hashing of Sparse Virtual Disks. Nir Soffer, Erez Waisbard. doi:10.1145/3579370.3594748

Verifying the integrity of a file is a fundamental operation in file transfer. Common tools compute a short hash value that is sent along with the file, but computing this value requires reading the entire file; if the file is huge, the process is slow. We introduce blkhash, a novel hash algorithm optimized for disk images that is up to four orders of magnitude faster than commonly used tools. We implemented a new command-line tool and library that can be used in the virtualization space for verifying storage management operations. Our approach can significantly benefit use cases such as (1) very fast computation of a virtual disk's hash value in software-defined storage, and (2) verifying the content of an entire disk image as part of supply-chain integrity verification or in the context of confidential computing.
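One way a block-oriented hash can exploit sparseness is to hash fixed-size blocks independently and reuse a single precomputed digest for every all-zero block (a hole in a sparse image). The sketch below illustrates that general idea only; it is not the blkhash algorithm itself, and `sparse_hash` is a hypothetical name.

```python
# Illustrative sketch: hash a disk image block by block, substituting one
# precomputed digest for every all-zero block so holes cost almost nothing.
# This shows the shortcut idea, NOT the actual blkhash construction.
import hashlib

BLOCK = 64 * 1024
ZERO_DIGEST = hashlib.sha256(b"\0" * BLOCK).digest()  # computed once, reused

def sparse_hash(data):
    root = hashlib.sha256()
    for off in range(0, len(data), BLOCK):
        blk = data[off:off + BLOCK].ljust(BLOCK, b"\0")  # pad the tail block
        if blk.count(0) == BLOCK:          # all-zero block: skip hashing it
            root.update(ZERO_DIGEST)
        else:
            root.update(hashlib.sha256(blk).digest())
    return root.hexdigest()

image = b"\0" * BLOCK * 100 + b"payload"   # a mostly-sparse image
print(sparse_hash(image))
```

With a real sparse file, the zero-block branch never even reads the hole from disk, which is where the large speedups come from.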
Reducing The Virtual Memory Overhead in Nested Virtualization. Ori Ben Zur, Shai Bergman, M. Silberstein. doi:10.1145/3579370.3594765

Virtualization has become a critical aspect of modern computing, and with the advent of virtualization-based containers, fast nested virtualization has become increasingly important. Nested virtualization is implemented by emulating virtualization capabilities for the guest, which can result in significant overhead. Another source of overhead stems from the address translation mechanisms used to implement virtualization, which typically cause a mix of slower address translation, frequent guest traps, and loss of granularity in page tables. Our research focuses on guest-managed physical memory, using per-VM memory tags to check each VM's access permissions.
Elastic RAID: Implementing RAID over SSDs with Built-in Transparent Compression. Zheng Gu, Jiangpeng Li, Yong Peng, Yang Liu, T. Zhang. doi:10.1145/3579370.3594773

This paper studies how RAID (redundant array of independent disks) could take full advantage of modern SSDs (solid-state drives) with built-in transparent compression. In current practice, RAID users are forced to choose a specific RAID level (e.g., RAID 10 or RAID 5) with a fixed storage cost vs. speed performance trade-off. The commercial market is witnessing the emergence of a new family of SSDs that can internally perform hardware-based lossless compression on each 4KB LBA (logical block address) block, transparent to the host OS and user applications. Beyond straightforwardly reducing RAID storage cost, such SSDs make it possible to relieve RAID users from being locked into a fixed storage cost vs. speed performance trade-off. In particular, RAID systems could opportunistically leverage higher-than-expected runtime user data compressibility to enable dynamic RAID level conversion, improving speed performance without compromising effective storage capacity. This paper presents techniques to enable and optimize the practical implementation of such elastic RAID systems. We implemented a Linux software-based elastic RAID prototype that supports dynamic conversion between RAID 5 and RAID 10. Compared with a baseline software-based RAID 5, under sufficient runtime data compressibility to convert over 60% of user data from RAID 5 to RAID 10, the elastic RAID improves 4KB random write IOPS (I/Os per second) by 42% and 4KB random read IOPS in degraded mode by 46%, while maintaining the same effective storage capacity.
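The capacity argument behind elastic RAID can be illustrated with back-of-the-envelope arithmetic: if data compresses well enough, the space freed by compression can hold full mirror copies (RAID 10) instead of parity (RAID 5) without shrinking the advertised capacity. The function and numbers below are illustrative assumptions, not the paper's conversion policy.

```python
# Back-of-the-envelope check (illustrative, not the paper's policy):
# RAID 10 stores two compressed copies of the data, so mirroring fits
# whenever 2 * logical_size * compression_ratio <= physical capacity.
def can_mirror(logical_bytes, compression_ratio, physical_bytes):
    needed = 2 * logical_bytes * compression_ratio
    return needed <= physical_bytes

TiB = 2**40
# 1 TiB logical on 1.2 TiB physical: 2:1 compression makes mirroring fit...
print(can_mirror(TiB, 0.5, 1.2 * TiB))   # True
# ...but poorly compressible data (ratio 0.9) forces parity instead.
print(can_mirror(TiB, 0.9, 1.2 * TiB))   # False
```

This is why the conversion is opportunistic: the achievable level depends on runtime compressibility, which the system can only observe, not control.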
Speeding up reconstruction of declustered RAID with special mapping. Svetlana Lazareva, G. Petrunin. doi:10.1145/3579370.3594761

ZFS dRAID [2] is known to lay out data blocks using random permutations, and it is this initial condition that accelerates reconstruction. The question we set out to answer is whether there exists a special permutation that pushes reconstruction speed to its theoretical maximum. We present a solution that uses cyclic matrices for the data layout, currently the best way we have found to extract maximum benefit from the initial declustered RAID configuration.
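A cyclic-matrix layout of the kind this abstract alludes to can be sketched as follows: each row of the layout is the identity permutation of disk indices rotated by the row number, so every row is a permutation and rebuild reads spread evenly across disks. This illustrates the general construction only, not the authors' exact matrices.

```python
# Illustrative cyclic layout (not the paper's exact matrices): row r is
# the disk-index sequence rotated by r, so each row is a permutation of
# all disks and load spreads evenly during reconstruction.
def cyclic_layout(num_disks, num_rows):
    return [[(col + row) % num_disks for col in range(num_disks)]
            for row in range(num_rows)]

for row in cyclic_layout(5, 3):
    print(row)
# Row 0: [0, 1, 2, 3, 4]; row 1: [1, 2, 3, 4, 0]; row 2: [2, 3, 4, 0, 1].
```

Because the rotation is deterministic, the mapping needs no stored permutation tables, unlike a randomly permuted layout.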
Self-Adjusting Cache Advertisement and Selection. Itamar Cohen. doi:10.1145/3579370.3594754

We present a lightweight, self-adjusting algorithm for cache-content advertisement and cache selection. Our algorithm increases the hit ratio and mitigates wasteful, unnecessary cache accesses and cache-content advertisements.
ConfZNS: A Novel Emulator for Exploring Design Space of ZNS SSDs. Inho Song, Myounghoon Oh, B. Kim, Seehwan Yoo, Jaedong Lee, Jongmoo Choi. doi:10.1145/3579370.3594772

The ZNS (Zoned NameSpace) interface shifts much of the storage maintenance responsibility from the underlying SSDs (solid-state drives) to the host. In addition, it opens a new opportunity to exploit the internal parallelism of SSDs at both the hardware and software levels. By orchestrating the mapping between zones and SSD-internal resources and by controlling zone allocation among threads, ZNS SSDs provide a distinct performance trade-off between parallelism and isolation. To understand and explore the design space of ZNS SSDs, we present ConfZNS (Configurable ZNS), an easy-to-configure and timing-accurate emulator based on QEMU. ConfZNS allows users to investigate a variety of ZNS SSD internal architectures and how they perform with existing host software. We validate the accuracy of ConfZNS using real ZNS SSDs and explore the performance characteristics of different ZNS SSD designs with real-world applications and environments such as RocksDB, F2FS, and Docker.
Iterator Interface Extended LSM-tree-based KVSSD for Range Queries. Seungjin Lee, Chang-Gyu Lee, Donghyun Min, Inhyuk Park, Woosuk Chung, A. Sivasubramaniam, Youngjae Kim. doi:10.1145/3579370.3594775

Key-Value SSD (KVSSD) has shown great potential for several important classes of emerging data stores due to its high throughput and low latency. When designing a key-value store with range queries, an LSM-tree is considered a better choice than a hash table due to its key ordering. However, the design space for range queries in LSM-tree-based KVSSDs has yet to be explored, despite range queries being one of the most demanding features. In this paper, we investigate the design constraints in LSM-tree-based KVSSDs from the perspective of range queries and propose three design principles. Based on these principles, we present IterKVSSD, an Iterator interface extended LSM-tree-based KVSSD for range queries. We implement IterKVSSD on OpenSSD Cosmos+, and our evaluation shows that it increases range query throughput by up to 4.13× and 7.22× for random and sequential key distributions, respectively, compared to existing KVSSDs.
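The shape of an iterator-style range-query interface over a sorted key-value store can be sketched as below, in the spirit of the extension this abstract describes. The class and method names are hypothetical illustrations, not IterKVSSD's actual command set.

```python
# Minimal sketch of an iterator-style range query over a sorted KV store
# (hypothetical API, not IterKVSSD's actual interface): the iterator is
# positioned at the lower bound and yields entries in key order until it
# reaches the exclusive upper bound.
import bisect

class SortedKV:
    def __init__(self):
        self.keys, self.vals = [], []

    def put(self, k, v):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            self.vals[i] = v             # update existing key in place
        else:
            self.keys.insert(i, k)       # insert, keeping keys sorted
            self.vals.insert(i, v)

    def iter_range(self, lo, hi):
        # Open an iterator at lo; yield (key, value) pairs while key < hi.
        i = bisect.bisect_left(self.keys, lo)
        while i < len(self.keys) and self.keys[i] < hi:
            yield self.keys[i], self.vals[i]
            i += 1

kv = SortedKV()
for k in ["b", "d", "a", "c"]:
    kv.put(k, k.upper())
print(list(kv.iter_range("b", "d")))   # [('b', 'B'), ('c', 'C')]
```

The point of the iterator abstraction is that the device can stream results incrementally instead of materializing the whole range per query.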
A Smart Inhaler for Medication Adherence. Itai Dabran, Tom Sofer, N. Bitterman. doi:10.1145/3579370.3594744

Asthma is a common inflammatory condition affecting more than 7 million children in the US alone, and tens of millions more globally. Despite effective preventive medications, medication adherence in children and adolescents is often below 50% [1]. In this paper, we present a novel personalized IoT-based system, integrated into their daily life, for improving children's adherence to inhaler use.