Pub Date: 2019-08-20 DOI: 10.1109/NVMSA.2019.8863524
Duwon Hong, Myungsuk Kim, Jisung Park, Myoungsoo Jung, Jihong Kim
Copyback operations can improve the performance of data migrations in SSDs, but they are rarely used because of their error propagation problem. In this paper, we propose an integrated approach that maximizes the efficiency of copyback operations without compromising data reliability. First, we propose a novel per-block error propagation model under consecutive copyback operations. Our model significantly increases the number of successive copybacks by exploiting the aging characteristics of NAND blocks. Second, we devise a resource-efficient error management scheme that can handle successive copybacks in which pages move across multiple blocks with different reliability. Experimental results show that the proposed technique can improve I/O throughput by up to 25% over the existing technique.
{"title":"Improving SSD Performance Using Adaptive Restricted-Copyback Operations","authors":"Duwon Hong, Myungsuk Kim, Jisung Park, Myoungsoo Jung, Jihong Kim","doi":"10.1109/NVMSA.2019.8863524","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-20"}
Pub Date: 2019-08-01 DOI: 10.1109/nvmsa.2019.8863512
{"title":"NVMSA 2019 Message from the General Co-Chairs","authors":"","doi":"10.1109/nvmsa.2019.8863512","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
Pub Date: 2019-08-01 DOI: 10.1109/NVMSA.2019.8863515
Sooyun Lee, Kyuhwa Han, Dongkun Shin
In datacenters and cloud computing, Quality of Service (QoS) is an essential concept because access to shared resources, including solid-state drives (SSDs), must be ensured. The previously proposed workload-aware budget compensation (WA-BC) scheduling algorithm is a device I/O scheduler for guaranteeing performance isolation among multiple virtual machines sharing an SSD. This paper aims to resolve the following three shortcomings of WA-BC: (1) it is applicable only to SSDs that support SR-IOV, (2) it is unfit for various types of workloads, and (3) it manages flash memory blocks separately in an inappropriate manner. We propose the host-level WA-BC (hWA-BC) scheduler, which aims to achieve performance isolation between multiple processes sharing an open-channel SSD.
{"title":"Host-Level Workload-Aware Budget Compensation I/O Scheduling for Open-Channel SSDs","authors":"Sooyun Lee, Kyuhwa Han, Dongkun Shin","doi":"10.1109/NVMSA.2019.8863515","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
Pub Date: 2019-08-01 DOI: 10.1109/NVMSA.2019.8863522
Yusuke Omori, K. Kimura
The emerging technology of byte-addressable nonvolatile memory chips is expected to enable larger main memory and lower power consumption than traditional DRAM. It also enables durable data structures without ordinary file systems. However, alongside the advantages of nonvolatile main memory (NVMM), its expensive write latency and higher energy consumption in comparison with DRAM must be considered. These special characteristics of NVMM require new compiler techniques and OS support as well as new memory architectures. Several NVMM emulators built on real machines have been proposed to facilitate such software and hardware research. Their designs were originally based on a simple coarse-grain delay model that injected additional clock cycles into the read and write requests sent to the memory controller. However, they could not exploit the bank-level parallelism and row-buffer access locality that today's memory modules rely on for performance. Therefore, a fine-grain delay model was recently proposed in which the delay is injected into the primitive memory operations issued by the memory controller. In this paper, we implement both the coarse-grain and the fine-grain delay models on an SoC-FPGA board, together with Linux kernel modifications and several runtime functions. Then, the differences in program behavior between the two models are evaluated with SPEC CPU programs. The fine-grain model reveals that program execution time is influenced by the frequency of NVMM memory requests rather than by the cache hit ratio. Bank-level parallelism and row-buffer access locality also affect the memory access delay, and the fine-grain model shows lower execution time than the coarse-grain model for four of fourteen programs, even when the former has longer total write latency.
{"title":"Performance Evaluation on NVMM Emulator Employing Fine-Grain Delay Injection","authors":"Yusuke Omori, K. Kimura","doi":"10.1109/NVMSA.2019.8863522","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
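The difference between the coarse-grain and fine-grain delay models described in the abstract above can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the cycle counts, the request format `(bank, row, is_write)`, and the choice of which primitive operations receive extra delay are all hypothetical and are not taken from the paper. Bank-level parallelism is not modeled; delays are simply summed.

```python
# Hypothetical extra cycles; illustrative values only, not figures from the paper.
COARSE_READ_DELAY = 20    # added to every read request
COARSE_WRITE_DELAY = 60   # added to every write request
FINE_ACTIVATE_DELAY = 15  # added only when a new row must be opened in a bank
FINE_WRITE_DELAY = 45     # added to every column write command


def coarse_delay(requests):
    """Coarse-grain model: a fixed extra delay per read or write request."""
    return sum(
        COARSE_WRITE_DELAY if is_write else COARSE_READ_DELAY
        for _bank, _row, is_write in requests
    )


def fine_delay(requests):
    """Fine-grain model: extra delay per primitive operation.

    A request that hits the currently open row of its bank skips the
    activate delay, so row-buffer locality lowers the injected delay.
    """
    open_rows = {}  # bank -> currently open row
    total = 0
    for bank, row, is_write in requests:
        if open_rows.get(bank) != row:
            total += FINE_ACTIVATE_DELAY
            open_rows[bank] = row
        if is_write:
            total += FINE_WRITE_DELAY
    return total


# Eight reads to one row versus eight reads to eight different rows.
sequential = [(0, 7, False)] * 8
scattered = [(0, row, False) for row in range(8)]
print(coarse_delay(sequential), fine_delay(sequential))
print(coarse_delay(scattered), fine_delay(scattered))
```

In this sketch the coarse-grain model charges both access patterns the same delay, while the fine-grain model charges the row-local pattern far less, which is the kind of locality effect the abstract says a coarse-grain model cannot capture.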
Pub Date: 2019-08-01 DOI: 10.1109/NVMSA.2019.8863519
Lei Han, Shangzhen Tan, Bin Xiao, Chenlin Ma, Z. Shao
Erasure codes such as Cauchy Reed-Solomon codes have been gaining ever-increasing importance for fault tolerance in SSD-based RAID arrays. However, erasure coding on a processor-based RAID controller relies on Galois field arithmetic to perform matrix-vector multiplication, which increases the computational complexity and leads to a huge number of memory accesses. In this paper, we investigate utilizing ReRAM to improve erasure coding performance. We propose Re-RAID, which uses ReRAM as main memory in both RAID and SSD controllers so that erasure coding can be processed on ReRAM. We also propose a confluent Cauchy-Vandermonde matrix as the generator matrix for encoding. By doing this, Re-RAID can distribute the reconstruction tasks for a single failure to the SSDs, and the SSDs can then recover the data with ReRAM memory. Experimental results show that we can improve the encoding and decoding performance by up to $598\times$ and $251\times$, respectively.
{"title":"Optimizing Cauchy Reed-Solomon Coding via ReRAM Crossbars in SSD-based RAID Systems","authors":"Lei Han, Shangzhen Tan, Bin Xiao, Chenlin Ma, Z. Shao","doi":"10.1109/NVMSA.2019.8863519","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
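The Galois field matrix-vector multiplication that the abstract above identifies as the costly step can be sketched in software as follows. This is a minimal illustration, assuming GF(2^8) with the reduction polynomial 0x11D and a plain Cauchy matrix; it is not the confluent Cauchy-Vandermonde construction or the ReRAM-based processing proposed in the paper, and all function names are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements: carry-less multiply reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse, computed as a^254 since a^255 == 1 for a != 0."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def cauchy_matrix(m, k):
    """m x k Cauchy matrix with entries 1 / (x_i + y_j), x_i = i, y_j = m + j.

    Addition in GF(2^8) is XOR; the x and y sets are disjoint, so no
    denominator is zero (requires m + k <= 256).
    """
    return [[gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]


def encode(data, m):
    """Compute m parity symbols as the matrix-vector product over GF(2^8)."""
    parity = []
    for row in cauchy_matrix(m, len(data)):
        acc = 0
        for coeff, symbol in zip(row, data):
            acc ^= gf_mul(coeff, symbol)
        parity.append(acc)
    return parity


def recover_one(data, lost, parity_row, parity_value):
    """Rebuild data[lost] from one parity symbol and the surviving data symbols."""
    acc = parity_value
    for j, symbol in enumerate(data):
        if j != lost:
            acc ^= gf_mul(parity_row[j], symbol)
    return gf_mul(gf_inv(parity_row[lost]), acc)


data = [0x12, 0x34, 0x56, 0x78]
parity = encode(data, 2)
matrix = cauchy_matrix(2, len(data))
damaged = [0x12, 0x34, 0x00, 0x78]  # symbol 2 is treated as lost
print(hex(recover_one(damaged, 2, matrix[0], parity[0])))
```

Every parity symbol needs one field multiplication per data symbol, which is why a processor-based controller spends so many operations and memory accesses on encoding.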
Pub Date: 2019-08-01 DOI: 10.1109/NVMSA.2019.8863523
Cheng Ji, Lun Wang, Qiao Li, Congming Gao, Liang Shi, Chia-Lin Yang, C. Xue
Solid-state drives (SSDs) are the mainstream solution for massive data storage today. For modern computer systems, fair resource assignment is a critical design consideration and has drawn great interest in recent years. Although several host-side I/O fairness schedulers have been proposed for SSDs, process fairness can still be dramatically degraded if garbage collection (GC) is triggered on the device side. A GC operation can block I/O requests, which causes unpredictable read/write latency variation and further impacts fairness between processes. This paper proposes Fair-GC, a novel coordinated host and device I/O scheduling strategy that achieves true fairness in the presence of GC interference. The key idea is to orchestrate GC operations inside SSDs carefully so that the performance of a process is penalized by GC to the same (or a comparable) degree as when it runs alone. In this way, the I/O fairness enforced by the host-side scheduler is preserved in the presence of GC. Furthermore, our scheduler ensures that the timeslice of a process maintained at the host-side scheduler is updated in a timely manner, avoiding unnecessary slowdown for the sake of fairness. Experimental results with a wide range of workloads verify that the proposed technique can achieve fairness as well as improve the throughput significantly. Compared to a conventional fairness-based I/O scheduler, Fair-GC can reduce the slowdown of real applications by up to 99% and improve the throughput by as much as 225%.
{"title":"Fair Down to the Device: A GC-Aware Fair Scheduler for SSD","authors":"Cheng Ji, Lun Wang, Qiao Li, Congming Gao, Liang Shi, Chia-Lin Yang, C. Xue","doi":"10.1109/NVMSA.2019.8863523","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
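The fairness goal stated in the abstract above (each process slowed down to a comparable degree relative to running alone) can be expressed numerically. The abstract does not define its metric, so the sketch below assumes one common formulation: slowdown is shared execution time divided by standalone execution time, and fairness is the ratio of the smallest to the largest slowdown. The function names and sample timings are hypothetical.

```python
def slowdown(shared_time, alone_time):
    """How many times slower a process runs when sharing the device."""
    if alone_time <= 0:
        raise ValueError("alone_time must be positive")
    return shared_time / alone_time


def fairness(slowdowns):
    """Ratio of min to max slowdown: 1.0 means every process is penalized equally."""
    return min(slowdowns) / max(slowdowns)


# Two processes that each take 10 s alone; one is hit harder when sharing.
slowdowns = [slowdown(30.0, 10.0), slowdown(15.0, 10.0)]
print(slowdowns, fairness(slowdowns))
```

Under this assumed metric, a GC-induced stall that hits only one process lowers the score, even if the host-side scheduler hands out equal timeslices.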
Pub Date: 2019-08-01 DOI: 10.1109/NVMSA.2019.8863517
Aijiao Cui, Zhenxing Chang, Ziming Wang, G. Qu, Huawei Li
Scan-based design-for-testability (DfT) has been widely adopted in modern integrated circuit (IC) design to facilitate manufacturing testing. However, the transitions in scan cells consume considerable test power during testing. The scan hold flip-flop (SHFF) can insulate the transitions in the scan chain from the circuit under test to reduce test power, but it incurs significant area overhead. We propose to solve this problem by adopting a memristor-based D flip-flop (DFF) in the SHFF. The new design breaks down the structure of conventional CMOS scan cells and incorporates memristors into the SHFF to reduce the number of transistors and hence the chip area. The functionality of the proposed design is verified by HSPICE simulation. Compared with conventional SHFF cells, the area overhead is reduced by 26.5%.
{"title":"A Memristor-based Scan Hold Flip-Flop","authors":"Aijiao Cui, Zhenxing Chang, Ziming Wang, G. Qu, Huawei Li","doi":"10.1109/NVMSA.2019.8863517","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
Pub Date: 2019-08-01 DOI: 10.1109/NVMSA.2019.8863514
Somm Kim, Yunji Kang, Dongkun Shin
Open-Channel SSDs are widely studied because of advantages such as predictable latency, efficient data placement, and I/O scheduling. Currently, the Linux kernel includes pblk (the Physical Block Device), a host FTL that supports Open-Channel SSDs. In addition, recent studies have extended the single-threaded architecture of pblk to a multi-threaded architecture: MT-FTL and QBLK. However, both pblk and these recent studies were designed without considering fsync latency. Because the fsync system call is performed synchronously, it has a great effect on the performance of the system. In this paper, we propose FA-FTL, a host FTL that takes fsync latency into account. Experiments show that FA-FTL achieves 141% higher performance than pblk and 119% higher than MT-FTL.
{"title":"fsync-aware Multi-Buffer FTL for Improving the fsync Latency with Open-Channel SSDs","authors":"Somm Kim, Yunji Kang, Dongkun Shin","doi":"10.1109/NVMSA.2019.8863514","journal":"2019 IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA)","publicationDate":"2019-08-01"}
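The abstract above notes that fsync is performed synchronously, so its latency is paid directly by the calling application. A minimal way to observe this from user space is sketched below. It assumes a temporary file on whatever storage backs the default temp directory, does not involve pblk or FA-FTL, and uses a hypothetical helper name and default parameters.

```python
import os
import tempfile
import time


def measure_fsync_latency(num_writes=8, payload_size=4096):
    """Return per-call fsync latencies in seconds for small appended writes."""
    payload = b"\0" * payload_size
    latencies = []
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(num_writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # blocks until the data is reported durable
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


latencies = measure_fsync_latency()
print(f"max fsync latency: {max(latencies) * 1e3:.3f} ms")
```

Because the caller cannot proceed until `os.fsync` returns, any buffering or flushing delay inside the FTL shows up directly in these measurements.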