
2019 35th Symposium on Mass Storage Systems and Technologies (MSST): Latest Publications

Fighting with Unknowns: Estimating the Performance of Scalable Distributed Storage Systems with Minimal Measurement Data
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.00-21
Moo-Ryong Ra, H. Lee
Constructing an accurate performance model for distributed storage systems has been identified as a very difficult problem. Researchers in this area either come up with an involved mathematical model specifically tailored to a target storage system or treat each storage system as a black box and apply machine learning techniques to predict the performance. Both approaches involve a significant amount of effort and data collection, which often takes a prohibitive amount of time to apply to real-world scenarios. In this paper, we propose a simple, yet accurate, performance estimation technique for scalable distributed storage systems. We claim that the total processing capability per IO size is conserved across different mixes of read/write ratios and IO sizes. Based on this hypothesis, we construct a performance model which can be used to estimate the performance of an arbitrarily mixed IO workload. The proposed technique requires only a couple of measurement points per IO size in order to provide accurate performance estimation. Our preliminary results are very promising. Based on two widely-used distributed storage systems (i.e., Ceph and Swift) under different cluster configurations, we show that the total processing capability per IO size indeed remains constant. As a result, our technique was able to provide accurate prediction results.
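The conservation claim suggests a simple closed-form estimator. Below is a minimal sketch of one plausible reading of the hypothesis (an illustration, not the authors' published model): if each read and each write consumes a fixed share of a per-IO-size processing budget, the throughput of a mix is the weighted harmonic mean of the two measured anchor points.

```python
# Minimal sketch, assuming the conservation hypothesis means each read
# costs 1/t_read and each write 1/t_write of a fixed per-IO-size budget;
# the mixed throughput is then a weighted harmonic mean of two measured
# anchor points. This is an illustrative reading, not the paper's code.

def estimate_mixed_iops(t_read, t_write, read_fraction):
    """Estimate IOPS of a read/write mix from pure-read and pure-write IOPS."""
    r = read_fraction
    return 1.0 / (r / t_read + (1.0 - r) / t_write)

# Example: 4 KiB IOs measured at 80k IOPS pure-read and 20k IOPS pure-write.
print(round(estimate_mixed_iops(80_000, 20_000, 0.7)))  # ~42105 IOPS for a 70/30 mix
```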
Citations: 1
Long-Term JPEG Data Protection and Recovery for NAND Flash-Based Solid-State Storage
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.000-8
Yu-Chun Kuo, Ruei-Fong Chiu, Ren-Shuo Liu
NAND flash memory is widely used in solid-state storage, including SD cards and eMMC chips, on which JPEG pictures are among the most valuable data. In this work, we study NAND flash memory-aware, long-term JPEG data protection and recovery. Our goal is to increase the robustness of JPEG files stored in flash-based storage and rescue JPEG files that are corrupted due to long-term retention. JPEG files with our proposed protection techniques remain compatible with existing JPEG viewers. We conduct real-system experiments by storing JPEG files on 16 nm, 3-bit-per-cell flash chips and letting them undergo a retention process equivalent to ten years at 25 degrees Celsius. Experimental results show that the proposed techniques can rescue corrupted JPEG files, achieving a PSNR improvement of up to 23.5 dB.
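The rescue quality above is reported as PSNR. For reference, here is a minimal sketch of that standard metric for 8-bit images; the random arrays are stand-ins, not the paper's test data.

```python
import numpy as np

# Minimal sketch of the standard PSNR metric the paper reports (up to
# 23.5 dB improvement after rescue); the image arrays below are stand-ins.

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two same-shaped 8-bit images."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak * peak / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)      # stand-in image
noisy = ref ^ (rng.random(ref.shape) < 0.01).astype(np.uint8)  # sparse bit flips
print(f"{psnr(ref, noisy):.1f} dB")
```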
Citations: 3
Towards Virtual Machine Image Management for Persistent Memory
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.00-11
Jiachen Zhang, Lixiao Cui, Peng Li, Xiaoguang Liu, Gang Wang
Persistent memory's (PM) byte-addressability and high capacity also make it attractive for virtualized environments. Modern virtual machine monitors virtualize PM using either I/O virtualization or memory virtualization. However, I/O virtualization sacrifices PM's byte-addressability, while memory virtualization provides no opportunity for PM image management. In this paper, we enhance QEMU's memory virtualization mechanism. The enhanced system achieves both PM byte-addressability inside virtual machines and PM image management outside them. We also design pcow, a virtual machine image format for PM, which is compatible with our enhanced memory virtualization and supports storage virtualization features including thin provisioning, base images and snapshots. Address translation is performed with the help of the Extended Page Table (EPT), and is thus much faster than in image formats implemented with I/O virtualization. We also optimize pcow for PM's characteristics. The evaluation demonstrates that our scheme boosts overall performance by up to 50x compared with qcow2, an image format implemented with I/O virtualization, and incurs almost no performance overhead compared with native memory virtualization.
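As a rough illustration of the thin-provisioning and base-image features pcow supports, here is a hedged, block-granular copy-on-write sketch. The class and constants are hypothetical and say nothing about pcow's actual on-disk layout or its EPT integration.

```python
# Hypothetical sketch (not the pcow on-disk format) of block-granular
# copy-on-write over a read-only base image: reads fall through to the
# base until a block is first written, which is the behavior that
# thin-provisioned formats with base images expose.

BLOCK = 4096

class CowImage:
    def __init__(self, base):
        self.base = base          # read-only base image (bytes)
        self.delta = {}           # block index -> privately allocated copy

    def read(self, blk):
        if blk in self.delta:
            return self.delta[blk]
        off = blk * BLOCK
        return self.base[off:off + BLOCK]

    def write(self, blk, data):
        assert len(data) == BLOCK
        self.delta[blk] = bytes(data)  # allocate on first write only

base = bytes(BLOCK * 8)              # 8-block base image of zeros
img = CowImage(base)
img.write(3, b"\xab" * BLOCK)
assert img.read(3) == b"\xab" * BLOCK and img.read(0) == bytes(BLOCK)
```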
Citations: 0
CeSR: A Cell State Remapping Strategy to Reduce Raw Bit Error Rate of MLC NAND Flash
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.000-6
Yutong Zhao, Wei Tong, Jingning Liu, D. Feng, Hongwei Qin
Retention errors and program interference errors have been recognized as the two main types of NAND flash errors. Since NAND flash cells in the erased state, which hold the lowest threshold voltage, are least likely to cause program interference and retention errors, existing schemes preprocess the raw data to increase the ratio of cells in the erased state. However, such schemes do not effectively decrease the ratio of cells at the highest threshold voltage, which are most likely to cause program interference and retention errors. In addition, we note that the dominant error type of flash varies with data hotness. Retention errors are not much of a concern for frequently updated hot data, while rarely updated cold data suffers growing retention errors as P/E cycles increase. Furthermore, the effects of these two types of errors on the same cell partially counteract each other. Given the observation that retention errors and program interference errors are both cell-state-dependent, this paper presents a cell state remapping (CeSR) strategy based on the error tendencies of data with different hotness. For different types of data segments, CeSR adopts different flipping schemes to remap the cell states in order to achieve the least error-prone data pattern for written data of each hotness. Evaluation shows that the proposed CeSR strategy can reduce the raw bit error rates of hot and cold data by up to 20.30% and 67.24%, respectively, compared with the state-of-the-art NRC strategy.
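To make the flipping idea concrete, here is a hypothetical sketch. The two-bits-per-cell state mapping (all-ones as erased/lowest Vth, all-zeros as highest Vth) is an assumption for illustration only, not the encoding CeSR actually uses.

```python
# Hypothetical sketch of a flip-based remapping in the spirit of CeSR.
# Assumption for illustration: an MLC cell stores two bits, pair (1,1) is
# the erased/lowest-Vth state and (0,0) the highest-Vth state, so
# inverting a segment swaps the two extreme states.

def count_state(bits, pair):
    # Pair adjacent bits into cells and count cells in the given state.
    cells = zip(bits[0::2], bits[1::2])
    return sum(1 for c in cells if c == pair)

def remap_segment(bits):
    """Return (bits, flip_flag); flip when it reduces highest-Vth cells."""
    high = count_state(bits, (0, 0))
    high_flipped = count_state(bits, (1, 1))   # flipping swaps the extremes
    if high_flipped < high:
        return [1 - b for b in bits], 1        # store inverted data + flag
    return bits, 0

data = [0, 0, 0, 0, 1, 0, 0, 0]               # mostly highest-Vth cells
remapped, flag = remap_segment(data)
print(remapped, flag)                          # inverted, flag = 1
```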
Citations: 7
Adjustable Flat Layouts for Two-Failure Tolerant Storage Systems
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.000-1
T. Schwarz
Systems suffer component failures at sometimes unpredictable rates. Storage systems are no exception; they add redundancy in order to deal with various types of failures. The additional storage constitutes an important capital and operational cost and needs to be dimensioned appropriately. Unfortunately, storage device failure rates are difficult to predict and change over the lifetime of the system. Large disk-based storage centers provide protection against failure at the level of objects. However, this abstraction makes it difficult to adjust to a batch of devices that fail at a higher than anticipated rate. We propose here a solution that uses large pods of storage devices of the same kind, but that can reorganize in response to an increased number of component failures seen elsewhere in the system or to an anticipated higher failure rate such as infant mortality or end-of-life fragility. Here, I present ways of organizing user data and parity data that allow us to move from three-failure tolerance to two-failure tolerance and back. A storage system using disk drives that might suffer from infant mortality can switch from an initially three-failure-tolerant layout to a two-failure-tolerant one once the disks have been burnt in. It gains capacity by shedding failure tolerance that has become unnecessary. A storage system using flash can sacrifice capacity for reliability as its components undergo many write-erase cycles and thereby become less reliable. Adjustable reliability is easy to achieve using a standard layout based on RAID Level 6 stripes, where it is easy to convert components containing user data to ones containing parity data. Here, we present layouts that, unlike the RAID layout, use only exclusive-or operations and do not depend on sophisticated but power-hungry processors. Their main advantage is a noticeable increase in reliability over RAID Level 6.
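As a toy illustration of the only primitive these layouts rely on, the sketch below computes an XOR parity, repairs one lost block, and notes how retiring a parity trades redundancy for capacity. The real flat layouts place parity very differently from this single toy stripe.

```python
# Illustrative sketch only: XOR is the sole operation the proposed layouts
# need. A stripe protected by XOR parity can rebuild any one lost block;
# dropping a parity after burn-in sheds tolerance and returns its device
# to user-data duty (the actual flat layouts are not simple stripes).

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [bytes([i] * 8) for i in (1, 2, 3, 4)]   # four user-data blocks
p = xor_blocks(data)                            # parity over all blocks

# Recover a single lost block from the parity plus the survivors.
lost = 2
recovered = xor_blocks([p] + data[:lost] + data[lost + 1:])
assert recovered == data[lost]

# "Shedding" tolerance after burn-in: stop maintaining p and reuse its
# device for user data, trading redundancy for capacity.
```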
Citations: 0
Parallel all the time: Plane Level Parallelism Exploration for High Performance SSDs
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.000-5
Congming Gao, Liang Shi, C. Xue, Cheng Ji, Jun Yang, Youtao Zhang
Solid state drives (SSDs) are constructed with multiple levels of parallel organization, including channels, chips, dies and planes. Among these levels, plane level parallelism, the last level of parallelism in SSDs, has the strictest restrictions: only operations of the same type that access the same address in different planes can be processed in parallel. In order to maximize access performance, several previous works exploit plane level parallelism for host accesses and internal operations of SSDs. However, our preliminary studies show that plane level parallelism is far from well utilized and can be further improved, because its strict restrictions are hard to satisfy. In this work, a plane-to-die parallel optimization framework is proposed to exploit plane level parallelism by smartly satisfying the strict restrictions at all times. Achieving this objective faces at least two challenges. First, because host access patterns are complex, receiving multiple same-type requests to different planes at the same time is uncommon. Second, there are many internal activities, such as garbage collection (GC), which may violate the restrictions. To solve these challenges, two schemes are proposed in the SSD controller. First, a die level write construction scheme ensures that each write operation always writes N pages of data. Second, going a step further, a die level GC scheme activates GC in the unit of all planes in the same die. Combining die level writes and die level GC, write accesses from both host write operations and GC-induced valid page movements can be processed in parallel at all times. As a result, GC cost and average write latency can be significantly reduced. Experimental results show that the proposed framework significantly improves write performance without impacting read performance.
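A hedged sketch of the die level write construction idea follows; the plane count and function names are illustrative inventions, not the paper's implementation.

```python
from collections import deque

# Hypothetical sketch of die level write construction: buffer incoming
# pages until one page per plane is available, then issue them as a
# single multi-plane program so every plane in the die works in parallel.
# PLANES_PER_DIE and issue_multiplane_program are illustrative names.

PLANES_PER_DIE = 4

class DieWriteConstructor:
    def __init__(self, die_id):
        self.die_id = die_id
        self.pending = deque()

    def submit(self, page):
        self.pending.append(page)
        if len(self.pending) >= PLANES_PER_DIE:
            batch = [self.pending.popleft() for _ in range(PLANES_PER_DIE)]
            self.issue_multiplane_program(batch)

    def issue_multiplane_program(self, batch):
        # One page lands in each plane at the same in-plane address,
        # satisfying the same-type/same-address restriction.
        print(f"die {self.die_id}: program planes 0..{PLANES_PER_DIE - 1}", batch)

d = DieWriteConstructor(0)
for p in ["pgA", "pgB", "pgC", "pgD", "pgE"]:
    d.submit(p)        # first four flush together; pgE waits for a full batch
```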
Citations: 13
XORInc: Optimizing Data Repair and Update for Erasure-Coded Systems with XOR-Based In-Network Computation
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.00005
F. Wang, Yingjie Tang, Yanwen Xie, Xuehai Tang
Erasure coding is widely used in distributed storage systems due to its significant storage efficiency compared with replication at the same fault tolerance level. However, erasure coding introduces heavy cross-rack traffic since (1) repairing a single failed data block needs to read other available blocks from multiple nodes and (2) updating a data block triggers updates for all parity blocks. To alleviate the impact of this traffic on the performance of erasure coding, many works concentrate on designing new transmission schemes that increase bandwidth utilization among multiple storage nodes, but they do not actually reduce network traffic. With the emergence of programmable network devices, the concept of in-network computation has been proposed; the key idea is to offload compute operations onto intermediate network devices. Inspired by this idea, we propose XORInc, a framework that utilizes programmable network devices to XOR data flows from multiple storage nodes, so that XORInc can effectively reduce network traffic (especially cross-rack traffic) and eliminate network bottlenecks. Under XORInc, we design two new transmission schemes, NetRepair and NetUpdate, to optimize the repair and update operations, respectively. We implement XORInc based on HDFS-RAID and SDN to simulate an in-network computation framework. Experiments on a local testbed show that NetRepair reduces the repair time to almost the same as the normal read time and reduces network traffic by up to 41%; meanwhile, NetUpdate reduces the update time and traffic by up to 74% and 30%, respectively.
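As a toy model of the in-network XOR idea (not the XORInc dataplane), the sketch below shows a switch folding the surviving-block flows of a repair into a single aggregated flow, so only one block's worth of data leaves the switch instead of k.

```python
# Toy model of in-network XOR aggregation for repair: a switch XORs the k
# surviving-block flows passing through it and forwards one combined flow.
# The class and its interface are hypothetical, not XORInc's actual design.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class XorSwitch:
    def __init__(self):
        self.acc = None
        self.seen = 0

    def on_packet(self, payload, expected):
        self.acc = payload if self.acc is None else xor_bytes(self.acc, payload)
        self.seen += 1
        if self.seen == expected:        # all upstream flows have arrived
            return self.acc              # forward one aggregated payload
        return None                      # nothing leaves the switch yet

# XOR-coded repair: the lost block equals b0 ^ b1 ^ b2, computed in-network.
b0, b1, b2 = b"\x01" * 4, b"\x02" * 4, b"\x04" * 4
sw = XorSwitch()
for blk in (b0, b1, b2):
    out = sw.on_packet(blk, expected=3)
print(out)   # b'\x07\x07\x07\x07', the reconstructed block, one flow out
```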
Citations: 10
Pattern-based Write Scheduling and Read Balance-oriented Wear-Leveling for Solid State Drivers
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.00-10
Jun Li, Xiaofei Xu, Xiaoning Peng, Jianwei Liao
This paper proposes a pattern-based I/O scheduling mechanism, which identifies frequently written data by its access patterns and dispatches it to the same SSD blocks, choosing blocks with a small erase count. Data in the same block is likely to be invalidated together, so the overhead of garbage collection can be greatly reduced. Moreover, a read balance-oriented wear-leveling scheme is introduced to extend the lifetime of SSDs. Specifically, it moves hot read data from blocks with a small erase count to heavily erased blocks in different chips of the same SSD channel while carrying out wear-leveling. As a result, chip-level internal parallelism of the SSD can be fully exploited to achieve better read throughput. We conduct a series of simulation tests with disk traces of real-world applications on the SSDsim platform. The experimental results show that the newly proposed mechanism can reduce garbage collection overhead by 11.3% and read response time by 12.8% on average, compared with existing scheduling and wear-leveling approaches for SSDs.
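A hypothetical sketch of the pattern idea follows: classify LBAs by observed write frequency and steer each class to its own block. The thresholds and block choices are invented for illustration and are not the paper's policy.

```python
from collections import defaultdict

# Hypothetical sketch: track per-LBA write counts, bucket writes by
# hotness, and steer each bucket to its own active block, preferring
# low-erase-count blocks for frequently rewritten data so co-located
# pages tend to be invalidated together. Thresholds are illustrative.

write_count = defaultdict(int)

def hotness(lba):
    write_count[lba] += 1
    if write_count[lba] >= 8:
        return "hot"
    if write_count[lba] >= 2:
        return "warm"
    return "cold"

def pick_block(bucket, free_blocks):
    # free_blocks: list of (erase_count, block_id); hot data gets the
    # least-erased block so its short-lived pages age a block together.
    free_blocks.sort()
    return free_blocks[0][1] if bucket == "hot" else free_blocks[-1][1]

free = [(120, "blkA"), (15, "blkB"), (60, "blkC")]
for _ in range(8):
    bucket = hotness(7)                          # LBA 7 rewritten often
print(bucket, pick_block(bucket, free))          # hot blkB
print(hotness(42), pick_block("cold", free))     # cold blkA
```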
Citations: 15
Accelerating Relative-error Bounded Lossy Compression for HPC datasets with Precomputation-Based Mechanisms
Pub Date : 2019-05-20 DOI: 10.1109/MSST.2019.00-15
Xiangyu Zou, Tao Lu, Wen Xia, Xuan Wang, Weizhe Zhang, S. Di, Dingwen Tao, F. Cappello
Scientific simulations in high-performance computing (HPC) environments produce vast volumes of data, which may cause a severe I/O bottleneck at runtime and a huge burden on storage space for post-analysis. Unlike traditional data reduction schemes (such as deduplication or lossless compression), error-controlled lossy compression can not only significantly reduce data size but also satisfy user demands on error control. Point-wise relative error bounds (i.e., compression errors that depend on the data values) are widely used by many scientific applications of lossy compression, since such error control adapts automatically to the precision in the dataset. However, point-wise relative-error bounded compression is complicated and time-consuming. In this work, we develop efficient precomputation-based mechanisms in the SZ lossy compression framework. Our mechanisms avoid the costly logarithmic transformation and identify quantization factor values via a fast table lookup, greatly accelerating relative-error bounded compression while retaining excellent compression ratios. In addition, our mechanisms reduce traversal operations in Huffman decoding, and thus significantly accelerate the decompression process in SZ. Experiments with four well-known real-world scientific simulation datasets show that our solution improves the compression rate by about 30% and the decompression rate by about 70% in most cases, making our lossy compression strategy the best in class in most cases.
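A minimal sketch of the table-lookup idea follows, under the assumption that quantization bins grow geometrically with ratio g = (1+eps)/(1-eps). The exponent-indexed table is my illustration of the precomputation, not SZ's actual code.

```python
import math

# Minimal sketch, assuming relative-error quantization with geometric bins:
# bin k covers [g**k, g**(k+1)) with g = (1+eps)/(1-eps), so each bin has a
# representative within relative error eps. The exponent-indexed table is an
# illustration of the precomputation idea, not SZ's code. Normal positive
# doubles only.

EPS = 0.1
G = (1 + EPS) / (1 - EPS)
LOG_G = math.log(G)

def bin_by_log(v):
    # Baseline: one (costly) logarithm per value.
    return math.floor(math.log(v) / LOG_G)

# Precompute, for each binary exponent e, the bin of the smallest value with
# that exponent (2**(e-1)); at runtime only a short scan remains, at most
# about log(2)/log(G) steps (roughly 4 for eps = 0.1).
START = {e: math.floor((e - 1) * math.log(2) / LOG_G) for e in range(-1021, 1025)}

def bin_by_table(v):
    _, e = math.frexp(v)            # v = m * 2**e with 0.5 <= m < 1
    k = START[e]                    # precomputed lower bound for the bin
    while G ** (k + 1) <= v:        # tiny scan replaces the logarithm
        k += 1
    return k

for v in (0.001, 1.0, 3.14159, 1e6):
    assert bin_by_log(v) == bin_by_table(v)
```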
Citations: 9
Economics of Information Storage: The Value in Storing the Long Tail
Pub Date : 2019-05-01 DOI: 10.1109/MSST.2019.000-4
J. Hughes
We have witnessed a 50 million-fold increase in hard disk drive density without a similar increase in performance. How is this unbalanced growth possible? Can it continue? Can similar unbalanced growth happen in other media? To answer these questions, we contrast the value of information storage services with the value of physical storage services. We describe a methodology that separates the costs of capturing, storing and accessing information, and we show that these aspects of storage systems are independent of each other. We provide arguments for what can happen if the cost of storage continues to decrease. The conclusions are three-fold. First, as the capacity of any storage medium grows, there is no inherent requirement that performance increase at the same rate. Second, the value of increased-capacity devices can be quantified. Third, as the cost of storing information approaches zero, the quantity of information stored will grow without limit.
Citations: 1