Everyone Loves File
Bradley C. Kuszmaul, Matteo Frigo, Justin Mazzola Paluska, Alexander Sandler
Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed using a nonblocking lock-free programming style. Presentation servers maintain no persistent local state, making them scalable and easy to fail over. A non-scalable Paxos-replicated hash table holds the configuration information required to bootstrap the system. An additional B-tree provides conversational multi-key minitransactions for control-plane information. System throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about four times that of a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.
{"title":"Everyone Loves File","authors":"Bradley C. Kuszmaul, Matteo Frigo, Justin Mazzola Paluska, Alexander Sandler","doi":"10.1145/3377877","DOIUrl":"https://doi.org/10.1145/3377877","url":null,"abstract":"Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed using a nonblocking lock-free programming style. Presentation servers maintain no persistent local state making them scalable and easy to failover. A non-scalable Paxos-replicated hash table holds configuration information required to bootstrap the system. An additional B-tree provides conversational multi-key minitransactions for control-plane information. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123754454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EIC Message
As I start my second three-year term as Editor-in-Chief (EiC) of ACM TOS, I would like to take this opportunity to announce some shuffling of Associate Editors. Those leaving are (in alphabetical order) Nitin Agrawal, Sangyeon Cho, Cheng Huang, Onur Mutlu, Michael Swift, Nisha Talagala, Andy Wang, and Tong Zhang. I thank them for their devoted service over the last three years. Without their sacrifice, it would have been impossible to run this journal. I am also appointing a new batch of Associate Editors, namely (again, alphabetically): Yuan-Hao Chang, Jooyoung Hwang, Geoff Kuenning, Philip Shilane, Devesh Tiwari, Swami Sundararaman, and Ming Zhao. As their short bios that follow show, they are all respected experts in the field of storage. I am sure they will contribute immensely to the continued success of our journal.
{"title":"EIC Message","authors":"K. Wagner, Y. Zorian","doi":"10.1145/3372345","DOIUrl":"https://doi.org/10.1145/3372345","url":null,"abstract":"As I start my second three-year term as Editor-in-Chief (EiC) of ACM TOS, I would like to take this opportunity to announce some shuffling of Associate Editors. Those leaving are (in alphabetical order) Nitin Agrawal, Sangyeon Cho, Cheng Huang, Onur Mutlu, Michael Swift, Nisha Talagala, Andy Wang, and Tong Zhang. I thank them for their devoted services the last three years. Without their sacrifice, it would have been impossible to run this journal. I am also appointing a new batch of Associate Editors. Namely (again, alphabetically), Yuan Hao Chang, Jooyoung Hwang, Geoff Kuenning, Philip Shilane, Devesh Tiwali, Swami Sundararaman, and Ming Zhao. As their short bios that follow shows, they are all respected experts in the field of storage. I am sure they will contribute immensely to the continued success of our journal.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"47 16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122410700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
INSTalytics
Muthian Sivathanu, Midhul Vuppalapati, Bhargav S. Gulavani, K. Rajan, Jyoti Leeka, Jayashree Mohan, Piyus Kedia
We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems: instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves the performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via coordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that recovery performance and availability are similar to those of physical replication while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.
{"title":"INSTalytics","authors":"Muthian Sivathanu, Midhul Vuppalapati, Bhargav S. Gulavani, K. Rajan, Jyoti Leeka, Jayashree Mohan, Piyus Kedia","doi":"10.1145/3369738","DOIUrl":"https://doi.org/10.1145/3369738","url":null,"abstract":"We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that recovery performance and availability is similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121822298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Countering Fragmentation in an Enterprise Storage System
R. Kesavan, Matthew Curtis-Maury, V. Devadas, K. Mishra
As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations, as well as internal operations such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can accumulate intra-block fragmentation, which leads to wasted free space. Similarly, wasted space can also occur when a file system writes a collection of blocks out to object storage as a single large object, because the constituent blocks can become free at different times. The impact of fragmentation also depends on the underlying storage media. This article studies each form of fragmentation in the NetApp® WAFL® file system and explains how the file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The article analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.
{"title":"Countering Fragmentation in an Enterprise Storage System","authors":"R. Kesavan, Matthew Curtis-Maury, V. Devadas, K. Mishra","doi":"10.1145/3366173","DOIUrl":"https://doi.org/10.1145/3366173","url":null,"abstract":"As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. Similarly, wasted space can also occur when a file system writes a collection of blocks out to object storage as a single large object, because the constituent blocks can become free at different times. The impact of fragmentation also depends on the underlying storage media. This article studies each form of fragmentation in the NetApp® WAFL®file system, and explains how the file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The article analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121958323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphOne
P. Kumar, H. H. Huang
There is a growing need to perform a diverse set of real-time analytics (batch and stream analytics) on evolving graphs to deliver the value of big data to users. The key requirement from such applications is a data store that supports their diverse data access efficiently while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, whether graph databases or analytics engines, are not designed to achieve high performance for both operations; rather, each excels in one area by keeping a private data store organized to favor its own operations. To address this challenge, we have designed and developed GraphOne, a graph data store that abstracts the data store away from the specialized systems and addresses the fundamental research problems associated with its design. It combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestion (called data visibility) for concurrent execution of diverse classes of real-time graph analytics with only a small amount of data duplication. Experimental results show that GraphOne delivers 11.40× and 5.36× average speedups in ingestion rate against LLAMA and Stinger, the two state-of-the-art dynamic graph systems, respectively. Further, it achieves average speedups of 8.75× and 4.14× against LLAMA and 12.80× and 3.18× against Stinger for BFS and PageRank analytics (batch versions), respectively. GraphOne also gains over a 2,000× speedup against Kickstarter, a state-of-the-art stream analytics engine, when ingesting streaming edges and performing streaming BFS on a synthetic graph, treating the first half of the edges as a base snapshot and the rest as a stream. GraphOne also achieves an ingestion rate two to three orders of magnitude higher than graph databases. Finally, we demonstrate that it is possible to run concurrent stream analytics from the same data store.
{"title":"GraphOne","authors":"P. Kumar, H. H. Huang","doi":"10.1145/3364180","DOIUrl":"https://doi.org/10.1145/3364180","url":null,"abstract":"There is a growing need to perform a diverse set of real-time analytics (batch and stream analytics) on evolving graphs to deliver the values of big data to users. The key requirement from such applications is to have a data store to support their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, either graph databases or analytics engines, are not designed to achieve high performance for both operations; rather, they excel in one area that keeps a private data store in a specialized way to favor their operations only. To address this challenge, we have designed and developed GraphOne, a graph data store that abstracts the graph data store away from the specialized systems to solve the fundamental research problems associated with the data store design. It combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestions (called data visibility) for concurrent execution of diverse classes of real-time graph analytics with only a small data duplication. Experimental results show that GraphOne is able to deliver 11.40× and 5.36× average speedup in ingestion rate against LLAMA and Stinger, the two state-of-the-art dynamic graph systems, respectively. Further, they achieve an average speedup of 8.75× and 4.14× against LLAMA and 12.80× and 3.18× against Stinger for BFS and PageRank analytics (batch version), respectively. GraphOne also gains over 2,000× speedup against Kickstarter, a state-of-the-art stream analytics engine in ingesting the streaming edges and performing streaming BFS when treating first half as a base snapshot and rest as streaming edge in a synthetic graph. GraphOne also achieves an ingestion rate of two to three orders of magnitude higher than graph databases. Finally, we demonstrate that it is possible to run concurrent stream analytics from the same data store.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"483 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114280338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LDJ
Donghyun Kang, Sang-Won Lee, Y. Eom
In this article, we propose a simple but practical and efficient optimization scheme for journaling in ext4, called lightweight data journaling (LDJ). By compressing journaled data prior to writing, LDJ can perform comparably to or even faster than the default ordered journaling (OJ) mode in ext4 on both HDDs and flash storage devices, while still guaranteeing the version consistency of the data journaling (DJ) mode. This surprising result has three main explanations. First, on modern storage devices, the sequential write pattern that dominates in DJ mode increasingly outperforms the random writes of OJ mode. Second, compression significantly reduces the amount of journal writes, which in turn completes writes faster and prolongs the lifespan of storage devices. Third, compression also makes each journal write atomic without issuing an intervening FLUSH command between the journal data blocks and the commit block, halving the number of costly FLUSH calls in LDJ. We have prototyped LDJ by slightly modifying the existing ext4 with jbd2 for journaling and e2fsck for recovery; fewer than 300 lines of source code were changed. We also carried out a comprehensive evaluation using four standard benchmarks and three real applications. Our evaluation results clearly show that LDJ outperforms the OJ mode by up to 9.6× on the real applications.
{"title":"LDJ","authors":"Donghyun Kang, Sang-Won Lee, Y. Eom","doi":"10.1145/3365918","DOIUrl":"https://doi.org/10.1145/3365918","url":null,"abstract":"In this article, we propose a simple but practical and efficient optimization scheme for journaling in ext4, called lightweight data journaling (LDJ). By compressing journaled data prior to writing, LDJ can perform comparable to or even faster than the default ordered journaling (OJ) mode in ext4 on top of both HDDs and flash storage devices, while still guaranteeing the version consistency of the data journaling (DJ) mode. This surprising result can be explained with three main reasons. First, on modern storage devices, the sequential write pattern dominating in DJ mode is more and more high-performant than the random one in OJ mode. Second, the compression significantly reduces the amount of journal writes, which will in turn make the write completion faster and prolong the lifespan of storage devices. Third, the compression also enables the atomicity of each journal write without issuing an intervening FLUSH command between journal data blocks and commit block, thus halving the number of costly FLUSH calls in LDJ. We have prototyped our LDJ by slightly modifying the existing ext4 with jbd2 for journaling and also e2fsck for recovery; less than 300 lines of source code were changed. Also, we carried out a comprehensive evaluation using four standard benchmarks and three real applications. Our evaluation results clearly show that LDJ outperforms the OJ mode by up to 9.6× on the real applications.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130082045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems
Ji Wang, Weidong Bao, Lei Zheng, Xiaomin Zhu, Philip S. Yu
Data centers equipped with large-scale storage systems are critical infrastructures in the era of big data. The enormous number of hard drives in storage systems magnifies the failure probability, which may cause tremendous loss for both data service users and providers. Despite a set of reactive fault-tolerant measures such as RAID, it is still a tough problem to enhance the reliability of large-scale storage systems. Proactive prediction is an effective method to avoid possible hard-drive failures in advance. A series of models based on SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, there remain serious yet unsolved challenges, such as the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, can not only monitor the status of hard drives but also assist in diagnosing the causes of failures. Experiments based on real-world datasets show that the proposed deep architecture is able to assess hard-drive status and predict impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can reveal the degradation progression of hard drives automatically and assist administrators in tracing the causes of hard-drive failures.
{"title":"An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems","authors":"Ji Wang, Weidong Bao, Lei Zheng, Xiaomin Zhu, Philip S. Yu","doi":"10.1145/3340290","DOIUrl":"https://doi.org/10.1145/3340290","url":null,"abstract":"Data centers equipped with large-scale storage systems are critical infrastructures in the era of big data. The enormous amount of hard drives in storage systems magnify the failure probability, which may cause tremendous loss for both data service users and providers. Despite a set of reactive fault-tolerant measures such as RAID, it is still a tough issue to enhance the reliability of large-scale storage systems. Proactive prediction is an effective method to avoid possible hard-drive failures in advance. A series of models based on the SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, there remain some serious yet unsolved challenges like the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, cannot only monitor the status of hard drives but also assist in failure cause diagnoses. The experiments based on real-world datasets show that the proposed deep architecture is able to assess the hard-drive status and predict the impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can reveal the degradation progression of hard drives automatically and assist administrators in tracing the cause of hard drive failures.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116262087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Repair Pipelining for Erasure-coded Storage: Algorithms and Evaluation
Xiaolu Li, Zuoru Yang, Jinhong Li, Runhui Li, P. Lee, Qun Huang, Yuchong Hu
We propose repair pipelining, a technique that speeds up the repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the same as the normal read time for a single block in homogeneous environments. We further design different extensions of repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called ECPipe, and integrate it as a middleware system into two versions of Hadoop Distributed File System (HDFS) (namely, HDFS-RAID and HDFS-3) as well as Quantcast File System. Experiments on a local testbed and Amazon EC2 show that repair pipelining significantly improves the performance of degraded reads and full-node recovery over existing repair techniques.
{"title":"Repair Pipelining for Erasure-coded Storage: Algorithms and Evaluation","authors":"Xiaolu Li, Zuoru Yang, Jinhong Li, Runhui Li, P. Lee, Qun Huang, Yuchong Hu","doi":"10.1145/3436890","DOIUrl":"https://doi.org/10.1145/3436890","url":null,"abstract":"We propose repair pipelining, a technique that speeds up the repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the same as the normal read time for a single block in homogeneous environments. We further design different extensions of repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called ECPipe, and integrate it as a middleware system into two versions of Hadoop Distributed File System (HDFS) (namely, HDFS-RAID and HDFS-3) as well as Quantcast File System. Experiments on a local testbed and Amazon EC2 show that repair pipelining significantly improves the performance of degraded reads and full-node recovery over existing repair techniques.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121651446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZoneTier
Xuchao Xie, Liquan Xiao, D. H. Du
Integrating solid-state drives (SSDs) and host-aware shingled magnetic recording (HA-SMR) drives can potentially build a cost-effective, high-performance storage system. However, existing SSD tiering and caching designs in such a hybrid system are not fully matched to the intrinsic properties of HA-SMR drives because they do not consider how to handle non-sequential writes (NSWs). We propose ZoneTier, a zone-based storage tiering and caching co-design that effectively controls all NSWs by leveraging the host-aware property of HA-SMR drives. ZoneTier exploits the real-time data layout of SMR zones to optimize zone placement, reshapes NSWs generated by zone demotions into SMR-preferred sequential writes, and transforms the unavoidable NSWs into cleaning-friendly write traffic for SMR zones. ZoneTier can be easily extended to host-managed SMR drives using a proactive cleaning policy. We implemented a prototype of ZoneTier with user-space data management algorithms and real SSD and HA-SMR drives, manipulated through the functions provided by libzbc and libaio. Our experiments show that ZoneTier can reduce zone relocation overhead by 29.41% on average, shorten the performance recovery time of HA-SMR drives after cleaning by up to 33.37%, and improve performance by up to 32.31% compared to existing hybrid storage designs.
{"title":"ZoneTier","authors":"Xuchao Xie, Liquan Xiao, D. H. Du","doi":"10.1145/3335548","DOIUrl":"https://doi.org/10.1145/3335548","url":null,"abstract":"Integrating solid-state drives (SSDs) and host-aware shingled magnetic recording (HA-SMR) drives can potentially build a cost-effective high-performance storage system. However, existing SSD tiering and caching designs in such a hybrid system are not fully matched with the intrinsic properties of HA-SMR drives due to their lacking consideration of how to handle non-sequential writes (NSWs). We propose ZoneTier, a zone-based storage tiering and caching co-design, to effectively control all the NSWs by leveraging the host-aware property of HA-SMR drives. ZoneTier exploits real-time data layout of SMR zones to optimize zone placement, reshapes NSWs generated from zone demotions to SMR preferred sequential writes, and transforms the inevitable NSWs to cleaning-friendly write traffics for SMR zones. ZoneTier can be easily extended to match host-managed SMR drives using proactive cleaning policy. We implemented a prototype of ZoneTier with user space data management algorithms and real SSD and HA-SMR drives, which are manipulated by the functions provided by libzbc and libaio. Our experiments show that ZoneTier can reduce zone relocation overhead by 29.41% on average, shorten performance recovery time of HA-SMR drives from cleaning by up to 33.37%, and improve performance by up to 32.31% than existing hybrid storage designs.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125571672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CORES
Weidong Wen, Yang Li, Wenhai Li, Lingfeng Deng, Yanxiang He
The relatively high cost of record deserialization is increasingly becoming the bottleneck of column-based storage systems in tree-structured applications [58]. Due to record transformation in the storage layer, unnecessary processing costs derived from fields and rows irrelevant to queries may be very heavy in nested schemas, significantly wasting computational resources in large-scale analytical workloads. This leads to the question of how to reduce both the deserialization and IO costs of queries with highly selective filters following arbitrary paths in a nested schema. We present CORES (Column-Oriented Regeneration Embedding Scheme) to push highly selective filters down into column-based storage engines, where each filter consists of several filtering conditions on a field. By applying highly selective filters in the storage layer, we demonstrate that both the deserialization and IO costs can be significantly reduced. We show how to introduce fine-grained composition on filtering results. We generalize this technique by two pair-wise operations, rollup and drilldown, such that a series of conjunctive filters can effectively deliver their payloads in a nested schema. The proposed methods are implemented on an open-source platform. For practical purposes, we highlight how to build a column storage engine and how to drive a query efficiently based on a cost model. We apply this design to the nested relational model, especially when hierarchical entities are frequently required by ad hoc queries. The experiments, including a real workload and the modified TPC-H benchmark, demonstrate that CORES improves performance by 0.7×–26.9× compared to state-of-the-art platforms in scan-intensive workloads.
{"title":"CORES","authors":"Weidong Wen, Yang Li, Wenhai Li, Lingfeng Deng, Yanxiang He","doi":"10.1145/3321704","DOIUrl":"https://doi.org/10.1145/3321704","url":null,"abstract":"The relatively high cost of record deserialization is increasingly becoming the bottleneck of column-based storage systems in tree-structured applications [58]. Due to record transformation in the storage layer, unnecessary processing costs derived from fields and rows irrelevant to queries may be very heavy in nested schemas, significantly wasting the computational resources in large-scale analytical workloads. This leads to the question of how to reduce both the deserialization and IO costs of queries with highly selective filters following arbitrary paths in a nested schema. We present CORES (Column-Oriented Regeneration Embedding Scheme) to push highly selective filters down into column-based storage engines, where each filter consists of several filtering conditions on a field. By applying highly selective filters in the storage layer, we demonstrate that both the deserialization and IO costs could be significantly reduced. We show how to introduce fine-grained composition on filtering results. We generalize this technique by two pair-wise operations, rollup and drilldown, such that a series of conjunctive filters can effectively deliver their payloads in nested schema. The proposed methods are implemented on an open-source platform. For practical purposes, we highlight how to build a column storage engine and how to drive a query efficiently based on a cost model. We apply this design to the nested relational model especially when hierarchical entities are frequently required by ad hoc queries. The experiments, including a real workload and the modified TPCH benchmark, demonstrate that CORES improves the performance by 0.7×--26.9× compared to state-of-the-art platforms in scan-intensive workloads.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114112799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}