SolarDB
Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, Huiqi Hu
DOI: 10.1145/3318158 (ACM Transactions on Storage, published 2019-06-25)
Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Although modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This article presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include (1) a shared-everything architecture based on a two-layer log-structured merge-tree; (2) a new concurrency control algorithm that works with the log-structured storage and ensures efficient, non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; and (3) fine-grained data access that minimizes and balances network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank, and a real-world workload, SolarDB outperforms existing shared-nothing systems by up to 50x when close to or more than 5% of the transactions are distributed.
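To make the two-layer design concrete, the sketch below shows, in Python and as a hypothetical illustration rather than SolarDB's actual code, a read/write path in which recent writes live in an in-memory delta owned by the transaction layer while older data lives in an immutable baseline in the shared storage layer. The class and method names, and the use of plain dictionaries, are our own simplifications.

```python
# Illustrative sketch only (not SolarDB's code): a two-layer log-structured
# read/write path. Recent writes go to a small in-memory delta; older data
# lives in an immutable baseline snapshot (assumed here to be a plain dict).

class TwoLayerStore:
    def __init__(self, baseline):
        self.delta = {}            # recent committed writes (memory layer)
        self.baseline = baseline   # immutable storage-layer snapshot

    def put(self, key, value):
        self.delta[key] = value    # writes never touch the baseline directly

    def get(self, key):
        # Newest data first: check the delta, then fall back to the baseline.
        if key in self.delta:
            return self.delta[key]
        return self.baseline.get(key)

    def compact(self):
        # In a real system this merge runs in the background across storage
        # nodes; here it is shown as a simple synchronous fold of the delta
        # into a new baseline.
        merged = dict(self.baseline)
        merged.update(self.delta)
        self.baseline, self.delta = merged, {}
```

In SolarDB the equivalent merge happens across storage nodes and must not block transaction processing, which is the situation the article's concurrency control algorithm is designed to handle.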
{"title":"SolarDB","authors":"Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong-Ye Xie, Ryan Stutsman, HaiNing Li, Huiqi Hu","doi":"10.1145/3318158","DOIUrl":"https://doi.org/10.1145/3318158","url":null,"abstract":"Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Although modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This article presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include (1) a shared-everything architecture based on a two-layer log-structured merge-tree; (2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; and (3) find-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank, and a real-world workload, SolarDB outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132837467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Level Hashing
Pengfei Zuo, Yu Hua, Jie Wu
DOI: 10.1145/3322096 (ACM Transactions on Storage, published 2019-06-21)
Non-volatile memory (NVM) technologies as persistent memory are promising candidates to complement or replace DRAM in future memory systems, thanks to their high density, low power, and non-volatility. In main memory systems, hashing index structures are fundamental building blocks that provide fast query responses. However, hashing index structures originally designed for dynamic random access memory (DRAM) become inefficient for persistent memory due to new challenges, including the hardware limitations of NVM and the requirement of data consistency. To address these challenges, this article proposes level hashing, a write-optimized and high-performance hashing index scheme with a low-overhead consistency guarantee and cost-efficient resizing. Level hashing provides a sharing-based two-level hash table, which achieves constant-scale worst-case time complexity for search, insertion, deletion, and update operations, and rarely incurs extra NVM writes. To guarantee consistency with low overhead, level hashing leverages log-free consistency schemes for deletion, insertion, and resizing operations, and an opportunistic log-free scheme for the update operation. To cost-efficiently resize the hash table, level hashing leverages an in-place resizing scheme that only needs to rehash 1/3 of the buckets to expand the table (and 2/3 of the buckets to shrink it) instead of the entire table, thus significantly improving resizing performance and reducing the number of rehashed buckets. Extensive experimental results show that level hashing speeds up insertions by 1.4×–3.0×, updates by 1.2×–2.1×, expanding by over 4.3×, and shrinking by over 1.4×, while maintaining high search and deletion performance compared with state-of-the-art hashing schemes.
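As a rough intuition for the sharing-based two-level layout, here is a toy Python sketch (our simplification, not the authors' implementation): every key has two candidate buckets in the top level, and each bottom-level bucket is shared as a standby by two top-level buckets. The bucket size, hash functions, and the absence of item movement are arbitrary choices for illustration.

```python
# Toy two-level hash table: a top level of N buckets and a bottom level of
# N/2 buckets; each key hashes to two top-level candidates, and each
# bottom-level bucket serves as shared standby space for two top buckets.

SLOTS_PER_BUCKET = 4

def _h1(key, n): return hash(("h1", key)) % n
def _h2(key, n): return hash(("h2", key)) % n

class TwoLevelHash:
    def __init__(self, top_buckets=8):
        self.n = top_buckets
        self.top = [[] for _ in range(self.n)]
        self.bottom = [[] for _ in range(self.n // 2)]

    def _candidates(self, key):
        i, j = _h1(key, self.n), _h2(key, self.n)
        # Top bucket i shares bottom bucket i // 2 as its standby.
        return [self.top[i], self.top[j],
                self.bottom[i // 2], self.bottom[j // 2]]

    def insert(self, key, value):
        for bucket in self._candidates(key):
            if len(bucket) < SLOTS_PER_BUCKET:
                bucket.append((key, value))
                return True
        return False   # a real implementation would move items or resize

    def search(self, key):
        for bucket in self._candidates(key):
            for k, v in bucket:
                if k == key:
                    return v
        return None
```

In this toy layout the bottom level holds N/2 of the 3N/2 total buckets, i.e., one third of them, which matches the abstract's point that an in-place expansion rehashing only the bottom-level buckets touches about 1/3 of the table.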
{"title":"Level Hashing","authors":"Pengfei Zuo, Yu Hua, Jie Wu","doi":"10.1145/3322096","DOIUrl":"https://doi.org/10.1145/3322096","url":null,"abstract":"Non-volatile memory (NVM) technologies as persistent memory are promising candidates to complement or replace DRAM for building future memory systems, due to having the advantages of high density, low power, and non-volatility. In main memory systems, hashing index structures are fundamental building blocks to provide fast query responses. However, hashing index structures originally designed for dynamic random access memory (DRAM) become inefficient for persistent memory due to new challenges including hardware limitations of NVM and the requirement of data consistency. To address these challenges, this article proposes level hashing, a write-optimized and high-performance hashing index scheme with low-overhead consistency guarantee and cost-efficient resizing. Level hashing provides a sharing-based two-level hash table, which achieves constant-scale worst-case time complexity for search, insertion, deletion, and update operations, and rarely incurs extra NVM writes. To guarantee the consistency with low overhead, level hashing leverages log-free consistency schemes for deletion, insertion, and resizing operations, and an opportunistic log-free scheme for update operation. To cost-efficiently resize this hash table, level hashing leverages an in-place resizing scheme that only needs to rehash 1/3 of buckets instead of the entire table to expand a hash table and rehash 2/3 of buckets to shrink a hash table, thus significantly improving the resizing performance and reducing the number of rehashed buckets. Extensive experimental results show that the level hashing speeds up insertions by 1.4×−3.0×, updates by 1.2×−2.1×, expanding by over 4.3×, and shrinking by over 1.4× while maintaining high search and deletion performance compared with start-of-the-art hashing schemes.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130498617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Introduction to the Special Section on OSDI’18
A. Arpaci-Dusseau, G. Voelker
DOI: 10.1145/3322101 (ACM Transactions on Storage, published 2019-05-31)
This special section of the ACM Transactions on Storage presents two articles from the 13th USENIX Symposium on Operating System Design and Implementation (OSDI’18). OSDI’18 contained 47 exceptionally strong articles across a range of topics: file and storage systems, networking, scheduling, security, formal verification, graph processing, machine learning, programming languages, fault-tolerance and reliability, debugging, and, of course, operating systems design and implementation. The two high-quality articles we have selected for TOS focus, not surprisingly, on file and storage systems. The first article is “Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory” by Pengfei Zuo, Yu Hua, and Jie Wu. This article introduces an elegant hashing data structure for non-volatile memory (NVM), called level hashing. Level hashing optimizes for NVM with low-overhead consistency mechanisms and by reducing the number of write operations. Level hashing has particularly interesting algorithms for performing in-place resizing of the hash table. The second article is “Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing” by Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. CrashMonkey and Ace are a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey tests a target file system by simulating power-loss crashes and then checks if the file system recovers to a correct state. Ace automatically generates interesting workloads to stress the target file system; Ace is particularly innovative in how it explores the infinite space of possible workloads. With these new tools, the authors have found many difficult crash-consistency bugs, including 10 previously unknown bugs in widely used, mature Linux file systems and one in FSCQ, a verified file system. We hope you will find these articles interesting and inspiring!
{"title":"Introduction to the Special Section on OSDI’18","authors":"A. Arpaci-Dusseau, G. Voelker","doi":"10.1145/3322101","DOIUrl":"https://doi.org/10.1145/3322101","url":null,"abstract":"This special section of the ACM Transactions on Storage presents two articles from the 13th USENIX Symposium on Operating System Design and Implementation (OSDI’18). OSDI’18 contained 47 exceptionally strong articles across a range of topics: file and storage systems, networking, scheduling, security, formal verification, graph processing, machine learning, programming languages, fault-tolerance and reliability, debugging, and, of course, operating systems design and implementation. The two high-quality articles we have selected for TOS focus, not surprisingly, on file and storage systems. The first article is “Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory” by Pengfei Zuo, Yu Hua, and Jie Wu. This article introduces an elegant hashing data structure for non-volatile memory (NVM), called level hashing. Level hashing optimizes for NVM with low-overhead consistency mechanisms and by reducing the number of write operations. Level hashing has particularly interesting algorithms for performing in-place resizing of the hash table. The second article is “Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing” by Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. CrashMonkey and Ace are a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey tests a target file system by simulating power-loss crashes and then checks if the file system recovers to a correct state. Ace automatically generates interesting workloads to stress the target file system; Ace is particularly innovative in how it explores the infinite space of possible workloads. With these new tools, the authors have found many difficult crash-consistency bugs, including 10 previously unknown bugs in widely used, mature Linux file systems and one in FSCQ, a verified file system. We hope you will find these articles interesting and inspiring!","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134072306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mitigating Synchronous I/O Overhead in File Systems on Open-Channel SSDs
Youyou Lu, J. Shu, Jiacheng Zhang
DOI: 10.1145/3319369 (ACM Transactions on Storage, published 2019-05-31)
Synchronous I/O has long been a design challenge in file systems. Although open-channel solid state drives (SSDs) provide better performance and endurance to file systems, they still suffer from synchronous I/Os due to amplified writes and poor hot/cold data grouping. The reason lies in the conflicting design choices between flash write and read/erase operations: while fine-grained logging improves performance and endurance for writes, it hurts indexing and data-grouping efficiency for read and erase operations. In this article, we propose a flash-friendly data layout that introduces a built-in persistent staging layer to provide balanced read, write, and garbage collection performance. Based on this, we design a new flash file system named StageFS, which decouples content updates from structure updates. Content updates are logically logged to the staging layer in a persistence-efficient way, which achieves better write performance and lower write amplification. The updated contents are then reorganized into the normal data area for structure updates, with improved hot/cold grouping and page-level indexing, which is friendlier to read and garbage collection operations. Evaluation results show that, compared to the recent flash-friendly file system F2FS, StageFS improves performance by up to 211.4% and achieves low garbage collection overhead for workloads with frequent synchronization.
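The decoupling of content updates from structure updates can be pictured with a small sketch. The Python below is our illustration only (StageFS is a kernel file system; names like StagedStore are invented): synchronous writes are appended to a persistent staging log and indexed there, and a later reorganization pass moves them into the normal data area, where grouping and page-level indexing would apply.

```python
# Our illustration of decoupled content/structure updates (not StageFS code).

class StagedStore:
    def __init__(self):
        self.staging_log = []      # append-only, persistence-efficient log
        self.data_area = {}        # "normal" data area keyed by page id
        self.index = {}            # page id -> ("staging", pos) or ("data", None)

    def sync_write(self, page_id, data):
        # Content update: append to the staging log and return quickly.
        self.staging_log.append((page_id, data))
        self.index[page_id] = ("staging", len(self.staging_log) - 1)

    def read(self, page_id):
        where, pos = self.index[page_id]
        if where == "staging":
            return self.staging_log[pos][1]
        return self.data_area[page_id]

    def reorganize(self):
        # Structure update: move staged pages into the data area in bulk;
        # this is the step where hot/cold grouping decisions would be made.
        for page_id, data in self.staging_log:
            self.data_area[page_id] = data
            self.index[page_id] = ("data", None)
        self.staging_log.clear()
```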
{"title":"Mitigating Synchronous I/O Overhead in File Systems on Open-Channel SSDs","authors":"Youyou Lu, J. Shu, Jiacheng Zhang","doi":"10.1145/3319369","DOIUrl":"https://doi.org/10.1145/3319369","url":null,"abstract":"Synchronous I/O has long been a design challenge in file systems. Although open-channel solid state drives (SSDs) provide better performance and endurance to file systems, they still suffer from synchronous I/Os due to the amplified writes and worse hot/cold data grouping. The reason lies in the controversy design choices between flash write and read/erase operations. While fine-grained logging improves performance and endurance in writes, it hurts indexing and data grouping efficiency in read and erase operations. In this article, we propose a flash-friendly data layout by introducing a built-in persistent staging layer to provide balanced read, write, and garbage collection performance. Based on this, we design a new flash file system (FS) named StageFS, which decouples the content and structure updates. Content updates are logically logged to the staging layer in a persistence-efficient way, which achieves better write performance and lower write amplification. The updated contents are reorganized into the normal data area for structure updates, with improved hot/cold grouping and in a page-level indexing way, which is more friendly to read and garbage collection operations. Evaluation results show that, compared to recent flash-friendly file system (F2FS), StageFS effectively improves performance by up to 211.4% and achieves low garbage collection overhead for workloads with frequent synchronization.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128622446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

An Exploratory Study on Software-Defined Data Center Hard Disk Drives
Yin Li, Xubin Chen, Ning Zheng, Jingpeng Hao, T. Zhang
DOI: 10.1145/3319405 (ACM Transactions on Storage, published 2019-05-21)
This article presents a design framework that aims to reduce mass data storage cost in data centers. Its underlying principle is simple: one may noticeably reduce HDD manufacturing cost by significantly (i.e., by at least several orders of magnitude) relaxing raw HDD reliability, while ensuring eventual data storage integrity through low-cost system-level redundancy. We call this system-assisted HDD bit cost reduction. To better utilize both the capacity and the random IOPS of HDDs, it is desirable to mix data with complementary requirements on capacity and random IOPS in each HDD. Nevertheless, different capacity and random IOPS requirements may demand different raw-HDD-reliability vs. bit-cost trade-offs and hence different forms of system-assisted bit cost reduction. This article presents a software-centric design framework that realizes data-adaptive, system-assisted bit cost reduction for data center HDDs. The implementation is handled solely by the filesystem and demands only a minor change to the error correction coding (ECC) module inside HDDs. Hence, it is completely transparent to all the other components in the software stack (e.g., applications, OS kernel, and drivers) and keeps fundamental HDD design practice (e.g., firmware, media, head, and servo) intact. We carried out analysis and experiments to evaluate its implementation feasibility and effectiveness, and integrated the design techniques into ext4 to quantitatively measure their impact on system speed performance.
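As a hedged illustration of system-level redundancy compensating for a less reliable raw medium (not the paper's actual ECC or filesystem design), the sketch below stores a checksum per block plus one XOR parity block per group, so a block that fails its checksum on read can be rebuilt from the rest of its group. The group size and hash choice are arbitrary.

```python
# Our toy model of system-assisted redundancy: per-block checksums detect a
# bad read from the (unreliable) medium, and a per-group XOR parity block
# lets the system rebuild the lost block from its siblings.

import hashlib
from functools import reduce

GROUP = 4  # data blocks per parity group (arbitrary for illustration)

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def write_group(blocks):
    assert len(blocks) == GROUP and len({len(b) for b in blocks}) == 1
    return {"data": list(blocks),
            "sums": [checksum(b) for b in blocks],
            "parity": xor_blocks(blocks)}

def read_block(group, i):
    block = group["data"][i]
    if checksum(block) == group["sums"][i]:
        return block
    # Checksum mismatch: rebuild the block from its siblings plus parity.
    siblings = [b for j, b in enumerate(group["data"]) if j != i]
    return xor_blocks(siblings + [group["parity"]])
```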
{"title":"An Exploratory Study on Software-Defined Data Center Hard Disk Drives","authors":"Yin Li, Xubin Chen, Ning Zheng, Jingpeng Hao, T. Zhang","doi":"10.1145/3319405","DOIUrl":"https://doi.org/10.1145/3319405","url":null,"abstract":"This article presents a design framework aiming to reduce mass data storage cost in data centers. Its underlying principle is simple: Assume one may noticeably reduce the HDD manufacturing cost by significantly (i.e., at least several orders of magnitude) relaxing raw HDD reliability, which ensures the eventual data storage integrity via low-cost system-level redundancy. This is called system-assisted HDD bit cost reduction. To better utilize both capacity and random IOPS of HDDs, it is desirable to mix data with complementary requirements on capacity and random IOPS in each HDD. Nevertheless, different capacity and random IOPS requirements may demand different raw HDD reliability vs. bit cost trade-offs and hence different forms of system-assisted bit cost reduction. This article presents a software-centric design framework to realize data-adaptive system-assisted bit cost reduction for data center HDDs. Implementation is solely handled by the filesystem and demands only minor change of the error correction coding (ECC) module inside HDDs. Hence, it is completely transparent to all the other components in the software stack (e.g., applications, OS kernel, and drivers) and keeps fundamental HDD design practice (e.g., firmware, media, head, and servo) intact. We carried out analysis and experiments to evaluate its implementation feasibility and effectiveness. We integrated the design techniques into ext4 to further quantitatively measure its impact on system speed performance.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127447699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Introduction to the Special Section on the 2018 USENIX Annual Technical Conference (ATC’18)
Haryadi S. Gunawi, B. Reed
DOI: 10.1145/3322100 (ACM Transactions on Storage, published 2019-05-13)
This special section of the ACM Transactions on Storage presents some of the highlights of the 2018 USENIX Annual Technical Conference (ATC’18). Over the years, USENIX ATC has evolved into a community of researchers and practitioners working on a diverse and expanding set of research topics; the conference represents some of the latest and best work being done, and this year was no different. ATC’18 received a record 377 submissions. Of these, we selected three high-quality storage-related articles for publication in this special section of ACM Transactions on Storage. The first article, which was also selected as one of the best papers of ATC’18, is “TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions” by Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, Tianyu Cheng, Vijay Chidambaram, and Emmett Witchel. Building transactional systems is complex and error-prone. The authors introduce a novel approach to building a transactional file system by taking advantage of the mature, well-tested filesystem journal feature. Compared to earlier transactional file systems, it is easy to develop and use, and it demonstrates performance boosts for a number of different workloads. The second article is “CGraph: A Distributed Storage and Processing System for Concurrent Iterative Graph Analysis Jobs” by Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, and Ligang He. Distributed graph processing platforms, which handle massive numbers of Concurrent iterative Graph Processing (CGP) jobs, are now widely used. However, existing distributed systems face a high ratio of data access cost to computation for CGP jobs, which results in low throughput. The authors observed that this happens because CGP jobs need to repeatedly traverse the shared graph structure. They therefore propose exploiting the spatial and temporal correlations between the data accesses of these jobs so that multiple concurrent iterative graph processing jobs can efficiently share the graph data, providing higher throughput. The final article is “SolarDB: Towards a Shared-Everything Database on Distributed Log-Structured Storage” by Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, and Huiqi Hu. Supporting transactions with outstanding performance in a distributed database has been a complex issue to tackle. In this work, the authors describe how to build a shared-everything distributed relational database that achieves dramatically faster, fine-grained, and non-blocking transactions based on log-structured storage and new concurrency control mechanisms.
{"title":"Introduction to the Special Section on the 2018 USENIX Annual Technical Conference (ATC’18)","authors":"Haryadi S. Gunawi, B. Reed","doi":"10.1145/3322100","DOIUrl":"https://doi.org/10.1145/3322100","url":null,"abstract":"This special section of the ACM Transactions on Storage presents some of the highlights of the 2018 USENIX Annual Technical Conference (ATC’18). Over the years, USENIX ATC has evolved into a community of researchers and practitioners working on a diverse and expanding set of research topics; the conference represents some of the latest and best work being done, and this year was no different. ATC’18 received a record number of 377 submissions. Of these, we selected three high-quality storage-related articles for publication in this special section of ACM Transactions on Storage. The first article, which was also selected as one of the best papers in ATC’18, is “TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions” by Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, Tianyu Cheng, Vijay Chidambaram, and Emmett Witchel. Building transactional systems is complex and error-prone. The authors of this article introduce a novel approach to build a transactional file system by taking advantage of the mature, well-tested filesystem journal feature. Compared to earlier transactional file systems, it is easy to develop and use. It also demonstrates performance boosts for a number of different workloads. The second article is “CGraph: A Distributed Storage and Processing System for Concurrent Iterative Graph Analysis Jobs” by Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, and Ligang He. Nowadays, distributed graph processing platform, which handles massive Concurrent iterative Graph Processing (CGP) jobs, is widely used. However, existing distributed systems face a high ratio of data access cost to computation for the CGP jobs, which incurs low throughput. The authors observed that this phenomenon happened because these CGP jobs need to repeatedly traverse the shared graph structure. They then propose exploiting the observed spatial and temporal correlations between the data accesses of these jobs to enable multiple concurrent iterative graph processing jobs. Hence, it efficiently shares the graph data and provides higher throughput. The final article is “SolarDB: Towards a Shared-Everything Database on Distributed Log-Structured Storage” by Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, and Huiqi Hu. Supporting transactions in distributed database with outstanding performance has been a complex issue to tackle. In this work, the authors describe how to build a shared-everything distributed relational database that achieves dramatically faster, fine-grained, and non-blocking transactions, based on log-structured trees and new control mechanisms.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"60 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114003710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

TxFS
Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, T. Cheng, Vijay Chidambaram, E. Witchel
DOI: 10.1145/3318159 (ACM Transactions on Storage, published 2019-05-08)
We introduce TxFS, a transactional file system that builds upon a file system’s atomic-update mechanism such as journaling. Though prior work has explored a number of transactional file systems, TxFS has a unique set of properties: a simple API, portability across different hardware, high performance, low complexity (by building on the file-system journal), and full ACID transactions. We port SQLite, OpenLDAP, and Git to use TxFS and experimentally show that TxFS provides strong crash consistency while providing equal or better performance.
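The programming model is the easiest part to picture: an application brackets a set of file-system operations so that they become visible and durable atomically. The Python sketch below is purely conceptual; TxFS itself is a kernel file system, and the txfs_* functions here are stand-in stubs, not its real API or bindings.

```python
# Conceptual sketch of a transactional-file-system programming model.
# The txfs_* functions are stubs for illustration only.

import os

def txfs_begin():  pass   # stub: would start a file-system transaction
def txfs_commit(): pass   # stub: would atomically persist all updates since begin
def txfs_abort():  pass   # stub: would roll back all updates since begin

def replace_file(dirpath, old_name, new_name, data: bytes):
    """Create the new file and remove the old one as one atomic unit."""
    txfs_begin()
    try:
        with open(os.path.join(dirpath, new_name), "wb") as f:
            f.write(data)
        os.unlink(os.path.join(dirpath, old_name))
        txfs_commit()   # either both updates survive a crash, or neither does
    except Exception:
        txfs_abort()
        raise
```

The appeal described in the abstract is that this atomicity is provided by the file system's existing journal, so applications like SQLite, OpenLDAP, and Git no longer have to build their own rollback or write-ahead logic.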
{"title":"TxFS","authors":"Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, T. Cheng, Vijay Chidambaram, E. Witchel","doi":"10.1145/3318159","DOIUrl":"https://doi.org/10.1145/3318159","url":null,"abstract":"We introduce TxFS, a transactional file system that builds upon a file system’s atomic-update mechanism such as journaling. Though prior work has explored a number of transactional file systems, TxFS has a unique set of properties: a simple API, portability across different hardware, high performance, low complexity (by building on the file-system journal), and full ACID transactions. We port SQLite, OpenLDAP, and Git to use TxFS and experimentally show that TxFS provides strong crash consistency while providing equal or better performance.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121652287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance and Resource Utilization of FUSE User-Space File Systems
Bharath Kumar Reddy Vangoor, Prafful Agarwal, Manu Mathew, Arun Ramachandran, Swaminathan Sivaraman, Vasily Tarasov, E. Zadok
DOI: 10.1145/3310148 (ACM Transactions on Storage, published 2019-05-08)
Traditionally, file systems were implemented as part of operating system kernels, which provide a limited set of tools and facilities to a programmer. As the complexity of file systems grew, many new file systems began to be developed in user space. Low performance is considered the main disadvantage of user-space file systems, but the extent of this problem has never been explored systematically. As a result, the topic of user-space file systems remains rather controversial: while some consider user-space file systems a “toy” not to be used in production, others develop full-fledged production file systems in user space. In this article, we analyze the design and implementation of FUSE, a well-known user-space file system framework for Linux, and characterize its performance and resource utilization for a wide range of workloads. We measure FUSE's performance and resource utilization under various mount and configuration options, using 45 different workloads generated with Filebench on two different hardware configurations. We instrumented FUSE to extract useful statistics and traces, which helped us analyze its performance bottlenecks, and we present our analysis results. Our experiments indicate that, depending on the workload and hardware used, the throughput degradation caused by FUSE can range from completely imperceptible to as much as 83%, even when optimized, and that the latency of FUSE file system operations can be anywhere from unchanged to 4× higher than Ext4. On the resource utilization side, FUSE can increase relative CPU utilization by up to 31% and underutilize disk bandwidth by as much as 80% compared to Ext4, though for many data-intensive workloads the impact was statistically indistinguishable. Our conclusion is that user-space file systems can indeed be used in production (non-“toy”) settings, but their applicability depends on the expected workloads.
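A back-of-the-envelope way to reproduce the kind of throughput comparison reported here (the paper itself uses Filebench workloads, not this harness) is to time the same fsync-heavy write loop on a native Ext4 directory and on a FUSE-backed directory. The mount paths below are placeholders, not paths from the paper.

```python
# Simple write-throughput harness for comparing two mount points.
# Paths are placeholders; adjust to an Ext4 directory and a FUSE mount.

import os, time

def write_throughput(dirpath, n_files=200, size=1 << 20):
    payload = os.urandom(size)
    start = time.perf_counter()
    for i in range(n_files):
        path = os.path.join(dirpath, f"bench_{i}.dat")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())      # push the write all the way to the device
    elapsed = time.perf_counter() - start
    return n_files * size / elapsed / 1e6   # MB/s

if __name__ == "__main__":
    native = write_throughput("/mnt/ext4/bench")   # placeholder path
    fused = write_throughput("/mnt/fuse/bench")    # placeholder path
    print(f"relative degradation: {100 * (1 - fused / native):.1f}%")
```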
{"title":"Performance and Resource Utilization of FUSE User-Space File Systems","authors":"Bharath Kumar Reddy Vangoor, Prafful Agarwal, Manu Mathew, Arun Ramachandran, Swaminathan Sivaraman, Vasily Tarasov, E. Zadok","doi":"10.1145/3310148","DOIUrl":"https://doi.org/10.1145/3310148","url":null,"abstract":"Traditionally, file systems were implemented as part of operating systems kernels, which provide a limited set of tools and facilities to a programmer. As the complexity of file systems grew, many new file systems began being developed in user space. Low performance is considered the main disadvantage of user-space file systems but the extent of this problem has never been explored systematically. As a result, the topic of user-space file systems remains rather controversial: while some consider user-space file systems a “toy” not to be used in production, others develop full-fledged production file systems in user space. In this article, we analyze the design and implementation of a well-known user-space file system framework, FUSE, for Linux. We characterize its performance and resource utilization for a wide range of workloads. We present FUSE performance and also resource utilization with various mount and configuration options, using 45 different workloads that were generated using Filebench on two different hardware configurations. We instrumented FUSE to extract useful statistics and traces, which helped us analyze its performance bottlenecks and present our analysis results. Our experiments indicate that depending on the workload and hardware used, performance degradation (throughput) caused by FUSE can be completely imperceptible or as high as −83%, even when optimized; and latencies of FUSE file system operations can be increased from none to 4× when compared to Ext4. On the resource utilization side, FUSE can increase relative CPU utilization by up to 31% and underutilize disk bandwidth by as much as −80% compared to Ext4, though for many data-intensive workloads the impact was statistically indistinguishable. Our conclusion is that user-space file systems can indeed be used in production (non-“toy”) settings, but their applicability depends on the expected workloads.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128665544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

CrashMonkey and ACE
Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, P. Raju, Vijay Chidambaram
DOI: 10.1145/3320275 (ACM Transactions on Storage, published 2019-04-20)
We present CrashMonkey and Ace, a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey is a record-and-replay framework which tests a given workload on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. Ace automatically generates all the workloads to be run on the target file system. We build CrashMonkey and Ace based on a new approach to test file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last 5 years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly created file system, and that all reported bugs result from crashes after fsync()-related system calls. CrashMonkey and Ace are able to find 24 out of the 26 crash-consistency bugs reported in the last 5 years. Our tools also revealed 10 new crash-consistency bugs in widely used, mature Linux file systems, 7 of which existed in the kernel since 2014. Additionally, our tools found a crash-consistency bug in a verified file system, FSCQ. The new bugs result in severe consequences like broken rename atomicity, loss of persisted files and directories, and data loss.
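The "bounded" part of bounded black-box crash testing is easy to illustrate: fix a small set of operations and arguments and enumerate every sequence up to a length bound. The toy Python generator below is our illustration, not Ace's actual workload generator; the operation set and argument pool are deliberately tiny.

```python
# Toy enumeration of a bounded workload space: every sequence of at most
# three (operation, file) pairs drawn from small, fixed pools.

from itertools import product

OPS = ["creat", "write", "rename", "unlink", "fsync"]
FILES = ["A", "B"]

def bounded_workloads(max_ops=3):
    """Yield every sequence of up to max_ops (operation, file) pairs."""
    for length in range(1, max_ops + 1):
        for combo in product(product(OPS, FILES), repeat=length):
            yield list(combo)

if __name__ == "__main__":
    workloads = list(bounded_workloads())
    print(len(workloads), "workloads in the bounded space")   # 10 + 100 + 1000
    print(workloads[0], "...", workloads[-1])
```

Each generated workload would then be run on the target file system, crashes simulated at persistence points, and the recovered state checked for consistency, which is the role the abstract describes for CrashMonkey.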
{"title":"CrashMonkey and ACE","authors":"Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, P. Raju, Vijay Chidambaram","doi":"10.1145/3320275","DOIUrl":"https://doi.org/10.1145/3320275","url":null,"abstract":"We present CrashMonkey and Ace, a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey is a record-and-replay framework which tests a given workload on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. Ace automatically generates all the workloads to be run on the target file system. We build CrashMonkey and Ace based on a new approach to test file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last 5 years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly created file system, and that all reported bugs result from crashes after fsync()-related system calls. CrashMonkey and Ace are able to find 24 out of the 26 crash-consistency bugs reported in the last 5 years. Our tools also revealed 10 new crash-consistency bugs in widely used, mature Linux file systems, 7 of which existed in the kernel since 2014. Additionally, our tools found a crash-consistency bug in a verified file system, FSCQ. The new bugs result in severe consequences like broken rename atomicity, loss of persisted files and directories, and data loss.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127747867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

CGraph
Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, Ligang He
DOI: 10.1145/3319406 (ACM Transactions on Storage, published 2019-04-20)
Distributed graph processing platforms usually need to handle massive numbers of Concurrent iterative Graph Processing (CGP) jobs issued for different purposes. However, existing distributed systems face a high ratio of data access cost to computation for the CGP jobs, which incurs low throughput. We observed that there are strong spatial and temporal correlations among the data accesses issued by different CGP jobs, because these concurrently running jobs usually need to repeatedly traverse the shared graph structure for the iterative processing of each vertex. Based on this observation, this article proposes CGraph, a distributed storage and processing system that efficiently handles the underlying static or evolving graph for CGP jobs to achieve high throughput. It uses a data-centric load-trigger-pushing model, together with several optimizations, to let the CGP jobs efficiently share the graph structure data in the cache/memory, and their accesses to it, by fully exploiting such correlations; the graph structure data is decoupled from the vertex state associated with each job. CGraph can deliver much higher throughput for CGP jobs by effectively reducing their average ratio of data access cost to computation. Experimental results show that CGraph improves the throughput of CGP jobs by up to 3.47× in comparison with existing solutions on distributed platforms.
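The data-centric idea can be sketched in a few lines (our simplification, not CGraph's implementation): the shared graph structure is streamed in once per iteration and every concurrent job's update function is driven over each loaded chunk, while per-job vertex state stays private.

```python
# Our sketch of sharing one structure scan across concurrent graph jobs.

def run_iteration(graph_chunks, jobs):
    """graph_chunks: iterable of lists of (src, dst) edges.
    jobs: objects with a per-job `state` dict and an `update(src, dst, state)`."""
    for chunk in graph_chunks:          # the structure is streamed in once...
        for src, dst in chunk:
            for job in jobs:            # ...and reused by every concurrent job
                job.update(src, dst, job.state)

class ReachabilityJob:
    """Example job: propagate reachability from a source vertex."""
    def __init__(self, source):
        self.state = {source: True}

    def update(self, src, dst, state):
        if state.get(src):
            state[dst] = True
```

For instance, run_iteration([[(1, 2), (2, 3)]], [ReachabilityJob(1)]) marks vertices 2 and 3 reachable; the point is that adding more concurrent jobs reuses the same structure scan instead of multiplying it, which is the ratio the abstract says CGraph reduces.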
{"title":"CGraph","authors":"Yu Zhang, Jin Zhao, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, Ligang He","doi":"10.1145/3319406","DOIUrl":"https://doi.org/10.1145/3319406","url":null,"abstract":"Distributed graph processing platforms usually need to handle massive Concurrent iterative Graph Processing (CGP) jobs for different purposes. However, existing distributed systems face high ratio of data access cost to computation for the CGP jobs, which incurs low throughput. We observed that there are strong spatial and temporal correlations among the data accesses issued by different CGP jobs, because these concurrently running jobs usually need to repeatedly traverse the shared graph structure for the iterative processing of each vertex. Based on this observation, this article proposes a distributed storage and processing system CGraph for the CGP jobs to efficiently handle the underlying static/evolving graph for high throughput. It uses a data-centric load-trigger-pushing model, together with several optimizations, to enable the CGP jobs to efficiently share the graph structure data in the cache/memory and their accesses by fully exploiting such correlations, where the graph structure data is decoupled from the vertex state associated with each job. It can deliver much higher throughput for the CGP jobs by effectively reducing their average ratio of data access cost to computation. Experimental results show that CGraph improves the throughput of the CGP jobs by up to 3.47× in comparison with existing solutions on distributed platforms.","PeriodicalId":273014,"journal":{"name":"ACM Transactions on Storage (TOS)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131245406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}