Virtualization-aware access control for multitenant filesystems
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855543
Giorgos Kappes, A. Hatzieleftheriou, S. Anastasiadis
In a virtualization environment that serves multiple tenants, storage consolidation at the filesystem level is desirable because it enables data sharing, administration efficiency, and performance optimizations. The scalable deployment of filesystems in such environments is challenging due to the intermediate translation layers required for networked file access or identity management. First, we present several security requirements of multitenant filesystems. Then, we introduce the design of the Dike authorization architecture, which combines native access control with tenant namespace isolation and compatibility with object-based filesystems. We experimentally evaluate a prototype implementation of Dike on a public cloud. At several thousand tenants, our prototype incurs a limited performance overhead of up to 16%, unlike an existing solution whose multitenancy overhead approaches 84% in some cases.
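To make the combination of native access control and tenant namespace isolation concrete, here is a minimal sketch of a tenant-scoped permission check. The structures and names are illustrative assumptions for this listing, not Dike's actual data layout or API.

```python
# Minimal sketch of tenant-scoped authorization in the spirit of Dike's
# design (native ACLs plus tenant namespace isolation). All names and
# structures are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class ObjectACL:
    tenant_id: int                             # tenant that owns the object
    perms: dict = field(default_factory=dict)  # principal -> set of ops

def authorize(acl: ObjectACL, tenant_id: int, principal: str, op: str) -> bool:
    # Namespace isolation: a request from another tenant never even
    # reaches a permission check on this object.
    if acl.tenant_id != tenant_id:
        return False
    # Native access control: per-principal permission bits stored
    # alongside the object, checked without an external identity gateway.
    return op in acl.perms.get(principal, set())

acl = ObjectACL(tenant_id=7, perms={"alice": {"r", "w"}})
assert authorize(acl, tenant_id=7, principal="alice", op="r")
assert not authorize(acl, tenant_id=8, principal="alice", op="r")  # cross-tenant
```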
{"title":"Virtualization-aware access control for multitenant filesystems","authors":"Giorgos Kappes, A. Hatzieleftheriou, S. Anastasiadis","doi":"10.1109/MSST.2014.6855543","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855543","url":null,"abstract":"In a virtualization environment that serves multiple tenants, storage consolidation at the filesystem level is desirable because it enables data sharing, administration efficiency, and performance optimizations. The scalable deployment of filesystems in such environments is challenging due to intermediate translation layers required for networked file access or identity management. First we present several security requirements in multitenant filesystems. Then we introduce the design of the Dike authorization architecture. It combines native access control with tenant namespace isolation and compatibility to object-based filesystems. We use a public cloud to experimentally evaluate a prototype implementation of Dike that we developed. At several thousand tenants, our prototype incurs limited performance overhead up to 16%, unlike an existing solution whose multitenancy overhead approaches 84% in some cases.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127301539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tyche: An efficient Ethernet-based protocol for converged networked storage
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855540
Pilar González-Férez, A. Bilas
Current technology trends for the efficient use of infrastructures dictate that storage converges with computation by placing storage devices, such as NVM-based cards and drives, in the servers themselves. With converged storage, the interconnect among servers becomes more important for achieving high I/O throughput. Given that Ethernet is emerging as the dominant technology for datacenters, it becomes imperative to examine how to reduce protocol overheads for accessing remote storage over Ethernet interconnects. In this paper we propose Tyche, a network storage protocol that runs directly on top of Ethernet and does not require any hardware support from the network interface. Therefore, Tyche can be deployed in existing infrastructures and can co-exist with other Ethernet-based protocols. Tyche presents remote storage as a local block device and can support any existing filesystem. Our approach rests on two main axes: reducing host-level overheads and scaling with the number of cores and network interfaces in a server, both targeting high I/O throughput in future servers. We reduce overheads via a copy-reduction technique, storage-specific packet processing, memory pre-allocation, and RDMA-like operations that require no hardware support. We transparently handle multiple NICs and offer improved scaling with the number of links and cores via reduced synchronization, proper packet queue design, and NUMA affinity management. Our results show that Tyche achieves scalable I/O throughput, up to 6.4 GB/s for reads and 6.8 GB/s for writes with six 10 GigE NICs. Our analysis shows that although multiple aspects of the protocol play a role in performance, NUMA affinity is particularly important. Compared to NBD, Tyche performs better by up to an order of magnitude.
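As a sketch of what "a storage protocol directly on top of Ethernet" means in practice, the snippet below frames a block-I/O request straight into an Ethernet payload, with no TCP/IP in the path. The field layout is an assumption made for illustration; it is not Tyche's actual wire format.

```python
# Illustrative framing of a block-I/O request directly in an Ethernet
# frame. The header fields below are assumptions, not Tyche's format.
import struct

ETHERTYPE_STORAGE = 0x88B5  # IEEE "local experimental" EtherType
HDR = struct.Struct("!6s6sH B B H Q I")  # dst MAC, src MAC, EtherType,
                                         # opcode, flags, tag, LBA, length

def build_request(dst_mac: bytes, src_mac: bytes, opcode: int,
                  tag: int, lba: int, nblocks: int) -> bytes:
    # No TCP/IP headers: per-packet processing and protocol overhead
    # stay low, which is the point of running storage natively on Ethernet.
    return HDR.pack(dst_mac, src_mac, ETHERTYPE_STORAGE,
                    opcode, 0, tag, lba, nblocks)

frame = build_request(b"\x02\x00\x00\x00\x00\x01", b"\x02\x00\x00\x00\x00\x02",
                      opcode=1, tag=42, lba=4096, nblocks=8)
assert HDR.unpack(frame)[6] == 4096  # LBA round-trips through the header
```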
{"title":"Tyche: An efficient Ethernet-based protocol for converged networked storage","authors":"Pilar González-Férez, A. Bilas","doi":"10.1109/MSST.2014.6855540","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855540","url":null,"abstract":"Current technology trends for efficient use of infrastructures dictate that storage converges with computation by placing storage devices, such as NVM-based cards and drives, in the servers themselves. With converged storage the role of the interconnect among servers becomes more important for achieving high I/O throughput. Given that Ethernet is emerging as the dominant technology for datacenters, it becomes imperative to examine how to reduce protocol overheads for accessing remote storage over Ethernet interconnects. In this paper we propose Tyche, a network storage protocol directly on top of Ethernet, which does not require any hardware support from the network interface. Therefore, Tyche can be deployed in existing infrastructures and to co-exist with other Ethernet-based protocols. Tyche presents remote storage as a local block device and can support any existing filesystem. At the heart of our approach, there are two main axis: reduction of host-level overheads and scaling with the number of cores and network interfaces in a server. Both target at achieving high I/O throughput in future servers. We reduce overheads via a copy-reduction technique, storage-specific packet processing, pre-allocation of memory, and using RDMA-like operations without requiring hardware support. We transparently handle multiple NICs and offer improved scaling with the number of links and cores via reduced synchronization, proper packet queue design, and NUMA affinity management. Our results show that Tyche achieves scalable I/O throughput, up to 6.4 GB/s for reads and 6.8 GB/s for writes with 6 × 10 GigE NICs. Our analysis shows that although multiple aspects of the protocol play a role for performance, NUMA affinity is particularly important. When comparing to NBD, Tyche performs better by up to one order of magnitude.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130966115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PLC-cache: Endurable SSD cache for deduplication-based primary storage
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855536
Jian Liu, Yunpeng Chai, X. Qin, Y. Xiao
Data deduplication techniques improve cost efficiency by dramatically reducing the space needs of storage systems. SSD-based data caches have been adopted to remedy the declining I/O performance induced by deduplication operations in latency-sensitive primary storage. Unfortunately, the frequent data updates caused by classical cache algorithms (e.g., FIFO, LRU, and LFU) inevitably slow down SSDs' I/O processing speed while significantly shortening their lifetime. To address this problem, we propose a new approach, PLC-Cache, that greatly improves both the I/O performance and the write durability of SSDs. PLC-Cache amplifies the proportion of Popular and Long-term Cached (PLC) data, i.e., data that is written infrequently and kept in the SSD cache over a long period to catalyze cache hits, within the entire set of data written to the SSD. PLC-Cache advocates a two-phase approach. First, non-popular data is ruled out from being written to SSDs. Second, PLC-Cache converts as much of the SSD-written data as possible into PLC data. Our experimental results based on a practical deduplication system indicate that, compared with existing caching schemes, PLC-Cache shortens data access latency by an average of 23.4%. Importantly, PLC-Cache improves the lifetime of SSD-based caches by reducing the amount of data written to SSDs by a factor of 15.7.
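A minimal sketch of the first phase, the admission filter, is shown below: a block is written to the SSD only after it has proven popular, so flash writes are spent on blocks likely to stay cached. The threshold and data structures are illustrative assumptions, not the paper's exact algorithm.

```python
# Admission-filtered SSD cache sketch: unpopular blocks never reach the
# SSD. Threshold and structures are assumptions, not PLC-Cache's exact design.
from collections import OrderedDict, defaultdict

class AdmissionFilteredCache:
    def __init__(self, capacity: int, admit_after: int = 2):
        self.capacity = capacity
        self.admit_after = admit_after
        self.ssd = OrderedDict()      # block -> data, kept in LRU order
        self.seen = defaultdict(int)  # access counts for non-cached blocks
        self.flash_writes = 0

    def access(self, block: int, data: bytes) -> bool:
        if block in self.ssd:
            self.ssd.move_to_end(block)   # refresh recency
            return True                   # cache hit
        self.seen[block] += 1
        if self.seen[block] >= self.admit_after:  # popular enough: admit
            if len(self.ssd) >= self.capacity:
                self.ssd.popitem(last=False)      # evict coldest block
            self.ssd[block] = data
            self.flash_writes += 1        # only admitted blocks cost a write
            del self.seen[block]
        return False                      # cache miss
```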
{"title":"PLC-cache: Endurable SSD cache for deduplication-based primary storage","authors":"Jian Liu, Yunpeng Chai, X. Qin, Y. Xiao","doi":"10.1109/MSST.2014.6855536","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855536","url":null,"abstract":"Data deduplication techniques improve cost efficiency by dramatically reducing space needs of storage systems. SSD-based data cache has been adopted to remedy the declining I/O performance induced by deduplication operations in the latency-sensitive primary storage. Unfortunately, frequent data updates caused by classical cache algorithms (e.g., FIFO, LRU, and LFU) inevitably slow down SSDs' I/O processing speed while significantly shortening SSDs' lifetime. To address this problem, we propose a new approach-PLC-Cache-to greatly improve the I/O performance as well as write durability of SSDs. PLC-Cache is conducive to amplifying the proportion of the Popular and Long-term Cached (PLC) data, which is infrequently written and kept in SSD cache in a long time period to catalyze cache hits, in an entire SSD written data set. PLC-Cache advocates a two-phase approach. First, non-popular data are ruled out from being written into SSDs. Second, PLC-Cache makes an effort to convert SSD written data into PLC-data as much as possible. Our experimental results based on a practical deduplication system indicate that compared with the existing caching schemes, PLC-Cache shortens data access latency by an average of 23.4%. Importantly, PLC-Cache improves the lifetime of SSD-based caches by reducing the amount of data written to SSDs by a factor of 15.7.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131084956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward I/O-efficient protection against silent data corruptions in RAID arrays
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855548
Mingqiang Li, P. Lee
Although RAID is a well-known technique to protect data against disk errors, it is vulnerable to silent data corruptions that cannot be detected by disk drives. Existing integrity protection schemes designed for RAID arrays often introduce high I/O overhead. Our key insight is that by properly designing an integrity protection scheme that adapts to the read/write characteristics of storage workloads, the I/O overhead can be significantly mitigated. Accordingly, this paper presents a systematic study of I/O-efficient integrity protection against silent data corruptions in RAID arrays. We formalize an integrity checking model and show that a large proportion of disk reads can be checked with simpler and more I/O-efficient integrity checking mechanisms. Based on this model, we construct two integrity protection schemes that provide complementary performance advantages for storage workloads with different user write sizes. We further propose a quantitative method for choosing between the two schemes in real-world scenarios. Our trace-driven simulation results show that with the appropriate integrity protection scheme, we can reduce the I/O overhead to below 15%.
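The basic mechanism behind any such scheme is checksum verification on the read path: a mismatch exposes corruption the drive itself did not report. The sketch below illustrates that general mechanism only; it does not reproduce the paper's two schemes or its checking model.

```python
# Read-time integrity checking sketch: each block carries a checksum
# stored apart from the data, and a mismatch on read signals silent
# corruption. Illustrates the general mechanism, not the paper's schemes.
import zlib

class ChecksummedStore:
    def __init__(self):
        self.blocks = {}   # lba -> data
        self.sums = {}     # lba -> crc32, kept apart from the data path

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data
        self.sums[lba] = zlib.crc32(data)

    def read(self, lba: int) -> bytes:
        data = self.blocks[lba]
        if zlib.crc32(data) != self.sums[lba]:
            # Silent corruption detected: a real RAID layer would now
            # reconstruct the block from parity instead of returning it.
            raise IOError(f"silent corruption at LBA {lba}")
        return data

store = ChecksummedStore()
store.write(0, b"payload")
store.blocks[0] = b"pAyload"   # simulate a bit flip the disk missed
try:
    store.read(0)
except IOError as e:
    print(e)
```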
{"title":"Toward I/O-efficient protection against silent data corruptions in RAID arrays","authors":"Mingqiang Li, P. Lee","doi":"10.1109/MSST.2014.6855548","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855548","url":null,"abstract":"Although RAID is a well-known technique to protect data against disk errors, it is vulnerable to silent data corruptions that cannot be detected by disk drives. Existing integrity protection schemes designed for RAID arrays often introduce high I/O overhead. Our key insight is that by properly designing an integrity protection scheme that adapts to the read/write characteristics of storage workloads, the I/O overhead can be significantly mitigated. In view of this, this paper presents a systematic study on I/O-efficient integrity protection against silent data corruptions in RAID arrays. We formalize an integrity checking model, and justify that a large proportion of disk reads can be checked with simpler and more I/O-efficient integrity checking mechanisms. Based on this integrity checking model, we construct two integrity protection schemes that provide complementary performance advantages for storage workloads with different user write sizes. We further propose a quantitative method for choosing between the two schemes in real-world scenarios. Our trace-driven simulation results show that with the appropriate integrity protection scheme, we can reduce the I/O overhead to below 15%.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127066673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advanced magnetic tape technology for linear tape systems: Barium ferrite technology beyond the limitation of metal particulate media
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855556
O. Shimizu, T. Harasawa, H. Noguchi
We survey the history of using metal particulate media in linear tape systems to enhance cartridge capacity, discuss the limitations of metal particulate media, and introduce advanced magnetic tape technology based on barium ferrite particulate media, focusing on the choice of magnetic particles, surface profile design, and particle orientation control. The increase in cartridge capacity has been accelerated by combining barium ferrite particles with ultrathin-layer coating technology and by controlling the barium ferrite particle orientation and surface asperities, which reduce the surface frictional force without increasing the head-to-media spacing.
{"title":"Advanced magnetic tape technology for linear tape systems: Barium ferrite technology beyond the limitation of metal particulate media","authors":"O. Shimizu, T. Harasawa, H. Noguchi","doi":"10.1109/MSST.2014.6855556","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855556","url":null,"abstract":"We surveyed the history of using metal particulate media in linear tape systems to enhance cartridge capacity, discussed the metal particulate media limitations, and introduced advanced barium-ferrite-particulate-media-based magnetic tape technology, focusing on the use of magnetic particles, surface profile design, and particle orientation control. The increase in cartridge capacity has been accelerated by combining barium ferrite particles with ultrathin layer coating technology and by controlling the barium ferrite particle orientation and surface asperities, which reduce the surface frictional force without increasing the head-to-media spacing.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115436274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NAND flash architectures reducing write amplification through multi-write codes
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855549
S. Odeh, Yuval Cassuto
Multi-write codes hold great promise to reduce write amplification in flash-based storage devices. In this work we propose two novel mapping architectures that show a clear advantage over known schemes that use multi-write codes, as well as over schemes that do not use such codes. We demonstrate the advantage of the proposed architectures by evaluating them with industry-accepted benchmark traces. The results show write amplification savings of double-digit percentages for over-provisioning as low as 10%. Beyond showing the superiority of the new architectures on real-world workloads, the paper studies write-amplification performance on synthetically generated workloads with time locality. Finally, we provide analytical insight to assist the deployment of the architectures in real storage devices with varying device parameters.
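For readers unfamiliar with multi-write codes, the worked example below shows the classic Rivest-Shamir write-once-memory (WOM) code: 2 bits are written twice into 3 one-way cells (bits may only go 0 to 1 between erases), so two logical updates cost one erase. This is the textbook construction, shown for intuition; the paper's mapping architectures build on such codes rather than on this exact code.

```python
# Rivest-Shamir 2-write WOM code: 2 data bits, 3 cells, 2 writes per erase.
FIRST  = {0b00: 0b000, 0b01: 0b100, 0b10: 0b010, 0b11: 0b001}
SECOND = {d: c ^ 0b111 for d, c in FIRST.items()}   # bitwise complements
DECODE = {**{c: d for d, c in FIRST.items()},
          **{c: d for d, c in SECOND.items()}}

def wom_write(cells: int, data: int) -> int:
    """Return a new cell state encoding `data`, only ever setting bits."""
    if DECODE.get(cells) == data:
        return cells                          # already encodes the data
    target = FIRST[data] if cells == 0 else SECOND[data]
    assert target & cells == cells            # 0 -> 1 transitions only
    return target

cells = 0b000
cells = wom_write(cells, 0b01)   # first write:  100
cells = wom_write(cells, 0b10)   # second write: 101, no erase needed
assert DECODE[cells] == 0b10
```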
{"title":"NAND flash architectures reducing write amplification through multi-write codes","authors":"S. Odeh, Yuval Cassuto","doi":"10.1109/MSST.2014.6855549","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855549","url":null,"abstract":"Multi-write codes hold great promise to reduce write amplification in flash-based storage devices. In this work we propose two novel mapping architectures that show clear advantage over known schemes using multi-write codes, and over schemes not using such codes. We demonstrate the advantage of the proposed architectures by evaluating them with industry-accepted benchmark traces. The results show write amplification savings of double-digit percentages, for as low as 10% over-provisioning. In addition to showing the superiority of the new architectures on real-world workloads, the paper includes a study of the write-amplification performance on synthetically-generated workloads with time locality. In addition, some analytical insight is provided to assist the deployment of the architectures in real storage devices with varying device parameters.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115525001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DedupT: Deduplication for tape systems
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855555
Abdullah Gharaibeh, C. Constantinescu, Maohua Lu, R. Routray, Anurag Sharma, P. Sarkar, David A. Pease, M. Ripeanu
Deduplication is a commonly used technique in disk-based storage pools. However, deduplication has not been used for tape-based pools: tape characteristics such as high mount and seek times, combined with the data fragmentation that deduplication introduces, lead to unacceptably high retrieval times. This work proposes DedupT, a system that efficiently supports deduplication on tape pools. This paper (i) details the main challenges in enabling efficient deduplication on tape libraries; (ii) presents a class of solutions based on graph modeling of the similarity between data items, which enables efficient placement on tapes; and (iii) presents the design and evaluation of novel cross-tape and on-tape chunk placement algorithms that alleviate tape mount time overhead and reduce on-tape data fragmentation. Using 4.5 TB of real-world workloads, we show that DedupT retains at least 95% of the deduplication efficiency. We also show that DedupT mitigates major retrieval time overheads and, because it reads less data, offers better restore performance than restoring non-deduplicated data.
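The following sketch gives a feel for the graph-modeling idea: files are nodes, edge weights count the deduplicated chunks two files share, and files are greedily grouped onto tapes with their strongest neighbors so a restore touches few tapes. It is a simplified illustration under those assumptions, not DedupT's actual placement algorithms.

```python
# Similarity-graph placement sketch: co-locate files that share chunks.
from itertools import combinations

def build_similarity_graph(file_chunks: dict) -> dict:
    graph = {}
    for (f1, c1), (f2, c2) in combinations(file_chunks.items(), 2):
        shared = len(set(c1) & set(c2))   # edge weight = shared chunks
        if shared:
            graph[(f1, f2)] = shared
    return graph

def assign_tapes(file_chunks: dict, tape_capacity: int) -> list:
    graph = build_similarity_graph(file_chunks)
    tapes, placed = [], set()
    # Strongest edges first: the most similar files land together.
    for (f1, f2), _ in sorted(graph.items(), key=lambda kv: -kv[1]):
        for f in (f1, f2):
            if f not in placed:
                for tape in tapes:
                    if len(tape) < tape_capacity and tape & {f1, f2}:
                        tape.add(f); placed.add(f); break
                else:
                    tapes.append({f}); placed.add(f)
    for f in file_chunks:   # files sharing nothing get their own tape
        if f not in placed:
            tapes.append({f})
    return tapes

chunks = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [9]}
print(assign_tapes(chunks, tape_capacity=2))   # e.g. [{'a', 'b'}, {'c'}]
```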
{"title":"DedupT: Deduplication for tape systems","authors":"Abdullah Gharaibeh, C. Constantinescu, Maohua Lu, R. Routray, Anurag Sharma, P. Sarkar, David A. Pease, M. Ripeanu","doi":"10.1109/MSST.2014.6855555","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855555","url":null,"abstract":"Deduplication is a commonly-used technique on disk-based storage pools. However, deduplication has not been used for tape-based pools: tape characteristics, such as high mount and seek times combined with data fragmentation resulting from deduplication create a toxic combination that leads to unacceptably high retrieval times. This work proposes DedupT, a system that efficiently supports deduplication on tape pools. This paper (i) details the main challenges to enable efficient deduplication on tape libraries, (ii) presents a class of solutions based on graph-modeling of similarity between data items that enables efficient placement on tapes; and (iii) presents the design and evaluation of novel cross-tape and on-tape chunk placement algorithms that alleviate tape mount time overhead and reduce on-tape data fragmentation. Using 4.5 TB of real-world workloads, we show that DedupT retains at least 95% of the deduplication efficiency. We show that DedupT mitigates major retrieval time overheads, and, due to reading less data, is able to offer better restore performance compared to the case of restoring non-deduplicated data.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"17 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120845366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SSD-optimized workload placement with adaptive learning and classification in HPC environments
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855552
Lipeng Wan, Zheng Lu, Qing Cao, Feiyi Wang, S. Oral, B. Settlemyer
In recent years, non-volatile memory devices such as SSDs have emerged as a viable storage solution due to their increasing capacity and decreasing cost. Given the unique capability and capacity requirements of large-scale HPC (High Performance Computing) storage environments, a hybrid configuration (SSD and HDD) may represent one of the most available and balanced solutions with respect to cost and performance. Under this setting, effective data placement, as well as data movement with controlled overhead, becomes a pressing challenge. In this paper, we propose an integrated object placement and movement framework with adaptive learning algorithms to address these issues. Specifically, we present a method that shuffles data objects across storage tiers to optimize data access performance. The method integrates an adaptive learning algorithm in which real-time classification is employed to predict the popularity of data object accesses, so that objects can be placed on, or migrated between, SSD and HDD drives in the most efficient manner. We discuss preliminary results obtained with a simulator we developed, showing that the proposed methods can dynamically adapt storage placement as workloads and access patterns evolve, achieving the best system-level performance, such as throughput.
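As a concrete stand-in for the adaptive learning component, the sketch below trains a tiny online logistic classifier on recency/frequency features and uses its prediction to pick a tier. The features, learning rule, and thresholds are illustrative assumptions, not the paper's classifier.

```python
# Toy popularity classifier driving SSD/HDD placement. Features and
# thresholds are assumptions made for illustration.
import math

class TierPlacer:
    def __init__(self, lr: float = 0.1):
        self.lr = lr
        self.w = [0.0, 0.0, 0.0]   # weights: bias, frequency, recency

    def _predict(self, freq: float, recency: float) -> float:
        z = self.w[0] + self.w[1] * freq + self.w[2] * recency
        return 1.0 / (1.0 + math.exp(-z))    # P(object is hot)

    def observe(self, freq: float, recency: float, was_hot: bool) -> None:
        # Online logistic-regression update from the realized outcome.
        err = (1.0 if was_hot else 0.0) - self._predict(freq, recency)
        for i, x in enumerate((1.0, freq, recency)):
            self.w[i] += self.lr * err * x

    def place(self, freq: float, recency: float) -> str:
        return "SSD" if self._predict(freq, recency) > 0.5 else "HDD"

placer = TierPlacer()
for _ in range(200):   # learn: frequent and recent objects are hot
    placer.observe(freq=1.0, recency=1.0, was_hot=True)
    placer.observe(freq=0.1, recency=0.0, was_hot=False)
print(placer.place(freq=0.9, recency=1.0))   # likely "SSD"
```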
{"title":"SSD-optimized workload placement with adaptive learning and classification in HPC environments","authors":"Lipeng Wan, Zheng Lu, Qing Cao, Feiyi Wang, S. Oral, B. Settlemyer","doi":"10.1109/MSST.2014.6855552","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855552","url":null,"abstract":"In recent years, non-volatile memory devices such as SSD drives have emerged as a viable storage solution due to their increasing capacity and decreasing cost. Due to the unique capability and capacity requirements in large scale HPC (High Performance Computing) storage environment, a hybrid configuration (SSD and HDD) may represent one of the most available and balanced solutions considering the cost and performance. Under this setting, effective data placement as well as movement with controlled overhead become a pressing challenge. In this paper, we propose an integrated object placement and movement framework and adaptive learning algorithms to address these issues. Specifically, we present a method that shuffle data objects across storage tiers to optimize the data access performance. The method also integrates an adaptive learning algorithm where realtime classification is employed to predict the popularity of data object accesses, so that they can be placed on, or migrate between SSD or HDD drives in the most efficient manner. We discuss preliminary results based on this approach using a simulator we developed to show that the proposed methods can dynamically adapt storage placements and access pattern as workloads evolve to achieve the best system level performance such as throughput.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124030793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jericho: Achieving scalability through optimal data placement on multicore systems
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855538
Stelios Mavridis, Yannis Sfakianakis, Anastasios Papagiannis, M. Marazakis, A. Bilas
Achieving high I/O throughput on modern servers presents significant challenges. With increasing core counts, server memory architectures become less uniform, in terms of both latency and bandwidth. In particular, the bandwidth of the interconnect among NUMA nodes is limited compared to local memory bandwidth. Moreover, interconnect congestion and contention introduce additional latency on remote accesses. These challenges severely limit the maximum achievable storage throughput and IOPS rate. Therefore, data and thread placement are critical for data-intensive applications running on NUMA architectures. In this paper we present Jericho, a new I/O stack for the Linux kernel that improves affinity between application threads, kernel threads, and buffers in the storage I/O path. Jericho consists of a NUMA-aware filesystem and a DRAM cache organized in slices mapped to NUMA nodes. The Jericho filesystem implements our task placement policy by dynamically migrating application threads that issue I/Os based on the location of the corresponding I/O buffers. The Jericho DRAM I/O cache, a replacement for the Linux page cache, splits buffer memory into slices and uses per-slice kernel I/O threads for I/O request processing. Our evaluation shows that running the FIO microbenchmark on a modern 64-core server with an unmodified Linux kernel results in only 5% of memory accesses being served by local memory. With Jericho, more than 95% of accesses become local, with a corresponding 2x performance improvement.
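A rough user-space analogue of the placement policy is sketched below: move the thread issuing an I/O onto the CPUs of the NUMA node that owns the I/O buffer, so processing stays node-local. The node lookup is a stub and the node-to-CPU map is a hypothetical example; Jericho itself does this inside the kernel on page locations.

```python
# User-space analogue of migrating an I/O-issuing thread to its buffer's
# NUMA node. NODE_CPUS and numa_node_of are illustrative stand-ins.
import os

# Hypothetical node -> CPU map, as one might read from
# /sys/devices/system/node/node*/cpulist on Linux.
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def numa_node_of(buffer_id: int) -> int:
    return buffer_id % len(NODE_CPUS)   # stand-in for a real page lookup

def migrate_issuer_to_buffer(buffer_id: int) -> None:
    node = numa_node_of(buffer_id)
    # Restrict to CPUs that actually exist on this machine (Linux-only call).
    cpus = NODE_CPUS[node] & os.sched_getaffinity(0)
    if cpus:
        # Pin the calling thread to the buffer's node so its memory
        # accesses avoid the inter-node interconnect.
        os.sched_setaffinity(0, cpus)

migrate_issuer_to_buffer(buffer_id=5)   # issuing thread now runs on node 1
print(os.sched_getaffinity(0))
```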
{"title":"Jericho: Achieving scalability through optimal data placement on multicore systems","authors":"Stelios Mavridis, Yannis Sfakianakis, Anastasios Papagiannis, M. Marazakis, A. Bilas","doi":"10.1109/MSST.2014.6855538","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855538","url":null,"abstract":"Achieving high I/O throughput on modern servers presents significant challenges. With increasing core counts, server memory architectures become less uniform, both in terms of latency as well as bandwidth. In particular, the bandwidth of the interconnect among NUMA nodes is limited compared to local memory bandwidth. Moreover, interconnect congestion and contention introduce additional latency on remote accesses. These challenges severely limit the maximum achievable storage throughput and IOPS rate. Therefore, data and thread placement are critical for data-intensive applications running on NUMA architectures. In this paper we present Jericho, a new I/O stack for the Linux kernel that improves affinity between application threads, kernel threads, and buffers in the storage I/O path. Jericho consists of a NUMA-aware filesystem and a DRAM cache organized in slices mapped to NUMA nodes. The Jericho filesystem implements our task placement policy by dynamically migrating application threads that issue I/Os based on the location of the corresponding I/O buffers. The Jericho DRAM I/O cache, a replacement for the Linux page-cache, splits buffer memory in slices, and uses per-slice kernel I/O threads for I/O request processing. Our evaluation shows that running the FIO microbenchmark on a modern 64-core server with an unmodified Linux kernel results in only 5% of the memory accesses being served by local memory. With Jericho, more than 95% of accesses become local, with a corresponding 2x performance improvement.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121081522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Client-aware cloud storage
Pub Date: 2014-06-02 | DOI: 10.1109/MSST.2014.6855554
Feng Chen, M. Mesnier, Scott Hahn
Cloud storage is receiving significant interest from both academia and industry. As a new storage model, it provides many attractive features, such as high availability, resilience, and cost efficiency. Yet cloud storage also brings many new challenges. In particular, it widens the already-significant semantic gap between applications, which generate data, and storage systems, which manage data. This widening semantic gap makes end-to-end differentiated services extremely difficult. In this paper, we present a client-aware cloud storage framework that allows semantic information to flow from clients, across multiple intermediate layers, to the cloud storage system. In turn, the storage system can differentiate between data classes and enforce predefined policies. We showcase the effectiveness of such client awareness by using Intel's Differentiated Storage Services (DSS) to enhance persistent disk caching and to control I/O traffic to different storage devices. We find that we can significantly outperform LRU-style caching, improving upload bandwidth by 5x and download bandwidth by 1.6x. Further, we can achieve 85% of the performance of a full-SSD solution at only a fraction (14%) of the cost.
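The sketch below shows the client-aware flow in miniature: the client tags each I/O with a semantic class, the tag travels with the request, and the backend applies a per-class policy such as which tier caches the data. The class names and policy table are illustrative assumptions, not DSS's actual taxonomy or interface.

```python
# Class-tagged I/O with per-class backend policy. Classes and policies
# here are assumptions for illustration, not Intel DSS's actual taxonomy.
from dataclasses import dataclass

POLICY = {                     # class -> (cache tier, priority)
    "metadata":  ("ssd", 0),   # small and hot: always cache on flash
    "journal":   ("ssd", 1),
    "user_data": ("hdd", 2),
    "backup":    ("none", 3),  # never pollute the cache
}

@dataclass
class TaggedIO:
    lba: int
    data: bytes
    io_class: str              # semantic hint set by the client

def handle_write(io: TaggedIO) -> str:
    tier, _prio = POLICY.get(io.io_class, ("hdd", 2))
    # The backend differentiates classes instead of treating all blocks
    # uniformly, closing the client/storage semantic gap.
    return f"LBA {io.lba}: cached on {tier}"

print(handle_write(TaggedIO(lba=100, data=b"inode", io_class="metadata")))
```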
{"title":"Client-aware cloud storage","authors":"Feng Chen, M. Mesnier, Scott Hahn","doi":"10.1109/MSST.2014.6855554","DOIUrl":"https://doi.org/10.1109/MSST.2014.6855554","url":null,"abstract":"Cloud storage is receiving high interest in both academia and industry. As a new storage model, it provides many attractive features, such as high availability, resilience, and cost efficiency. Yet, cloud storage also brings many new challenges. In particular, it widens the already-significant semantic gap between applications, which generate data, and storage systems, which manage data. This widening semantic gap makes end-to-end differentiated services extremely difficult. In this paper, we present a client-aware cloud storage framework, which allows semantic information to flow from clients, across multiple intermediate layers, to the cloud storage system. In turn, the storage system can differentiate various data classes and enforce predefined policies. We showcase the effectiveness of enabling such client awareness by using Intel's Differentiated Storage Services (DSS) to enhance persistent disk caching and to control I/O traffic to different storage devices. We find that we can significantly outperform LRU-style caching, improving upload bandwidth by 5x and download bandwidth by 1.6x. Further, we can achieve 85% of the performance of a full-SSD solution at only a fraction (14%) of the cost.","PeriodicalId":188071,"journal":{"name":"2014 30th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121038819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}