Jitter-free co-processing on a prototype exascale storage stack
Pub Date: 2012-04-16 | DOI: 10.1109/MSST.2012.6232382
John Bent, S. Faibish, J. Ahrens, G. Grider, J. Patchett, P. Tzelnic, J. Woodring
In the petascale era, the storage stack used by the extreme scale high performance computing community is fairly homogeneous across sites. On the compute edge of the stack, file system clients or IO forwarding services direct IO over an interconnect network to a relatively small set of IO nodes. These nodes forward the requests over a secondary storage network to a spindle-based parallel file system. Unfortunately, this architecture will become unviable in the exascale era. As the density growth of disks continues to outpace increases in their rotational speeds, disks are becoming increasingly cost-effective for capacity but decreasingly so for bandwidth. Fortunately, new storage media such as solid state devices are filling this gap: although not cost-effective for capacity, they are for performance. This suggests that the storage stack at exascale will incorporate solid state storage between the compute nodes and the parallel file systems. There are three natural places to position this new storage layer: within the compute nodes, within the IO nodes, or within the parallel file system. In this paper, we argue that the IO nodes are the appropriate location for HPC workloads and show results from a prototype system built accordingly. Running a pipeline of computational simulation and visualization, we show that our prototype reduces total time to completion by up to 30%.
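To make the trade-off concrete, the toy model below (our illustration, not the authors' code; all bandwidths, checkpoint sizes, and phase counts are invented assumptions) contrasts time to completion when compute nodes block on writes to the spindle-based parallel file system versus when they block only on fast writes to IO-node solid state storage, whose drain to the file system overlaps the next compute phase.

```python
# Toy timing model for checkpoint placement. Illustrative assumptions only:
# 10 phases of 100 s compute, 1000 GB checkpoints, 50 GB/s IO-node SSDs,
# 10 GB/s parallel file system. None of these numbers come from the paper.

def time_to_completion(n_steps, t_compute, ckpt_gb, ssd_bw, pfs_bw, burst_buffer):
    """Return total seconds for n_steps compute+checkpoint phases."""
    total = 0.0       # wall clock as seen by the compute nodes
    drain_done = 0.0  # when the IO-node SSDs finish draining to the PFS
    for _ in range(n_steps):
        total += t_compute
        if burst_buffer:
            # Compute nodes block only for the fast SSD write on the IO nodes;
            # the slower drain to the PFS overlaps the next compute phase.
            total += ckpt_gb / ssd_bw
            drain_done = max(drain_done, total) + ckpt_gb / pfs_bw
        else:
            # Compute nodes block for the entire write to spinning disk.
            total += ckpt_gb / pfs_bw
    return max(total, drain_done)  # the final drain must also complete

direct = time_to_completion(10, 100.0, 1000.0, 50.0, 10.0, burst_buffer=False)
staged = time_to_completion(10, 100.0, 1000.0, 50.0, 10.0, burst_buffer=True)
print(f"direct to PFS: {direct:.0f} s, via IO-node SSDs: {staged:.0f} s")
```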
{"title":"Jitter-free co-processing on a prototype exascale storage stack","authors":"John Bent, S. Faibish, J. Ahrens, G. Grider, J. Patchett, P. Tzelnic, J. Woodring","doi":"10.1109/MSST.2012.6232382","DOIUrl":"https://doi.org/10.1109/MSST.2012.6232382","url":null,"abstract":"In the petascale era, the storage stack used by the extreme scale high performance computing community is fairly homogeneous across sites. On the compute edge of the stack, file system clients or IO forwarding services direct IO over an interconnect network to a relatively small set of IO nodes. These nodes forward the requests over a secondary storage network to a spindle-based parallel file system. Unfortunately, this architecture will become unviable in the exascale era. As the density growth of disks continues to outpace increases in their rotational speeds, disks are becoming increasingly cost-effective for capacity but decreasingly so for bandwidth. Fortunately, new storage media such as solid state devices are filling this gap; although not cost-effective for capacity, they are so for performance. This suggests that the storage stack at exascale will incorporate solid state storage between the compute nodes and the parallel file systems. There are three natural places into which to position this new storage layer: within the compute nodes, the IO nodes, or the parallel file system. In this paper, we argue that the IO nodes are the appropriate location for HPC workloads and show results from a prototype system that we have built accordingly. Running a pipeline of computational simulation and visualization, we show that our prototype system reduces total time to completion by up to 30%.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128579581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting superpages in a nonvolatile memory file system
Pub Date: 2012-04-16 | DOI: 10.1109/MSST.2012.6232384
Sheng Qiu, A. L. Narasimha Reddy
Emerging nonvolatile memory technologies, sometimes referred to as Storage Class Memory (SCM), are poised to close the enormous performance gap between persistent storage and main memory. SCM devices can be attached directly to the memory bus and accessed like normal DRAM. It then becomes possible to exploit memory management hardware resources to improve file system performance. However, in this case, SCM may share critical system resources such as the TLB and page table with DRAM, which can potentially impact SCM's performance. In this paper, we propose to solve this problem by employing superpages to reduce the pressure on memory management resources such as the TLB, further improving file system performance. We also analyze the space utilization efficiency of superpages. We improve the space efficiency of the file system by allocating normal pages (4KB) for small files while allocating superpages (2MB on x86) for large files. We show that it is possible to achieve better performance without loss of space utilization efficiency of the nonvolatile memory.
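As an illustration of the size-based allocation policy described above, here is a minimal sketch; the simple one-superpage threshold is our assumption, and the paper's actual policy may differ.

```python
# Size-based page allocation: small files get 4 KB base pages, large files
# get 2 MB x86 superpages. The threshold choice is an assumption.

PAGE = 4 * 1024          # normal x86 base page: 4 KB
SUPERPAGE = 2 * 1024**2  # x86 superpage: 2 MB

def allocate(file_size, threshold=SUPERPAGE):
    """Pick a page size for a file; return (page_size, pages, wasted_bytes)."""
    page = SUPERPAGE if file_size >= threshold else PAGE
    n = -(-file_size // page)  # ceiling division
    return page, n, n * page - file_size  # waste = internal fragmentation

for size in (1_500, 64 * 1024, 5 * 1024**2):
    page, n, waste = allocate(size)
    print(f"{size:>8} B file -> {n} page(s) of {page} B, {waste} B wasted")
```

One 2MB superpage covers 512 base pages with a single TLB entry, which is where the reduced pressure on memory management resources comes from; reserving superpages for large files keeps the internal fragmentation that a 2MB minimum allocation would impose on small files in check.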
{"title":"Exploiting superpages in a nonvolatile memory file system","authors":"Sheng Qiu, A. L. Narasimha Reddy","doi":"10.1109/MSST.2012.6232384","DOIUrl":"https://doi.org/10.1109/MSST.2012.6232384","url":null,"abstract":"Emerging nonvolatile memory technologies (sometimes referred as Storage Class Memory (SCM)), are poised to close the enormous performance gap between persistent storage and main memory. The SCM devices can be attached directly to memory bus and accessed like normal DRAM. It becomes then possible to exploit memory management hardware resources to improve file system performance. However, in this case, SCM may share critical system resources such as the TLB, page table with DRAM which can potentially impact SCM's performance. In this paper, we propose to solve this problem by employing superpages to reduce the pressure on memory management resources such as the TLB. As a result, the file system performance is further improved. We also analyze the space utilization efficiency of superpages. We improve space efficiency of the file system by allocating normal pages (4KB) for small files while allocating super pages (2MB on ×86) for large files. We show that it is possible to achieve better performance without loss of space utilization efficiency of nonvolatile memory.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125407327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BloomStore: Bloom-Filter based memory-efficient key-value store for indexing of data deduplication on flash
Pub Date: 2012-04-16 | DOI: 10.1109/MSST.2012.6232390
Guanlin Lu, Youngjin Nam, D. Du
Due to their better scalability, Key-Value (KV) stores have superseded traditional relational databases for many applications, such as data deduplication, online multi-player gaming, and Internet services like Amazon and Facebook. A KV store efficiently supports two operations (key lookup and KV pair insertion) through an index structure that maps keys to their associated values. KV stores are also commonly used to implement the chunk index in data deduplication, where a chunk ID (a SHA-1 value computed from the chunk's content) is the key and the associated chunk metadata (e.g., physical storage location, stream ID) is the value. For a deduplication system, the number of chunks is typically too large to keep the KV store solely in RAM. Thus, the KV store maintains a large (hash-table based) index structure in RAM to index all KV pairs stored on secondary storage; hence, the available RAM space limits the maximum number of KV pairs that can be stored. Moving the index data structure from RAM to flash can overcome this space limitation. In this paper, we propose an efficient KV store on flash with a Bloom Filter based index structure called BloomStore. The unique features of BloomStore are that (1) no index structure needs to be stored in RAM, so a small RAM space can support a large number of KV pairs, and (2) both the index structure and the KV pairs are stored compactly on flash memory to improve performance. Compared with state-of-the-art KV store designs, BloomStore achieves significantly better key lookup performance and roughly the same insertion performance with several times less RAM usage, based on our experiments with deduplication workloads.
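The sketch below is a simplified, BloomStore-inspired index reconstructed from the abstract, not the authors' implementation; buffer sizes, filter parameters, and the in-memory stand-in for flash are all assumptions. Keys hash to an instance; each instance buffers insertions in a small RAM write buffer and, when it fills, flushes the buffer to a "flash page" together with a Bloom filter, so a lookup probes the per-page filters and reads at most the pages whose filters hit.

```python
# Simplified BloomStore-style KV index (illustrative sketch only).
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=4):
        self.bits, self.hashes, self.array = bits, hashes, 0  # int as bit array

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def may_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

class BloomStoreInstance:
    def __init__(self, buffer_slots=4):
        self.buffer, self.slots = {}, buffer_slots
        self.pages = []  # list of (BloomFilter, dict) standing in for flash pages

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.slots:  # flush write buffer to "flash"
            bf = BloomFilter()
            for k in self.buffer:
                bf.add(k)
            self.pages.append((bf, self.buffer))
            self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for bf, page in reversed(self.pages):  # probe newest page first
            if bf.may_contain(key):            # flash read only on a filter hit
                if key in page:
                    return page[key]
        return None

class BloomStore:
    """Partition the key space across instances by key hash."""
    def __init__(self, n_instances=16):
        self.instances = [BloomStoreInstance() for _ in range(n_instances)]

    def _inst(self, key):
        return self.instances[hash(key) % len(self.instances)]

    def put(self, key, value):
        self._inst(key).put(key, value)

    def get(self, key):
        return self._inst(key).get(key)
```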
{"title":"BloomStore: Bloom-Filter based memory-efficient key-value store for indexing of data deduplication on flash","authors":"Guanlin Lu, Youngjin Nam, D. Du","doi":"10.1109/MSST.2012.6232390","DOIUrl":"https://doi.org/10.1109/MSST.2012.6232390","url":null,"abstract":"Due to its better scalability, Key-Value (KV) store has superseded traditional relational databases for many applications, such as data deduplication, on-line multi-player gaming, and Internet services like Amazon and Facebook. The KV store efficiently supports two operations (key lookup and KV pair insertion) through an index structure that maps keys to their associated values. The KV store is also commonly used to implement the chunk index in data deduplication, where a chunk ID (SHA1 value computed based on the chunk's content) is a key and its associative chunk metadata (e.g., physical storage location, stream ID) is the value. For a deduplication system, typically the number of chunks is too large to store the KV store solely in RAM. Thus, the KV store maintains a large (hash-table based) index structure in RAM to index all KV pairs stored on secondary storage. Hence, its available RAM space limits the maximum number of KV pairs that can be stored. Moving the index data structure from RAM to flash can possibly overcome the space limitation. In this paper, we propose efficient KV store on flash with a Bloom Filter based index structure called BloomStore. The unique features of the BloomStore include (1) no index structure is required to be stored in RAM so that a small RAM space can support a large number of KV pairs and (2) both index structure and KV pairs are stored compactly on flash memory to improve its performance. Compared with the state-of-the-art KV store designs, the BloomStore achieves a significantly better key lookup performance and roughly the same insertion performance with multiple times less RAM usage based on our experiments with deduplication workloads.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129234491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing shared RAID performance through online profiling
Pub Date: 2012-04-16 | DOI: 10.1109/MSST.2012.6232383
Ji-guang Wan, Jibin Wang, Yan Liu, Qing Yang, Jianzong Wang, C. Xie
Enterprise storage systems are generally shared by multiple servers in a SAN environment. Our experiments, as well as industry reports, have shown that disk arrays perform poorly when multiple servers share one RAID, due to resource contention and frequent disk head movements. We have studied the IO performance characteristics of several shared storage settings drawn from practical business operations. To avoid this IO contention, we propose a new dynamic data relocation technique for shared RAID storage, referred to as DROP: Dynamic data Relocation to Optimize Performance. DROP allocates and manages a group of cache data areas and relocates ("drops") hot data into a predefined sub-array, a physical partition layered on top of the entire shared array. By analyzing profiling data so that each cache area is owned by one server, we can determine the optimal data relocation and partitioning of disks in the RAID, maximizing large sequential block accesses on individual disks while maximizing parallel accesses across disks in the array. As a result, DROP minimizes disk head movements in the array at run time, giving rise to high IO performance. A prototype of DROP has been implemented as a software module at the storage target controller. Extensive experiments using real-world IO workloads have been carried out to evaluate the performance of the DROP implementation. Experimental results show that DROP improves shared IO performance greatly: the improvements in average IO response time range from 20% to a factor of 2.5, at no additional hardware cost.
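As a rough illustration of the profiling-driven relocation idea (our sketch, with an assumed trace format of (server, LBA) pairs and an assumed fixed sub-array size; the paper's actual algorithm is more involved), the code below counts per-server block heat and lays each server's hottest blocks out contiguously in that server's dedicated sub-array, which is what keeps sequential runs on the same spindles and limits cross-server head movement.

```python
# DROP-style profiling sketch: per-server hot-block detection and a
# relocation plan into per-server sub-arrays. Trace format, sub-array
# sizing, and slot layout are assumptions for illustration.

from collections import Counter, defaultdict

def plan_relocation(trace, subarray_blocks):
    """trace: iterable of (server_id, block_lba) accesses.
    Returns, per server, a mapping of hot LBA -> contiguous slot in that
    server's sub-array."""
    heat = defaultdict(Counter)
    for server, lba in trace:
        heat[server][lba] += 1  # online profiling: count accesses per block
    plan = {}
    for server, counts in heat.items():
        hot = [lba for lba, _ in counts.most_common(subarray_blocks)]
        # Lay hot blocks out contiguously to favor large sequential accesses.
        plan[server] = {lba: slot for slot, lba in enumerate(sorted(hot))}
    return plan

trace = [("srv1", 10), ("srv1", 10), ("srv1", 11), ("srv2", 900), ("srv2", 901)]
print(plan_relocation(trace, subarray_blocks=2))
```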
{"title":"Enhancing shared RAID performance through online profiling","authors":"Ji-guang Wan, Jibin Wang, Yan Liu, Qing Yang, Jianzong Wang, C. Xie","doi":"10.1109/MSST.2012.6232383","DOIUrl":"https://doi.org/10.1109/MSST.2012.6232383","url":null,"abstract":"Enterprise storage systems are generally shared by multiple servers in a SAN environment. Our experiments as well as industry reports have shown that disk arrays show poor performance when multiple servers share one RAID due to resource contention as well as frequent disk head movements. We have studied IO performance characteristics of several shared storage settings of practical business operations. To avoid the IO contention, we propose a new dynamic data relocation technique on shared RAID storages, referred to as DROP, Dynamic data Relocation to Optimize Performance. DROP allocates/manages a group of cache data areas and relocates/drops the portion of hot data at a predefined sub array that is a physical partition on the top of the entire shared array. By analyzing profiling data to make each cache area owned by one server, we are able to determine optimal data relocation and partition of disks in the RAID to maximize large sequential block accesses on individual disks and at the same time maximize parallel accesses across disks in the array. As a result, DROP minimizes disk head movements in the array at run time giving rise to high IO performance. A prototype DROP has been implemented as a software module at the storage target controller. Extensive experiments have been carried out using real world IO workloads to evaluate the performance of the DROP implementation. Experimental results have shown that DROP improves shared IO performance greatly. The performance improvements in terms of average IO response time range from 20% to a factor 2.5 at no additional hardware cost.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124473017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Storage challenges at Los Alamos National Lab
Pub Date: 2012-04-16 | DOI: 10.1109/MSST.2012.6232376
John Bent, G. Grider, B. Kettering, A. Manzanares, Meghan McClelland, Aaron Torres, Alfred Torrez
There yet exist no truly parallel file systems. Those that make the claim fall short when it comes to providing adequate concurrent write performance at large scale. This limitation causes large usability headaches in HPC. Users need two major capabilities missing from current parallel file systems. One, they need low-latency interactivity. Two, they need high bandwidth for large parallel IO; this capability must be robust to IO patterns and should not require tuning. No existing parallel file system provides these features. Frighteningly, exascale renders these features even less attainable from currently available parallel file systems. Fortunately, there is a path forward.
{"title":"Storage challenges at Los Alamos National Lab","authors":"John Bent, G. Grider, B. Kettering, A. Manzanares, Meghan McClelland, Aaron Torres, Alfred Torrez","doi":"10.1109/MSST.2012.6232376","DOIUrl":"https://doi.org/10.1109/MSST.2012.6232376","url":null,"abstract":"There yet exist no truly parallel file systems. Those that make the claim fall short when it comes to providing adequate concurrent write performance at large scale. This limitation causes large usability headaches in HPC. Users need two major capabilities missing from current parallel file systems. One, they need low latency interactivity. Two, they need high bandwidth for large parallel IO; this capability must be resistant to IO patterns and should not require tuning. There are no existing parallel file systems which provide these features. Frighteningly, exascale renders these features even less attainable from currently available parallel file systems. Fortunately, there is a path forward.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"394 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115684929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive pipeline for deduplication
Pub Date: 2012-04-01 | DOI: 10.1109/MSST.2012.6232377
Jingwei Ma, Bin Zhao, G. Wang, X. Liu
Deduplication has become one of the hottest topics in the field of data storage. Quite a few methods for reducing the disk I/O caused by deduplication have been proposed, and some methods have also been studied to accelerate the computational sub-tasks of deduplication. However, the order of these computational sub-tasks can affect overall deduplication throughput significantly, because the sub-tasks exhibit quite different workloads and concurrency in different orders and with different data sets. This paper proposes an adaptive pipelining model for the computational sub-tasks in deduplication that takes both data type and hardware platform into account. Taking as parameters the compression ratio and the duplicate ratio of the data stream, along with the compression and fingerprinting speeds on different processing units, it determines the optimal order of the pipeline stages (computational sub-tasks) and assigns each stage to the processing unit that processes it fastest. That is, "adaptive" means both data-adaptive and hardware-adaptive. Experimental results show that the adaptive pipeline improves deduplication throughput by up to 50% compared with a plain fixed pipeline, which implies that it is suitable for simultaneous deduplication of various data types on modern heterogeneous multi-core systems.
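The following sketch (our illustration; the two-stage model, the cost formulas, and all parameter names are assumptions, and the real system considers more sub-tasks and processing units) shows the flavor of the ordering decision: with a high duplicate ratio, fingerprinting first avoids compressing chunks that will be deduplicated away, while with a strong, fast compressor, compressing first shrinks the data that must be fingerprinted.

```python
# Illustrative cost model for adaptive stage ordering in a two-stage
# dedup pipeline (fingerprint vs. compress). Assumptions throughout.

def pipeline_cost(order, dup_ratio, comp_ratio, fp_speed, comp_speed, size=1.0):
    """Estimated work per unit of input: bytes processed / stage speed."""
    if order == ("fingerprint", "compress"):
        # Fingerprint everything, then compress only the unique chunks.
        return size / fp_speed + size * (1 - dup_ratio) / comp_speed
    else:
        # Compress everything, then fingerprint the smaller compressed data.
        return size / comp_speed + size * comp_ratio / fp_speed

def best_order(**params):
    orders = [("fingerprint", "compress"), ("compress", "fingerprint")]
    return min(orders, key=lambda o: pipeline_cost(o, **params))

# A backup-like stream with many duplicates: fingerprint-first wins here.
print(best_order(dup_ratio=0.8, comp_ratio=0.5, fp_speed=400, comp_speed=200))
```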
{"title":"Adaptive pipeline for deduplication","authors":"Jingwei Ma, Bin Zhao, G. Wang, X. Liu","doi":"10.1109/MSST.2012.6232377","DOIUrl":"https://doi.org/10.1109/MSST.2012.6232377","url":null,"abstract":"Deduplication has become one of the hottest topics in the field of data storage. Quite a few methods towards reducing disk I/O caused by deduplication have been proposed. Some methods also have been studied to accelerate computational sub-tasks in deduplication. However, the order of computational sub-tasks can affect overall deduplication throughput significantly, because computational sub-tasks exhibit quite different workload and concurrency in different orders and with different data sets. This paper proposes an adaptive pipelining model for the computational sub-tasks in deduplication. It takes both data type and hardware platform into account. Taking the compression ratio and the duplicate ratio of the data stream, and the compression speed and the fingerprinting speed on different processing units as parameters, it determines the optimal order of the pipeline stages (computational sub-tasks) and assigns each stage to the processing unit which processes it fastest. That is, “adaptive” refers to both data adaptive and hardware adaptive. Experimental results show that the adaptive pipeline improves the deduplication throughput up to 50% compared with the plain fixed pipeline, which implies that it is suitable for simultaneous deduplication of various data types on modern heterogeneous multi-core systems.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126491353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}