CULZSS-Bit: A Bit-Vector Algorithm for Lossless Data Compression on GPGPUs
Adnan Ozsoy. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.9

In this paper, we describe an algorithm to improve dictionary-based lossless data compression on GPGPUs. The presented algorithm uses bit-wise computations and leverages bit parallelism for the core of the algorithm: the longest-prefix-match calculation. Bit parallelism, also known as the bit-vector approach, is a fundamentally new approach to data compression and promises strong performance in hybrid CPU-GPU environments. The GPU implementation of the new compression algorithm outperforms previous attempts. Moreover, the bit-vector approach opens new opportunities for improvement and broadens the applicability of popular heterogeneous environments.
{"title":"CULZSS-Bit: A Bit-Vector Algorithm for Lossless Data Compression on GPGPUs","authors":"Adnan Ozsoy","doi":"10.1109/DISCS.2014.9","DOIUrl":"https://doi.org/10.1109/DISCS.2014.9","url":null,"abstract":"In this paper, we describe an algorithm to improve dictionary based lossless data compression on GPGPUs. The presented algorithm uses bit-wise computations and leverages bit parallelism for the core part of the algorithm which is the longest prefix match calculations. Using bit parallelism, also known as bit-vector approach, is a fundamentally new approach for data compression and promising in performance for hybrid CPU-GPU environments.The implementation of the new compression algorithm on GPUs improves the performance of the compression process compared to the previous attempts. Moreover, the bit-vector approach opens new opportunities for improvement and increases the applicability of popular heterogeneous environments.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115084649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Caching Approach to Reduce Communication in Graph Search Algorithms
Pietro Cicotti, L. Carrington. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.8

In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: a few high-degree vertices connect many low-degree vertices. Despite the randomness of a graph search, it is possible to capitalize on this characteristic and cache relevant information about high-degree vertices. We applied this idea by caching remote vertex ids in a parallel breadth-first search implementation, and demonstrated a 1.6x to 2.4x speedup over the reference implementation on 64 to 1024 cores. We also proposed a system design in which resources are dedicated exclusively to caching and shared among a set of nodes; our evaluation demonstrates that this design has the potential to reduce communication and improve performance on large-scale systems. Finally, we used a memcached system as the cache server and found that a generic protocol that does not match the usage semantics may hinder the potential performance improvements.
{"title":"A Caching Approach to Reduce Communication in Graph Search Algorithms","authors":"Pietro Cicotti, L. Carrington","doi":"10.1109/DISCS.2014.8","DOIUrl":"https://doi.org/10.1109/DISCS.2014.8","url":null,"abstract":"In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: few high-degree vertexes connect many low-degree vertexes. Despite the randomness in a graph search, it is possible to capitalize on this characteristic and cache relevant information in high-degree vertexes. We applied this idea by caching remote vertex ids in a parallel breadth-first search implementation, and demonstrated 1.6x to 2.4x speedup over the reference implementation on 64 to 1024 cores. We proposed a system design in which resources are dedicated exclusively to caching, and shared among a set of nodes. Our evaluation demonstrates that this design has the potential to reduce communication and improve performance over large scale systems. Finally, we used a memcached system as the cache server finding that a generic protocol that does not match the usage semantics may hinder the potential performance improvements.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126875567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mapping of RAID Controller Performance Data to the Job History on Large Computing Systems
Marc Hartung, Michael Kluge. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.7

For systems executing a mixture of data-intensive applications in parallel, there is always the question of what impact each application has on the storage subsystem. From the perspective of storage, I/O is typically anonymous: it carries no user identifiers or similar information. This paper focuses on the analysis of performance data collected on shared system components, such as global file systems, that cannot be mapped back to user activities immediately. Our approach classifies user jobs into classes based on their properties and correlates these classes with global timelines. In the paper we show details of the clustering algorithm, describe our measurement environment, and present first results. The results are valuable for tuning HPC storage systems to achieve optimized behavior at the global system level, or for separating users into classes with different I/O demands.
{"title":"Mapping of RAID Controller Performance Data to the Job History on Large Computing Systems","authors":"Marc Hartung, Michael Kluge","doi":"10.1109/DISCS.2014.7","DOIUrl":"https://doi.org/10.1109/DISCS.2014.7","url":null,"abstract":"For systems executing a mixture of different data intensive applications in parallel there is always the question about the impact that each application has on the storage subsystem. From the perspective of storage, I/O is typically anonymous as it does not contain user identifiers or similar information. This paper focuses on the analysis of performance data collected on shared system components like global file systems that can not be mapped back to user activities immediately. Our approach classifies user jobs based on their properties into classes and correlates these classes with global timelines. Within the paper we will show details of the clustering algorithm, depict our measurement environment and present first results. The results are valuable for tuning HPC storage system to achieve an optimized behavior on a global system level or to separate users into classes with different I/O demands.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117084940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

BPAR: A Bundle-Based Parallel Aggregation Framework for Decoupled I/O Execution
Teng Wang, K. Vasko, Zhuo Liu, Hui Chen, Weikuan Yu. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.6
In today's "Big Data" era, developers have adopted I/O techniques such as MPI-IO, Parallel NetCDF and HDF5 to garner enough performance to manage the vast amount of data that scientific applications require. These I/O techniques offer parallel access to shared datasets and together with a set of optimizations such as data sieving and two-phase I/O to boost I/O throughput. While most of these techniques focus on optimizing the access pattern on a single file or file extent, few of these techniques consider cross-file I/O optimizations. This paper aims to explore the potential benefit from cross-file I/O aggregation. We propose a Bundle-based PARallel Aggregation framework (BPAR) and design three partitioning schemes under such framework that targets at improving the I/O performance of a mission-critical application GEOS-5, as well as a broad range of other scientific applications. The results of our experiments reveal that BPAR can achieve on average 2.1× performance improvement over the baseline GEOS-5.
{"title":"BPAR: A Bundle-Based Parallel Aggregation Framework for Decoupled I/O Execution","authors":"Teng Wang, K. Vasko, Zhuo Liu, Hui Chen, Weikuan Yu","doi":"10.1109/DISCS.2014.6","DOIUrl":"https://doi.org/10.1109/DISCS.2014.6","url":null,"abstract":"In today's \"Big Data\" era, developers have adopted I/O techniques such as MPI-IO, Parallel NetCDF and HDF5 to garner enough performance to manage the vast amount of data that scientific applications require. These I/O techniques offer parallel access to shared datasets and together with a set of optimizations such as data sieving and two-phase I/O to boost I/O throughput. While most of these techniques focus on optimizing the access pattern on a single file or file extent, few of these techniques consider cross-file I/O optimizations. This paper aims to explore the potential benefit from cross-file I/O aggregation. We propose a Bundle-based PARallel Aggregation framework (BPAR) and design three partitioning schemes under such framework that targets at improving the I/O performance of a mission-critical application GEOS-5, as well as a broad range of other scientific applications. The results of our experiments reveal that BPAR can achieve on average 2.1× performance improvement over the baseline GEOS-5.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129058836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient, Failure Resilient Transactions for Parallel and Distributed Computing
J. Lofstead, Jai Dayal, I. Jimenez, C. Maltzahn. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.13

Scientific simulations are moving away from centralized persistent storage for intermediate data between workflow steps towards an all-online model. This shift is motivated by I/O bandwidth growing far more slowly than compute speed. The challenge this shift poses for Integrated Application Workflows is the loss of persistent-storage semantics for node-to-node communication. One step towards closing this semantics gap is using transactions to logically delineate a data set moving from 100,000s of processes to 1000s of servers as an atomic unit. Our previously demonstrated Doubly Distributed Transactions (D2T) protocol offered a high-performance solution, but did not explore how to detect and recover from faults; the focus was on demonstrating performance in the typical, fault-free case. The research presented here addresses fault detection and recovery based on an enhanced protocol design. The total overhead for a full transaction with multiple operations at 65,536 processes is on average 0.055 seconds. The fault detection and recovery mechanisms perform similarly to the success case, requiring only the addition of appropriate timeouts for the system. This paper explores the challenges of designing a recoverable protocol for doubly distributed transactions, particularly in parallel computing environments.
{"title":"Efficient, Failure Resilient Transactions for Parallel and Distributed Computing","authors":"J. Lofstead, Jai Dayal, I. Jimenez, C. Maltzahn","doi":"10.1109/DISCS.2014.13","DOIUrl":"https://doi.org/10.1109/DISCS.2014.13","url":null,"abstract":"Scientific simulations are moving away from using centralized persistent storage for intermediate data between workflow steps towards an all online model. This shift is motivated by the relatively slow IO bandwidth growth compared with compute speed increases. The challenges presented by this shift to Integrated Application Workflows are motivated by the loss of persistent storage semantics for node-to-node communication. One step towards addressing this semantics gap is using transactions to logically delineate a data set from 100,000s of processes to 1000s of servers as an atomic unit. Our previously demonstrated Doubly Distributed Transactions (D2T) protocol showed a high-performance solution, but had not explored how to detect and recover from faults. Instead, the focus was on demonstrating high-performance typical case performance. The research presented here addresses fault detection and recovery based on the enhanced protocol design. The total overhead for a full transaction with multiple operations at 65,536 processes is on average 0.055 seconds. Fault detection and recovery mechanisms demonstrate similar performance to the success case with only the addition of appropriate timeouts for the system. This paper explores the challenges in designing a recoverable protocol for doubly distributed transactions, particularly for parallel computing environments.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133574163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Distributed Multipath Routing Algorithm for Data Center Networks
Eun-Sung Jung, V. Vishwanath, R. Kettimuthu. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.14

Multipath routing has been studied in diverse contexts, such as wide-area networks and wireless networks, to minimize the finish time of data transfers or the latency of message sending. The fast adoption of cloud computing for various applications, including high-performance computing, has drawn more attention to efficient network utilization through adaptive or multipath routing. However, previous studies have not exploited multiple paths in an optimized way while also scaling to a large number of hosts, in part because of the high time complexity of their algorithms. In this paper, we propose a scalable distributed flow-scheduling algorithm that exploits multiple paths in data center networks. We develop the algorithm based on linear programming and evaluate it on FatTree topologies, one of several advanced data center network topologies. The results show that the distributed algorithm runs much faster than the centralized algorithm, while the data transfer finish times it produces stay within 10% of the centralized algorithm's.
{"title":"Distributed Multipath Routing Algorithm for Data Center Networks","authors":"Eun-Sung Jung, V. Vishwanath, R. Kettimuthu","doi":"10.1109/DISCS.2014.14","DOIUrl":"https://doi.org/10.1109/DISCS.2014.14","url":null,"abstract":"Multipath routing has been studied in diverse contexts such as wide-area networks and wireless networks in order to minimize the finish time of data transfer or the latency of message sending. The fast adoption of cloud computing for various applications including high-performance computing applications has drawn more attention to efficient network utilization through adaptive or multipath routing methods. However, the previous studies have not exploited multiple paths in an optimized way while scaling well with a large number of hosts for some reasons such as high time complexity of algorithms.In this paper, we propose a scalable distributed flow scheduling algorithm that can exploit multiple paths in data center networks. We develop our algorithm based on linear programming and evaluate the algorithm in FatTree network topologies, one of several advanced data center network topologies. The results show that the distributed algorithm performs much better than the centralized algorithm in terms of running time and is comparable to the centralized algorithm within 10% increased finish time in terms of data transfer time.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127984220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

PSA: A Performance and Space-Aware Data Layout Scheme for Hybrid Parallel File Systems
Shuibing He, Yan Liu, Xian-He Sun. 2014 International Workshop on Data Intensive Scalable Computing Systems. doi:10.1109/DISCS.2014.10

The underlying storage of a hybrid parallel file system (PFS) is composed of both SSD-based file servers (SServers) and HDD-based file servers (HServers). Unlike a traditional HServer, an SServer consistently provides higher storage performance but offers limited storage space. However, most current data layout schemes do not consider these differences in performance and space between heterogeneous servers, and may significantly degrade the performance of a hybrid PFS. In this paper, we propose PSA, a novel data layout scheme that maximizes hybrid PFS performance by applying adaptive, varied-size file stripes. PSA dispatches data to heterogeneous file servers based not only on storage performance but also on storage space. We have implemented PSA within OrangeFS, a popular parallel file system in the HPC domain. Our extensive experiments with a representative benchmark show that PSA provides higher I/O throughput than the default and performance-aware file data layout schemes.
{"title":"PSA: A Performance and Space-Aware Data Layout Scheme for Hybrid Parallel File Systems","authors":"Shuibing He, Yan Liu, Xian-He Sun","doi":"10.1109/DISCS.2014.10","DOIUrl":"https://doi.org/10.1109/DISCS.2014.10","url":null,"abstract":"The underlying storage of hybrid parallel file systems (PFS) is composed of both SSD-based file servers (SServer) and HDD-based file servers (HServer). Unlike a traditional HServer, an SServer consistently provides improved storage performance but lacks storage space. However, most current data layout schemes do not consider the differences in performance and space between heterogeneous servers, and may significantly degrade the performance of the hybrid PFSs. In this paper, we propose PSA, a novel data layout scheme, which maximizes the hybrid PFSs performance by applying adaptive varied-size file stripes. PSA dispatches data on heterogeneous file servers not only based on storage performance but also storage space. We have implemented PSA within OrangeFS, a popular parallel file system in the HPC domain. Our extensive experiments using a representative benchmark show that PSA provides superior I/O throughput than the default and performance-aware file data layout schemes.","PeriodicalId":278119,"journal":{"name":"2014 International Workshop on Data Intensive Scalable Computing Systems","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122578336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}