Heterogeneous platforms combine different types of processing units, and the key to using them efficiently is workload partitioning. Both static and dynamic partitioning strategies have been defined in previous work, but their applicability and performance differ significantly depending on the application to execute. In this paper, we propose an application-driven method to select the best partitioning strategy for a given workload. To this end, we define an application classification based on the application's kernel structure -- i.e., the number of kernels in the application and their execution flow. We also enable five different partitioning strategies, which mix the best features of both static and dynamic approaches. We further define a performance-driven ranking of all suitable strategies for each application class. Finally, we match the best partitioning to a given application by simply determining its class and selecting the best-ranked strategy for that class. We test the matchmaking on six representative applications and demonstrate that the defined performance ranking is correct. Moreover, by choosing the best-performing partitioning strategy, we can significantly improve application performance, leading to an average speedup of 3.0x/5.3x over Only-GPU/Only-CPU execution, respectively.
{"title":"Matchmaking Applications and Partitioning Strategies for Efficient Execution on Heterogeneous Platforms","authors":"Jie Shen, A. Varbanescu, X. Martorell, H. Sips","doi":"10.1109/ICPP.2015.65","DOIUrl":"https://doi.org/10.1109/ICPP.2015.65","url":null,"abstract":"Heterogeneous platforms are mixes of different processing units. The key factor to their efficient usage is workload partitioning. Both static and dynamic partitioning strategies have been defined in previous work, but their applicability and performance differ significantly depending on the application to execute. In this paper, we propose an application-driven method to select the best partitioning strategy for a given workload. To this end, we define an application classification based on the application kernel structure -- i.e., The number of kernels in the application and their execution flow. We also enable five different partitioning strategies, which mix the best features of both static and dynamic approaches. We further define the performance-driven ranking of all suitable strategies for each application class. Finally, we match the best partitioning to a given application by simply determining its class and selecting the best ranked strategy for that class. We test the matchmaking on six representative applications, and demonstrate that the defined performance ranking is correct. Moreover, by choosing the best performing partitioning strategy, we can significantly improve application performance, leading to average speedup of 3.0x/5.3x over the Only-GPU/Only-CPU execution, respectively.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127816564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Beckingsale, W. Gaudin, Andy Herdman, S. Jarvis
Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, introducing complex communication and data movement requirements into a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8-node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
{"title":"Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units","authors":"D. Beckingsale, W. Gaudin, Andy Herdman, S. Jarvis","doi":"10.1109/ICPP.2015.15","DOIUrl":"https://doi.org/10.1109/ICPP.2015.15","url":null,"abstract":"Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8 node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121370993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prateek Nagar, Fengguang Song, Luoding Zhu, Lan Lin
Deformable structures are abundant in various domains such as biology, medicine, life sciences, and ocean engineering. Our previous work introduced a numerical method, the LBM-IB method [1], for solving fluid-structure interaction (FSI) problems. The LBM-IB method is particularly suitable for simulating flexible (or elastic) structures immersed in a moving viscous fluid. Fluid-structure interaction problems are well known for their heavy demands on computing resources, and many real-world FSI problems remain challenging to resolve today. In order to solve large-scale fluid-structure interactions more efficiently, in this paper we design a parallel LBM-IB library for shared-memory manycore architectures. We start from a sequential version, which is extended to two different parallel versions. The paper first introduces the mathematical background of the LBM-IB method, then uses the sequential version as a basis for presenting our computational kernels and the algorithm. Next, it describes the two parallel programs: an OpenMP implementation and a cube-based parallel implementation using Pthreads. The cube-based implementation builds upon our new cube-centric algorithm, where all data are stored in cubes and computations are performed on individual cubes in a data-centric manner. By exploiting better data locality and fine-grained block parallelism, the cube-based parallel implementation outperforms the OpenMP implementation by up to 53% on 64-core computer systems.
{"title":"LBM-IB: A Parallel Library to Solve 3D Fluid-Structure Interaction Problems on Manycore Systems","authors":"Prateek Nagar, Fengguang Song, Luoding Zhu, Lan Lin","doi":"10.1109/ICPP.2015.14","DOIUrl":"https://doi.org/10.1109/ICPP.2015.14","url":null,"abstract":"Deformable structures are abundant in various domains such as biology, medicine, life sciences, and ocean engineering. Our previous work created a numerical method, named LBM-IB method [1], to solve the fluid-structure interaction (FSI) problems. Our LBM-IB method is particularly suitable for simulating flexible (or elastic) structures immersed in a moving viscous fluid. Fluid-structure interaction problems are well known for their heavy demands on computing resources. Today, it is still challenging to resolve many real-world FSI problems. In order to solve large-scale fluid-structure interactions more efficiently, in this paper, we design a parallel LBM-IB library on shared memory many core architectures. We start from a sequential version, which is extended to two different parallel versions. The paper first introduces the mathematical background of the LBM-IB method, then uses the sequential version as a ground to present our implemented computational kernels and the algorithm. Next, it describes the two parallel programs: an Open MP implementation and a cube-based parallel implementation using Pthreads. The cube-based implementation builds upon our new cube-centric algorithm where all the data are stored in cubes and computations are performed on individual cubes in a data-centric manner. By exploiting better data locality and fine-grain block parallelism, the cube-based parallel implementation is able to outperform the Open MP implementation by up to 53% on 64-core computer systems.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125161027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingxing Zhang, Yongwei Wu, Kang Chen, Weimin Zheng
With the growing prevalence of distributed systems, more and more applications require the ability to reliably transfer messages across a network. However, passing messages in a convenient and dependable way is both difficult and error prone, so existing messaging products usually suffer from numerous software bugs, and these bugs are particularly difficult to diagnose or avoid. To improve the methods for handling them, we need a better understanding of their characteristics. This paper provides the first (to the best of our knowledge) comprehensive characteristic study of message passing related bugs (MP-bugs). We have carefully examined the pattern, manifestation, fixing, and other characteristics of 349 randomly selected real-world MP-bugs from 3 representative open-source applications (Open MPI, ZeroMQ, and ActiveMQ). Surprisingly, we found that nearly 60% of the non-latent MP-bugs can be categorized into two simple patterns: message-level bugs and connection-level bugs, which suggests a promising direction for MP-bug detecting/tolerating tools. Beyond this finding, our study has also uncovered many new (and sometimes surprising) insights into the development process of message passing systems. The results should be useful for the design of corresponding bug detecting, exposing, and tolerating tools.
{"title":"What Is Wrong with the Transmission? A Comprehensive Study on Message Passing Related Bugs","authors":"Mingxing Zhang, Yongwei Wu, Kang Chen, Weimin Zheng","doi":"10.1109/ICPP.2015.50","DOIUrl":"https://doi.org/10.1109/ICPP.2015.50","url":null,"abstract":"Along with the prevalence of distributed systems, more and more applications require the ability of reliably transferring messages across a network. However, passing messages in a convenient and dependable way is both difficult and error prone. Thus the existing messaging products usually suffer from numerous software bugs. And these bugs are particularly difficult to be diagnosed or avoided. Therefore, in order to improve the methods for handling them, we need a better understanding of their characteristics. This paper provides the first (to the best of our knowledge)comprehensive characteristic study on message passing related bugs (MP-bugs). We have carefully examined the pattern, manifestation, fixing and other characteristics of 349 randomly selected real world MP-bugs from 3 representative open-source applications (Open MPI, Zero MQ, and Active MQ). Surprisingly, we found that nearly 60% of the non-latent MP-bugs can be categorised into two simple patterns: the message level bugs and the connection level bugs, which implies a promising perspective of detecting/tolerating tools for MP-bugs. Apart from this finding, our study have also uncovered many new (and sometimes surprising)insights of the message passing systems' developing process. The results should be useful for the design of corresponding bug detecting, exposing and tolerating tools.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131287677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the base infrastructure supporting various cloud services, data centers are drawing increasing attention from both academia and industry. A stable, effective, and robust data center network (DCN) management system is urgently needed by institutions and corporations. However, existing management schemes have several problems, including the difficulty of managing an entire network of heterogeneous components with a centralized controller, and short-sighted mechanisms for resource allocation, congestion control, and VM migration. In this paper, we design Sheriff, a distributed pre-alert and management scheme for DCNs. Sheriff is a regional, self-managing control scheme on the end-host side that balances network traffic and workload. It includes two phases: prediction and management. Each end host predicts possible overload and congestion using a prediction strategy based on ARIMA and neural network models, and issues an Alert message accordingly. Delegated local controllers then monitor their dominating regions and activate the localized VmMigration protocol to manage the network. We illustrate the prediction accuracy using network traces from a local data center service provider, examine the management efficiency through simulations on both Fat-Tree and BCube topologies, and prove that VmMigration is an approximation algorithm with ratio 3+2/p, where p is a constant predefined in the local search algorithm. Both numerical simulations and theoretical analysis validate the efficiency of our design. In all, Sheriff is a fast and effective scheme for improving DCN performance.
{"title":"Sheriff: A Regional Pre-alert Management Scheme in Data Center Networks","authors":"Xiaofeng Gao, Wen Xu, Fan Wu, Guihai Chen","doi":"10.1109/ICPP.2015.76","DOIUrl":"https://doi.org/10.1109/ICPP.2015.76","url":null,"abstract":"As the base infrastructure to support various cloud services, data center draws more and more attractions from both academia and industry. A stable, effective, and robust data center network (DCN) management system is urgently required from institutions and corporations. However, existing management schemes have several problems, including the difficulty to manage the entire network with heterogeneous network components by a centralized controller, and the short-sighted mechanism to deal with resource allocation, congestion control, and VM migration. In this paper, we design Sheriff: a distributed pre-alert and management scheme for DCN management. Sheriff is a regional self-automatic control scheme at end host side to balance network traffic and workload. It includes two phases: prediction and management. Each end-host predicts possible overload and congestion by prediction strategy based on ARIMA and Neural Network methodology, and perform an Alert message. Delegated local controllers then monitor their dominating region and activate localized protocols VmMigration to manage the network. We illustrate the predication accuracy by network traces from a local data center service provider, examine the management efficiency by simulations on both Fat-Tree topology and Bcube topology, and prove that VmMigration is an approximation with ratio 3+2/p where p is a constant predefined in local search algorithm. Both numerical simulations and theoretical analysis validate the efficiency of our design. In all, Sheriff is a fast and effective scheme to better improve the performance of DCN.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132785275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Failure detection is a crucial service for dependable distributed systems. Traditional failure detector implementations usually target homogeneous and static configurations, as their performance relies heavily on the connectivity of each network node. In this paper we propose a new approach towards the implementation of failure detectors for large and dynamic networks: we study reputation systems as a means to detect failures. The reputation mechanism allows efficient node cooperation via the sharing of views about other nodes. Our experimental results show that a simple prototype of a reputation-based detection service performs better than other known adaptive failure detectors, with improved flexibility. It can thus be used in a dynamic environment with a large and variable number of nodes.
{"title":"RepFD - Using Reputation Systems to Detect Failures in Large Dynamic Networks","authors":"M. Veron, O. Marin, Sébastien Monnet, Pierre Sens","doi":"10.1109/ICPP.2015.18","DOIUrl":"https://doi.org/10.1109/ICPP.2015.18","url":null,"abstract":"Failure detection is a crucial service for dependable distributed systems. Traditional failure detector implementations usually target homogeneous and static configurations, as their performance relies heavily on the connectivity of each network node. In this paper we propose a new approach towards the implementation of failure detectors for large and dynamic networks: we study reputation systems as a means to detect failures. The reputation mechanism allows efficient node cooperation via the sharing of views about other nodes. Our experimental results show that a simple prototype of a reputation-based detection service performs better than other known adaptive failure detectors, with improved flexibility. It can thus be used in a dynamic environment with a large and variable number of nodes.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134220022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since scale-up machines perform better for jobs with small and medium (KB, MB) data sizes while scale-out machines perform better for jobs with large (GB, TB) data sizes, and a workload usually consists of jobs at different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines -- which, however, is not trivial. The first challenge is workload data storage. Thousands of small jobs in a workload may overload the limited local disks of scale-up machines, and jobs on scale-up and scale-out machines may both request the same set of data, which leads to data transmission between the machines. The second challenge is to automatically schedule jobs to either the scale-up or the scale-out cluster to achieve the best performance. We conduct a thorough performance measurement of different applications on scale-up and scale-out clusters, configured with the Hadoop Distributed File System (HDFS) and a remote file system (i.e., OFS), respectively. We find that using OFS rather than HDFS solves the data storage challenge. We also identify the factors that determine the performance differences between the scale-up and scale-out clusters, and their cross points, which guide the scheduling choice. Accordingly, we design and implement the hybrid scale-up/out Hadoop architecture. Our trace-driven experimental results show that our hybrid architecture outperforms the traditional Hadoop architecture with both HDFS and OFS in terms of job completion time.
{"title":"Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements for High Application Performance","authors":"Zhuozhao Li, Haiying Shen","doi":"10.1109/ICPP.2015.11","DOIUrl":"https://doi.org/10.1109/ICPP.2015.11","url":null,"abstract":"Since scale-up machines perform better for jobs with small and median (KB, MB) data sizes while scale-out machines perform better for jobs with large (GB, TB) data size, and a workload usually consists of jobs with different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines, which however is not trivial. The first challenge is workload data storage. Thousands of small data size jobs in a workload may overload the limited local disks of scale-up machines. Jobs from scale-up and scale-out machines may both request the same set of data, which leads to data transmission between the machines. The second challenge is to automatically schedule jobs to either scale-up or scale-out cluster to achieve the best performance. We conduct a thorough performance measurement of different applications on scale-up and scale-out clusters, configured with Hadoop Distributed File System (HDFS) and a remote file system (i.e., OFS), respectively. We find that using OFS rather than HDFS can solve the data storage challenge. Also, we identify the factors that determine the performance differences on the scale-up and scale-out clusters and their cross points to make the choice. Accordingly, we design and implement the hybrid scale-up/out Hadoop architecture. Our trace-driven experimental results show that our hybrid architecture outperforms both the traditional Hadoop architecture with HDFS and with OFS in terms of job completion time.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"330 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134071961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persistent need for cache coherence. To achieve this, the memory subsystem has steadily gained complexity, evolving to levels beyond the comprehension of most application performance analysts. The Intel Haswell-EP architecture is such an example. It includes considerable advancements regarding memory hierarchy, on-chip communication, and cache coherence mechanisms compared to the previous generation. We have developed sophisticated benchmarks that allow us to perform in-depth investigations with full control over memory location and coherence state. Using these benchmarks, we investigate performance data and architectural properties of the Haswell-EP microarchitecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers. This furthers the understanding of such complex designs by documenting implementation details that are either not publicly available at all, or only indirectly documented through patents.
{"title":"Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture","authors":"Daniel Molka, D. Hackenberg, R. Schöne, W. Nagel","doi":"10.1109/ICPP.2015.83","DOIUrl":"https://doi.org/10.1109/ICPP.2015.83","url":null,"abstract":"A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persevering need for cache coherence. To achieve this, the memory subsystem steadily gains complexity that has evolved to levels beyond comprehension of most application performance analysts. The Intel Has well-EP architecture is such an example. It includes considerable advancements regarding memory hierarchy, on-chip communication, and cache coherence mechanisms compared to the previous generation. We have developed sophisticated benchmarks that allow us to perform in-depth investigations with full memory location and coherence state control. Using these benchmarks we investigate performance data and architectural properties of the Has well-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers. This allows us to further the understanding of such complex designs by documenting implementation details the are either not publicly available at all, or only indirectly documented through patents.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121667890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Wen-mei W. Hwu, Nicolás Guil Mata
In-place data manipulation is very desirable on many-core architectures with limited on-board memory. This paper deals with the in-place implementation of a class of primitives that perform data movements in one direction. We call these primitives Data Sliding (DS) algorithms. Notable among them are relational algebra primitives (such as select and unique), padding to insert empty elements into a data structure, and stream compaction to reduce memory requirements. Their in-place implementation in a bulk synchronous parallel model, such as GPUs, is especially challenging due to the difficulties in synchronizing threads executing on different compute units. Using a novel adjacent work-group synchronization technique, we propose two algorithmic schemes for regular and irregular DS algorithms. With a set of 5 benchmarks, we validate our approaches and compare them to the state-of-the-art implementations of these benchmarks. Our regular DS algorithms achieve up to 9.11x and 73.25x the throughput of their competitors on NVIDIA and AMD GPUs, respectively. Our irregular DS algorithms outperform the NVIDIA Thrust library by up to 3.24x on the three most recent generations of NVIDIA GPUs.
{"title":"In-Place Data Sliding Algorithms for Many-Core Architectures","authors":"Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Wen-mei W. Hwu, Nicolás Guil Mata","doi":"10.1109/ICPP.2015.30","DOIUrl":"https://doi.org/10.1109/ICPP.2015.30","url":null,"abstract":"In-place data manipulation is very desirable in many-core architectures with limited on-board memory. This paper deals with the in-place implementation of a class of primitives that perform data movements in one direction. We call these primitives Data Sliding (DS) algorithms. Notable among them are relational algebra primitives (such as select and unique), padding to insert empty elements in a data structure, and stream compaction to reduce memory requirements. Their in-place implementation in a bulk synchronous parallel model, such as GPUs, is specially challenging due to the difficulties in synchronizing threads executing on different compute units. Using a novel adjacent work-group synchronization technique, we propose two algorithmic schemes for regular and irregular DS algorithms. With a set of 5 benchmarks, we validate our approaches and compare them to the state-of-the-art implementations of these benchmarks. Our regular DS algorithms demonstrate up to 9.11x and 73.25x on NVIDIA and AMD GPUs, respectively, the throughput of their competitors. Our irregular DS algorithms outperform NVIDIA Thrust library by up to 3.24x on the three most recent generations of NVIDIA GPUs.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121768051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jacob Brock, Chencheng Ye, C. Ding, Yechen Li, Xiaolin Wang, Yingwei Luo
When a cache is shared by multiple cores, its space may be allocated by sharing, partitioning, or both. We call the last case partition-sharing. This paper studies partition-sharing as a general solution and presents a theory and a technique for optimizing it. The theory shows that the problem of partition-sharing is reducible to the problem of partitioning. The technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness. Finally, the paper evaluates the effect of optimal cache sharing and compares it with conventional solutions for thousands of 4-program co-run groups, with nearly 180 million different ways for each co-run group to share the cache. Optimal partition-sharing is on average 26% better than free-for-all sharing, and 98% better than equal partitioning. We also demonstrate the trade-off between optimal partitioning and fair partitioning.
{"title":"Optimal Cache Partition-Sharing","authors":"Jacob Brock, Chencheng Ye, C. Ding, Yechen Li, Xiaolin Wang, Yingwei Luo","doi":"10.1109/ICPP.2015.84","DOIUrl":"https://doi.org/10.1109/ICPP.2015.84","url":null,"abstract":"When a cache is shared by multiple cores, its space may be allocated either by sharing, partitioning, or both. We call the last case partition-sharing. This paper studies partition-sharing as a general solution, and presents a theory an technique for optimizing partition-sharing. We present a theory and a technique to optimize partition sharing. The theory shows that the problem of partition-sharing is reducible to the problem of partitioning. The technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness. Finally, the paper evaluates the effect of optimal cache sharing and compares it with conventional solutions for thousands of 4-program co-run groups, with nearly 180 million different ways to share the cache by each co-run group. Optimal partition-sharing is on average 26% better than free-for-all sharing, and 98% better than equal partitioning. We also demonstrate the trade-off between optimal partitioning and fair partitioning.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131289768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}