Heterogeneous platforms combine different types of processing units, and the key to using them efficiently is workload partitioning. Both static and dynamic partitioning strategies have been defined in previous work, but their applicability and performance differ significantly depending on the application to execute. In this paper, we propose an application-driven method to select the best partitioning strategy for a given workload. To this end, we define an application classification based on the application's kernel structure -- i.e., the number of kernels in the application and their execution flow. We also enable five different partitioning strategies, which mix the best features of both static and dynamic approaches. We further define a performance-driven ranking of all suitable strategies for each application class. Finally, we match the best partitioning to a given application by simply determining its class and selecting the best-ranked strategy for that class. We test the matchmaking on six representative applications and demonstrate that the defined performance ranking is correct. Moreover, by choosing the best-performing partitioning strategy, we can significantly improve application performance, leading to an average speedup of 3.0x/5.3x over Only-GPU/Only-CPU execution, respectively.
{"title":"Matchmaking Applications and Partitioning Strategies for Efficient Execution on Heterogeneous Platforms","authors":"Jie Shen, A. Varbanescu, X. Martorell, H. Sips","doi":"10.1109/ICPP.2015.65","DOIUrl":"https://doi.org/10.1109/ICPP.2015.65","url":null,"abstract":"Heterogeneous platforms are mixes of different processing units. The key factor to their efficient usage is workload partitioning. Both static and dynamic partitioning strategies have been defined in previous work, but their applicability and performance differ significantly depending on the application to execute. In this paper, we propose an application-driven method to select the best partitioning strategy for a given workload. To this end, we define an application classification based on the application kernel structure -- i.e., The number of kernels in the application and their execution flow. We also enable five different partitioning strategies, which mix the best features of both static and dynamic approaches. We further define the performance-driven ranking of all suitable strategies for each application class. Finally, we match the best partitioning to a given application by simply determining its class and selecting the best ranked strategy for that class. We test the matchmaking on six representative applications, and demonstrate that the defined performance ranking is correct. Moreover, by choosing the best performing partitioning strategy, we can significantly improve application performance, leading to average speedup of 3.0x/5.3x over the Only-GPU/Only-CPU execution, respectively.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127816564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Beckingsale, W. Gaudin, Andy Herdman, S. Jarvis
Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, introducing complex communication and data movement requirements into a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8-node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
{"title":"Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units","authors":"D. Beckingsale, W. Gaudin, Andy Herdman, S. Jarvis","doi":"10.1109/ICPP.2015.15","DOIUrl":"https://doi.org/10.1109/ICPP.2015.15","url":null,"abstract":"Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8 node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121370993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prateek Nagar, Fengguang Song, Luoding Zhu, Lan Lin
Deformable structures are abundant in various domains such as biology, medicine, life sciences, and ocean engineering. Our previous work introduced a numerical method, the LBM-IB method [1], for solving fluid-structure interaction (FSI) problems. The LBM-IB method is particularly suitable for simulating flexible (or elastic) structures immersed in a moving viscous fluid. Fluid-structure interaction problems are well known for their heavy demands on computing resources, and many real-world FSI problems remain challenging to resolve today. In order to solve large-scale fluid-structure interactions more efficiently, in this paper we design a parallel LBM-IB library for shared-memory manycore architectures. We start from a sequential version, which is extended to two different parallel versions. The paper first introduces the mathematical background of the LBM-IB method, then uses the sequential version as a basis for presenting our computational kernels and the algorithm. Next, it describes the two parallel programs: an OpenMP implementation and a cube-based parallel implementation using Pthreads. The cube-based implementation builds upon our new cube-centric algorithm, where all data are stored in cubes and computations are performed on individual cubes in a data-centric manner. By exploiting better data locality and fine-grained block parallelism, the cube-based parallel implementation outperforms the OpenMP implementation by up to 53% on 64-core computer systems.
{"title":"LBM-IB: A Parallel Library to Solve 3D Fluid-Structure Interaction Problems on Manycore Systems","authors":"Prateek Nagar, Fengguang Song, Luoding Zhu, Lan Lin","doi":"10.1109/ICPP.2015.14","DOIUrl":"https://doi.org/10.1109/ICPP.2015.14","url":null,"abstract":"Deformable structures are abundant in various domains such as biology, medicine, life sciences, and ocean engineering. Our previous work created a numerical method, named LBM-IB method [1], to solve the fluid-structure interaction (FSI) problems. Our LBM-IB method is particularly suitable for simulating flexible (or elastic) structures immersed in a moving viscous fluid. Fluid-structure interaction problems are well known for their heavy demands on computing resources. Today, it is still challenging to resolve many real-world FSI problems. In order to solve large-scale fluid-structure interactions more efficiently, in this paper, we design a parallel LBM-IB library on shared memory many core architectures. We start from a sequential version, which is extended to two different parallel versions. The paper first introduces the mathematical background of the LBM-IB method, then uses the sequential version as a ground to present our implemented computational kernels and the algorithm. Next, it describes the two parallel programs: an Open MP implementation and a cube-based parallel implementation using Pthreads. The cube-based implementation builds upon our new cube-centric algorithm where all the data are stored in cubes and computations are performed on individual cubes in a data-centric manner. By exploiting better data locality and fine-grain block parallelism, the cube-based parallel implementation is able to outperform the Open MP implementation by up to 53% on 64-core computer systems.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125161027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingxing Zhang, Yongwei Wu, Kang Chen, Weimin Zheng
With the growing prevalence of distributed systems, more and more applications require the ability to reliably transfer messages across a network. However, passing messages in a convenient and dependable way is both difficult and error prone, so existing messaging products usually suffer from numerous software bugs, and these bugs are particularly difficult to diagnose or avoid. To improve the methods for handling them, we need a better understanding of their characteristics. This paper provides the first (to the best of our knowledge) comprehensive characteristic study of message passing related bugs (MP-bugs). We have carefully examined the pattern, manifestation, fixing, and other characteristics of 349 randomly selected real-world MP-bugs from 3 representative open-source applications (Open MPI, ZeroMQ, and ActiveMQ). Surprisingly, we found that nearly 60% of the non-latent MP-bugs can be categorized into two simple patterns: message-level bugs and connection-level bugs, which suggests a promising direction for MP-bug detecting/tolerating tools. Beyond this finding, our study has also uncovered many new (and sometimes surprising) insights into the development process of message passing systems. The results should be useful for the design of corresponding bug detecting, exposing, and tolerating tools.
{"title":"What Is Wrong with the Transmission? A Comprehensive Study on Message Passing Related Bugs","authors":"Mingxing Zhang, Yongwei Wu, Kang Chen, Weimin Zheng","doi":"10.1109/ICPP.2015.50","DOIUrl":"https://doi.org/10.1109/ICPP.2015.50","url":null,"abstract":"Along with the prevalence of distributed systems, more and more applications require the ability of reliably transferring messages across a network. However, passing messages in a convenient and dependable way is both difficult and error prone. Thus the existing messaging products usually suffer from numerous software bugs. And these bugs are particularly difficult to be diagnosed or avoided. Therefore, in order to improve the methods for handling them, we need a better understanding of their characteristics. This paper provides the first (to the best of our knowledge)comprehensive characteristic study on message passing related bugs (MP-bugs). We have carefully examined the pattern, manifestation, fixing and other characteristics of 349 randomly selected real world MP-bugs from 3 representative open-source applications (Open MPI, Zero MQ, and Active MQ). Surprisingly, we found that nearly 60% of the non-latent MP-bugs can be categorised into two simple patterns: the message level bugs and the connection level bugs, which implies a promising perspective of detecting/tolerating tools for MP-bugs. Apart from this finding, our study have also uncovered many new (and sometimes surprising)insights of the message passing systems' developing process. The results should be useful for the design of corresponding bug detecting, exposing and tolerating tools.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131287677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the base infrastructure supporting various cloud services, data centers are drawing increasing attention from both academia and industry. A stable, effective, and robust data center network (DCN) management system is urgently needed by institutions and corporations. However, existing management schemes have several problems, including the difficulty of managing an entire network of heterogeneous components with a centralized controller, and short-sighted mechanisms for resource allocation, congestion control, and VM migration. In this paper, we design Sheriff, a distributed pre-alert and management scheme for DCNs. Sheriff is a regional, self-managing control scheme on the end-host side that balances network traffic and workload. It includes two phases: prediction and management. Each end host predicts possible overload and congestion using a prediction strategy based on ARIMA and neural network models, and issues an Alert message accordingly. Delegated local controllers then monitor their dominating regions and activate the localized VmMigration protocol to manage the network. We illustrate the prediction accuracy using network traces from a local data center service provider, examine the management efficiency through simulations on both Fat-Tree and BCube topologies, and prove that VmMigration is an approximation algorithm with ratio 3+2/p, where p is a constant predefined in the local search algorithm. Both numerical simulations and theoretical analysis validate the efficiency of our design. In all, Sheriff is a fast and effective scheme for improving DCN performance.
{"title":"Sheriff: A Regional Pre-alert Management Scheme in Data Center Networks","authors":"Xiaofeng Gao, Wen Xu, Fan Wu, Guihai Chen","doi":"10.1109/ICPP.2015.76","DOIUrl":"https://doi.org/10.1109/ICPP.2015.76","url":null,"abstract":"As the base infrastructure to support various cloud services, data center draws more and more attractions from both academia and industry. A stable, effective, and robust data center network (DCN) management system is urgently required from institutions and corporations. However, existing management schemes have several problems, including the difficulty to manage the entire network with heterogeneous network components by a centralized controller, and the short-sighted mechanism to deal with resource allocation, congestion control, and VM migration. In this paper, we design Sheriff: a distributed pre-alert and management scheme for DCN management. Sheriff is a regional self-automatic control scheme at end host side to balance network traffic and workload. It includes two phases: prediction and management. Each end-host predicts possible overload and congestion by prediction strategy based on ARIMA and Neural Network methodology, and perform an Alert message. Delegated local controllers then monitor their dominating region and activate localized protocols VmMigration to manage the network. We illustrate the predication accuracy by network traces from a local data center service provider, examine the management efficiency by simulations on both Fat-Tree topology and Bcube topology, and prove that VmMigration is an approximation with ratio 3+2/p where p is a constant predefined in local search algorithm. Both numerical simulations and theoretical analysis validate the efficiency of our design. In all, Sheriff is a fast and effective scheme to better improve the performance of DCN.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132785275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Failure detection is a crucial service for dependable distributed systems. Traditional failure detector implementations usually target homogeneous and static configurations, as their performance relies heavily on the connectivity of each network node. In this paper we propose a new approach towards the implementation of failure detectors for large and dynamic networks: we study reputation systems as a means to detect failures. The reputation mechanism allows efficient node cooperation via the sharing of views about other nodes. Our experimental results show that a simple prototype of a reputation-based detection service performs better than other known adaptive failure detectors, with improved flexibility. It can thus be used in a dynamic environment with a large and variable number of nodes.
{"title":"RepFD - Using Reputation Systems to Detect Failures in Large Dynamic Networks","authors":"M. Veron, O. Marin, Sébastien Monnet, Pierre Sens","doi":"10.1109/ICPP.2015.18","DOIUrl":"https://doi.org/10.1109/ICPP.2015.18","url":null,"abstract":"Failure detection is a crucial service for dependable distributed systems. Traditional failure detector implementations usually target homogeneous and static configurations, as their performance relies heavily on the connectivity of each network node. In this paper we propose a new approach towards the implementation of failure detectors for large and dynamic networks: we study reputation systems as a means to detect failures. The reputation mechanism allows efficient node cooperation via the sharing of views about other nodes. Our experimental results show that a simple prototype of a reputation-based detection service performs better than other known adaptive failure detectors, with improved flexibility. It can thus be used in a dynamic environment with a large and variable number of nodes.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134220022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since scale-up machines perform better for jobs with small and medium (KB, MB) data sizes while scale-out machines perform better for jobs with large (GB, TB) data sizes, and a workload usually consists of jobs at different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines -- which, however, is not trivial. The first challenge is workload data storage. Thousands of small jobs in a workload may overload the limited local disks of scale-up machines, and jobs on scale-up and scale-out machines may both request the same set of data, which leads to data transmission between the machines. The second challenge is to automatically schedule jobs to either the scale-up or the scale-out cluster to achieve the best performance. We conduct a thorough performance measurement of different applications on scale-up and scale-out clusters, configured with the Hadoop Distributed File System (HDFS) and a remote file system (i.e., OFS), respectively. We find that using OFS rather than HDFS solves the data storage challenge. We also identify the factors that determine the performance differences between the scale-up and scale-out clusters, and their cross points, which guide the scheduling choice. Accordingly, we design and implement the hybrid scale-up/out Hadoop architecture. Our trace-driven experimental results show that our hybrid architecture outperforms the traditional Hadoop architecture with both HDFS and OFS in terms of job completion time.
{"title":"Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements for High Application Performance","authors":"Zhuozhao Li, Haiying Shen","doi":"10.1109/ICPP.2015.11","DOIUrl":"https://doi.org/10.1109/ICPP.2015.11","url":null,"abstract":"Since scale-up machines perform better for jobs with small and median (KB, MB) data sizes while scale-out machines perform better for jobs with large (GB, TB) data size, and a workload usually consists of jobs with different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines, which however is not trivial. The first challenge is workload data storage. Thousands of small data size jobs in a workload may overload the limited local disks of scale-up machines. Jobs from scale-up and scale-out machines may both request the same set of data, which leads to data transmission between the machines. The second challenge is to automatically schedule jobs to either scale-up or scale-out cluster to achieve the best performance. We conduct a thorough performance measurement of different applications on scale-up and scale-out clusters, configured with Hadoop Distributed File System (HDFS) and a remote file system (i.e., OFS), respectively. We find that using OFS rather than HDFS can solve the data storage challenge. Also, we identify the factors that determine the performance differences on the scale-up and scale-out clusters and their cross points to make the choice. Accordingly, we design and implement the hybrid scale-up/out Hadoop architecture. Our trace-driven experimental results show that our hybrid architecture outperforms both the traditional Hadoop architecture with HDFS and with OFS in terms of job completion time.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"330 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134071961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persistent need for cache coherence. To achieve this, the memory subsystem has steadily gained complexity, evolving to levels beyond the comprehension of most application performance analysts. The Intel Haswell-EP architecture is such an example. It includes considerable advancements regarding memory hierarchy, on-chip communication, and cache coherence mechanisms compared to the previous generation. We have developed sophisticated benchmarks that allow us to perform in-depth investigations with full control over memory location and coherence state. Using these benchmarks, we investigate performance data and architectural properties of the Haswell-EP microarchitecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers. This furthers the understanding of such complex designs by documenting implementation details that are either not publicly available at all, or only indirectly documented through patents.
{"title":"Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture","authors":"Daniel Molka, D. Hackenberg, R. Schöne, W. Nagel","doi":"10.1109/ICPP.2015.83","DOIUrl":"https://doi.org/10.1109/ICPP.2015.83","url":null,"abstract":"A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persevering need for cache coherence. To achieve this, the memory subsystem steadily gains complexity that has evolved to levels beyond comprehension of most application performance analysts. The Intel Has well-EP architecture is such an example. It includes considerable advancements regarding memory hierarchy, on-chip communication, and cache coherence mechanisms compared to the previous generation. We have developed sophisticated benchmarks that allow us to perform in-depth investigations with full memory location and coherence state control. Using these benchmarks we investigate performance data and architectural properties of the Has well-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers. This allows us to further the understanding of such complex designs by documenting implementation details the are either not publicly available at all, or only indirectly documented through patents.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121667890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Wen-mei W. Hwu, Nicolás Guil Mata
In-place data manipulation is very desirable on many-core architectures with limited on-board memory. This paper deals with the in-place implementation of a class of primitives that perform data movements in one direction. We call these primitives Data Sliding (DS) algorithms. Notable among them are relational algebra primitives (such as select and unique), padding to insert empty elements into a data structure, and stream compaction to reduce memory requirements. Their in-place implementation in a bulk synchronous parallel model, such as GPUs, is especially challenging due to the difficulties in synchronizing threads executing on different compute units. Using a novel adjacent work-group synchronization technique, we propose two algorithmic schemes for regular and irregular DS algorithms. With a set of 5 benchmarks, we validate our approaches and compare them to the state-of-the-art implementations of these benchmarks. Our regular DS algorithms achieve up to 9.11x and 73.25x the throughput of their competitors on NVIDIA and AMD GPUs, respectively. Our irregular DS algorithms outperform the NVIDIA Thrust library by up to 3.24x on the three most recent generations of NVIDIA GPUs.
{"title":"In-Place Data Sliding Algorithms for Many-Core Architectures","authors":"Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Wen-mei W. Hwu, Nicolás Guil Mata","doi":"10.1109/ICPP.2015.30","DOIUrl":"https://doi.org/10.1109/ICPP.2015.30","url":null,"abstract":"In-place data manipulation is very desirable in many-core architectures with limited on-board memory. This paper deals with the in-place implementation of a class of primitives that perform data movements in one direction. We call these primitives Data Sliding (DS) algorithms. Notable among them are relational algebra primitives (such as select and unique), padding to insert empty elements in a data structure, and stream compaction to reduce memory requirements. Their in-place implementation in a bulk synchronous parallel model, such as GPUs, is specially challenging due to the difficulties in synchronizing threads executing on different compute units. Using a novel adjacent work-group synchronization technique, we propose two algorithmic schemes for regular and irregular DS algorithms. With a set of 5 benchmarks, we validate our approaches and compare them to the state-of-the-art implementations of these benchmarks. Our regular DS algorithms demonstrate up to 9.11x and 73.25x on NVIDIA and AMD GPUs, respectively, the throughput of their competitors. Our irregular DS algorithms outperform NVIDIA Thrust library by up to 3.24x on the three most recent generations of NVIDIA GPUs.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121768051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jacob Brock, Chencheng Ye, C. Ding, Yechen Li, Xiaolin Wang, Yingwei Luo
When a cache is shared by multiple cores, its space may be allocated by sharing, partitioning, or both. We call the last case partition-sharing. This paper studies partition-sharing as a general solution and presents a theory and a technique for optimizing it. The theory shows that the problem of partition-sharing is reducible to the problem of partitioning. The technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness. Finally, the paper evaluates the effect of optimal cache sharing and compares it with conventional solutions for thousands of 4-program co-run groups, with nearly 180 million different ways for each co-run group to share the cache. Optimal partition-sharing is on average 26% better than free-for-all sharing, and 98% better than equal partitioning. We also demonstrate the trade-off between optimal partitioning and fair partitioning.
{"title":"Optimal Cache Partition-Sharing","authors":"Jacob Brock, Chencheng Ye, C. Ding, Yechen Li, Xiaolin Wang, Yingwei Luo","doi":"10.1109/ICPP.2015.84","DOIUrl":"https://doi.org/10.1109/ICPP.2015.84","url":null,"abstract":"When a cache is shared by multiple cores, its space may be allocated either by sharing, partitioning, or both. We call the last case partition-sharing. This paper studies partition-sharing as a general solution, and presents a theory an technique for optimizing partition-sharing. We present a theory and a technique to optimize partition sharing. The theory shows that the problem of partition-sharing is reducible to the problem of partitioning. The technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness. Finally, the paper evaluates the effect of optimal cache sharing and compares it with conventional solutions for thousands of 4-program co-run groups, with nearly 180 million different ways to share the cache by each co-run group. Optimal partition-sharing is on average 26% better than free-for-all sharing, and 98% better than equal partitioning. We also demonstrate the trade-off between optimal partitioning and fair partitioning.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131289768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}