We address the problem of tuning the performance of the Java Virtual Machine (JVM) with run-time flags (parameters). We use the HotSpot JVM in our study. As the HotSpot JVM comes with over 600 flags to choose from, selecting a subset manually to maximize performance is infeasible. In prior work, the potential performance improvement is limited by the fact that only a subset of the tunable flags is tuned. We adopt a different approach and present the HotSpot Auto-tuner, which considers the entire JVM and the effect of all the flags. To the best of our knowledge, ours is the first auto-tuner for optimizing the performance of the JVM as a whole. We organize the JVM flags into a tree structure by building a flag hierarchy, which helps us resolve dependencies among aspects of the JVM such as garbage-collection algorithms and JIT compilation, and helps to reduce the configuration search space. Experiments with the SPECjvm2008 and DaCapo benchmarks show that we could optimize the HotSpot JVM with significant speedup: 16 SPECjvm2008 startup programs improved by an average of 19%, with three of them improving dramatically by 63%, 51%, and 32%, within a maximum tuning time of 200 minutes each. With a minimum tuning time of 200 minutes, the average performance improvement for 13 DaCapo benchmark programs is 26%, with 42% being the maximum improvement.
{"title":"Auto-Tuning the Java Virtual Machine","authors":"Sanath Jayasena, Milinda Fernando, Tharindu Rusira Patabandi, Chalitha Perera, C. Philips","doi":"10.1109/IPDPSW.2015.84","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.84","url":null,"abstract":"We address the problem of tuning the performance of the Java Virtual Machine (JVM) with run-time flags (parameters). We use the Hot Spot JVM in our study. As the Hot Spot JVM comes with over 600 flags to choose from, selecting a subset manually to maximize performance is infeasible. In prior work, the potential performance improvement is limited by the fact that only a subset of the tunable flags are tuned. We adopt a different approach and present the Hot Spot Auto-tuner which considers the entire JVM and the effect of all the flags. To the best of our knowledge, ours is the first auto-tuner for optimizing the performance of the JVM as a whole. We organize the JVM flags into a tree structure by building a flag-hierarchy, which helps us to resolve dependencies on aspects of the JVM such as garbage collector algorithms and JIT compilation, and helps to reduce the configuration search-space. Experiments with the SPECjvm2008 and DaCapo benchmarks show that we could optimize the Hot Spot JVM with significant speedup, 16 SPECjvm2008 startup programs were improved by an average of 19% with three of them improved dramatically by 63%, 51% and 32% within a maximum tuning time of 200 minutes for each. Based on a minimum tuning time of 200 minutes, average performance improvement for 13 DaCapo benchmark programs is 26% with 42% being the maximum improvement.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126519296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In Fall 2013, we began participation in the NSF/IEEE TCPP 2013 Early Adopters Program. This paper presents our efforts to incorporate parallel and distributed computing topics into our undergraduate computer science and engineering curriculum, guided by the IEEE-TCPP model curriculum. So far, the TCPP-recommended curriculum has been integrated into eight courses, and evaluations show that our integration effort has been successful. Evaluation also shows that practices such as lab and homework assignments effectively improve students' grasp of the concepts, and that the contest club is a necessary complement to the in-class courses.
{"title":"Integrating Parallel and Distributed Computing Topics into an Undergraduate CS Curriculum at UESTC","authors":"Guoming Lu, Jie Xu, Jieyan Liu, Bo Dai, Shenglin Gui, Siyu Zhan","doi":"10.1109/IPDPSW.2015.66","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.66","url":null,"abstract":"In Fall 2013, we began participation in NSF/IEEE TCPP 2013 Early Adopters Program. This paper presents our efforts to incorporate parallel and distributed computing topics into our undergraduate computer science and engineering curriculum with the guide of the IEEE-TCPP model Curriculum. So far, TCPP recommended curriculum has been integrated eight courses, evaluations show our integration effort is successful. Evaluation also shows that practices such as lab/homework assignment effective improve students' conception, and contest club is a necessary complementarity of in-class courses.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122485595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We have conducted a performance evaluation of a dual-rail Fourteen Data Rate (FDR) InfiniBand (IB) connected cluster, where each node has two Intel Xeon E5-2670 (Sandy Bridge) processors and two Intel Xeon Phi coprocessors. The Xeon Phi, based on the Many Integrated Core (MIC) architecture, is of the Knights Corner (KNC) generation. We used several types of benchmarks for the study. We ran the MPI and multi-zone versions of the NAS Parallel Benchmarks (NPB) -- both original and optimized for the Xeon Phi. Among the full-scale benchmarks, we ran two versions of WRF, including one optimized for the MIC, and used a 12 km Continental U.S. (CONUS) data set. We also used original and optimized versions of OVERFLOW and ran with four different datasets to understand scaling in symmetric mode and related load-balancing issues. We present performance for the four different modes of using the host + MIC combination: native host, native MIC, offload, and symmetric. We also discuss the various optimization techniques used in optimizing two of the NPBs for offload mode, as well as WRF and OVERFLOW. WRF 3.4 optimized for MIC runs 47% faster than the original NCAR WRF 3.4. The optimized version of OVERFLOW runs 18% faster on the host, and the load-balancing strategy used improves the performance on MIC by 5% to 36% depending on the data size. In addition, we discuss the issues related to offload mode and load balancing in symmetric mode.
{"title":"Early Multi-node Performance Evaluation of a Knights Corner (KNC) Based NASA Supercomputer","authors":"S. Saini, Haoqiang Jin, D. Jespersen, Samson Cheung, M. J. Djomehri, Johnny Chang, R. Hood","doi":"10.1109/IPDPSW.2015.140","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.140","url":null,"abstract":"We have conducted performance evaluation of a dual-rail Fourteen Data Rate (FDR) InfiniBand (IB) connected cluster, where each node has two Intel Xeon E5-2670 (Sandy Bridge) processors and two Intel Xeon Phi coprocessors. The Xeon Phi, based on the Many Integrated Core (MIC) architecture, is of the Knights Corner (KNC) generation. We used several types of benchmarks for the study. We ran the MPI and multi-zone versions of the NAS Parallel Benchmarks (NPB) -- both original and optimized for the Xeon Phi. Among the full-scale benchmarks, we ran two versions of WRF, including one optimized for the MIC, and used a 12 Km Continental U.S (CONUS) data set. We also used original and optimized versions of OVERFLOW and ran with four different datasets to understand scaling in symmetric mode and related load-balancing issues. We present performance for the four different modes of using the host + MIC combination: native host, native MIC, offload, and symmetric. We also discuss the various optimization techniques used in optimizing two of the NPBs for offload mode as well as WRF and OVERFLOW. WRF 3.4 optimized for MIC runs 47% faster than the original NCAR WRF 3.4. The optimized version of OVERFLOW runs 18% faster on the host and the load-balancing strategy used improves the performance on MIC by 5% to 36% depending on the data size. In addition, we discuss the issues related to offload mode and load balancing in symmetric mode.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122754939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we deal with the cloud brokering problem in the context of a multi-cloud infrastructure. The problem is by nature a multi-criteria optimization problem. The focus is put mainly (but not only) on the security/trust criterion, which is rarely considered in the literature. We use the well-known Promethee method to solve the problem, which is original in the context of cloud brokering. In other words, if we give a high priority to the secure deployment of a service, are we still able to satisfy all of the other required QoS constraints? Reciprocally, if we give a high priority to the RTT (Round-Trip Time) constraint to access the Cloud, are we still able to ensure a weak/medium/strong 'security level'? We decided to stay at a high level of abstraction for the problem formulation and to conduct experiments using 'real' data. We believe that the design of the solution and the simulation tool we introduce in the paper are practical, thanks to the Promethee approach, which has been used for more than 25 years but never, to our knowledge, for solving Cloud optimization problems. We expect that this study will be a first step towards better understanding, in the future, potential constraints in terms of control over external cloud services, in order to implement them in a simple manner. The contributions of the paper are the modeling of an optimization problem with security constraints, the solving of the problem with the Promethee method, and an experimental study that plays with multiple constraints to measure the impact of each constraint on the solution. During this process, we also provide a sensitivity analysis of the Consensus Assessments Initiative Questionnaire by the Cloud Security Alliance (CSA). The analysis deals with the variety, balance, and disparity of the questionnaire answers.
{"title":"The Promethee Method for Cloud Brokering with Trust and Assurance Criteria","authors":"C. Toinard, Timothee Ravier, C. Cérin, Yanik Ngoko","doi":"10.1109/IPDPSW.2015.63","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.63","url":null,"abstract":"In this paper we deal with the cloud brokering problem in the context of a multi-cloud infrastructure. The problem is by nature a multi-criterion optimization problem. The focus is put mainly (but not only) on the security/trust criterion which is rarely considered in the litterature. We use the well known Promethee method to solve the problem which is original in the context of cloud brokering. In other words, if we give a high priority to the secure deployment of a service, are we still able to satisfy all of the others required QoS constraints? Reciprocally, if we give a high priority to the RTT (Round-Trip Time) constraint to access the Cloud, are we still able to ensure a weak/medium/strong 'security level'? We decided to stay at a high level of abstraction for the problem formulation and to conduct experiments using 'real' data. We believe that the design of the solution and the simulation tool we introduce in the paper are practical, thanks to the Promethee approach that has been used for more than 25 years but never, to our knowledge, for solving Cloud optimization problems. We expect that this study will be a first step to better understand, in the future, potential constraints in terms of control over external cloud services in order to implement them in a simple manner. The contributions of the paper are the modeling of an optimization problem with security constraints, the problem solving with the Promethee method and an experimental study aiming to play with multiple constraints to measure the impact of each constraint on the solution. During this process, we also provide a sensitive analysis of the Consensus Assessments Initiative Questionnaire by the Cloud Security Alliance (CSA). The analysis deals with the variety, balance and disparity of the questionnaire answers.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124642362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power is increasingly the limiting factor in High Performance Computing (HPC). Growing core counts in each generation increase power and energy demands. In the future, strict power and energy budgets will be used to control the operating costs of supercomputer centers. Every node needs to use energy wisely. Energy efficiency can be improved either by taking less time or by running at lower power. In this paper, we use Dynamic Duty Cycle Modulation (DDCM) to improve energy efficiency by improving performance under a power bound. When the power is not capped, DDCM reduces processor power, saving energy and reducing processor temperature. DDCM allows the clock frequency to be controlled for each individual core with very low overhead. In any situation where the individual threads on a processor exhibit imbalance, a more balanced execution can be obtained by slowing the "fast" threads. We use the time between MPI collectives and the waiting time at each collective to determine a thread's "near optimal" frequency. All changes are within the MPI library, introducing no user code changes or additional communication/synchronization. To test DDCM, a set of synthetic MPI programs with load imbalance was created. In addition, two HPC MPI benchmarks with load imbalance were examined. In our experiments, DDCM saves up to 13.5% processor energy on one node and 20.8% on 16 nodes. By applying a power cap, DDCM effectively shifts power consumption between cores and improves overall performance. Performance improvements of 6.0% and 5.6% on one and 16 nodes, respectively, were observed.
{"title":"Using Dynamic Duty Cycle Modulation to Improve Energy Efficiency in High Performance Computing","authors":"Sridutt Bhalachandra, Allan Porterfield, J. Prins","doi":"10.1109/IPDPSW.2015.144","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.144","url":null,"abstract":"Power is increasingly the limiting factor in High Performance Computing (HPC). Growing core counts in each generation increase power and energy demands. In the future, strict power and energy budgets will be used to control the operating costs of supercomputer centers. Every node needs to use energy wisely. Energy efficiency can either be improved by taking less time or running at lower power. In this paper, we use Dynamic Duty Cycle Modulation (DDCM) to improve energy efficiency by improving performance under a power bound. When the power is not capped, DDCM reduces processor power, saving energy and reducing processor temperature. DDCM allows the clock frequency to be controlled for each individual core with very low overhead. Any situation where the individual threads on a processor are exhibiting imbalance, a more balanced execution can be obtained by slowing the \"fast\" threads. We use time between MPI collectives and the waiting time at the collective to determine a thread's \"near optimal\" frequency. All changes are within the MPI library, introducing no user code changes or additional communication/synchronization. To test DDCM, a set of synthetic MPI programs with load imbalance were created. In addition, a couple of HPC MPI benchmarks with load imbalance were examined. In our experiments, DDCM saves up to 13.5% processor energy on one node and 20.8% on 16 nodes. By applying a power cap, DDCM effectively shifts power consumption between cores and improves overall performance. Performance improvements of 6.0% and 5.6% on one and 16 nodes, respectively, were observed.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124687879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The dependability of a networked software system S is a measure of how well S meets its service-level objectives in the presence of uncontrolled external environment conditions incident on S. The dependability attribute is hard to determine due to the inherent complexity of S: how the behavior of S is affected by the external environment conditions and by network resource availability is difficult to capture accurately with mathematical models. The complexity arises from the large dimensionality of the input parameter space and the interactions between the various components in S. This leads to the employment of model-based control techniques that adapt the operations of S over multiple "observe-actuate" rounds and steer S towards a reference input QoS. How close the actual QoS so achieved is to the reference QoS is a measure of the dependability of S. Our paper studies model-based engineering methods to quantify the notion of dependability, and thereby enhance the dependability of S. The paper provides a management architecture to dynamically evaluate the dependability of S and to adjust the plant and algorithm parameters to control the dependability.
{"title":"Dependability Modeling and Assessment of Complex Adaptive Networked Systems","authors":"K. Ravindran","doi":"10.1109/IPDPSW.2015.142","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.142","url":null,"abstract":"The dependability of a networked software system S is a measure of how good S meets its service-level objectives in the presence of uncontrolled external environment conditions incident on S. The dependability attribute is hard to determine due to the inherent complexity of S, i.e., how the behavior of S is affected by the external environment conditions and the network resource availability is difficult to be accurately captured with mathematical models. The complexity arises from the large dimensionality of input parameter space and the interactions between various components in S. This leads to the employment of model-based control techniques that adapt the operations of S over multiple \"observe-actuate\" rounds and steer S towards a reference input QoS. How close is the actual QoS so achieved to the reference QoS is a measure of the dependability of S. Our paper studies model-based engineering methods to quantify the notion of dependability, and therein enhance the dependability of S. The paper provides a management architecture to dynamically evaluate the dependability of S, and adjust the plant and algorithm parameters to control the dependability.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123322742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we show how to analytically model two widely used distributed matrix-multiply algorithms, Cannon's 2D and Johnson's 3D, implemented within the Intel Concurrent Collections framework for shared/distributed-memory execution. Our precise analytical model proceeds by estimating the computation and communication times, taking into account factors such as the block size, communication bandwidth, and the processor's peak performance. It then applies a roofline-based approach to determine the running time from an estimate of the communication/computation bottleneck. Our models are validated by comparing the estimates to the measured run times while varying the problem size and work distribution, showing only marginal differences. We conclude by using our model to perform a predictive analysis of the impact of improving the computation speed by a factor of 4×.
{"title":"A Roofline-Based Performance Estimator for Distributed Matrix-Multiply on Intel CnC","authors":"Martin Kong, L. Pouchet, P. Sadayappan","doi":"10.1109/IPDPSW.2015.134","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.134","url":null,"abstract":"In this paper we show how to analytically model two widely used distributed matrix-multiply algorithms, Cannon's 2D and Johnson's 3D, implemented within the Intel Concurrent Collections framework for shared/distributed memory execution. Our precise analytical model proceeds by estimating the computation time and communication times, taking into account factors such as the block size, communication bandwidth, processor's peak performance, etc. It then applies a roofline-based approach to determine the running time based on communication/computation bottleneck estimation. Our models are validated by comparing the estimations to the measured run times varying the problem size and work distribution, showing only marginal differences. We conclude by using our model to perform a predictive analysis on the impact of improving the computation speed by a factor of 4×.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115128783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary form only given. Heterogeneous designs are becoming ubiquitous across many new system architectures as architects turn to accelerators to deliver increased system performance and capability. In order to realize the potential of these heterogeneous designs, a framework is needed to support applications in taking advantage of system capabilities. A suitable framework will depend on a combination of hardware primitives and software programming models. Together, hardware and software for heterogeneous multicore designs have to address the four "P"s of modern system design: Productivity, Portability, Performance, and Partitioning. To support current software deployment models, hardware primitives and programming models must ensure application isolation for partitioned virtual machine environments and application portability across a broad range of heterogeneous system design offerings. Hardware primitives and programming models must enable data center architects to build systems that continue scaling up performance, and application developers to develop their applications with the necessary productivity. The Coherent Accelerator Processor Interface (CAPI) provides the basis for such a framework as the integration point of accelerators into the POWER system architecture. CAPI accelerators can access application data directly using an integrated MMU. The CAPI MMU also provides partition isolation. Enabling accelerators to manage and pace their data access simplifies programming and prevents the CPUs from becoming serial bottlenecks. Finally, CAPI provides pointer identity, i.e., it enables the same address to be used in both the CPU and the accelerator to retrieve the same objects from memory. Pointer identity lays the foundation for high-performance and high-productivity accelerator programming models.
{"title":"PLC Keynote","authors":"M. Gschwind","doi":"10.1109/IPDPSW.2015.176","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.176","url":null,"abstract":"Summary form only given. Heterogeneous designs are becoming ubiquitous across many new systemarchitectures as architects are turning to accelerators to deliver increased system performance and capability. In order to realize the potential of these heterogeneous designs, a framework is needed to support applications to take advantage of system capabilities. A suitable framework will depend on a combination of hardware primitives and software programming models. Together, hardware and software for heterogeneous multicore designs have to address the four \"P\"s of modern system design: Productivity, Portability, Performance, and Partitioning. To support current software deployment models, hardware primitives and programming models must ensure application isolation for partitioned virtual machine environments and application portability across a broad range of heterogeneous system design offerings. Hardware primitives and programming models must enable data center architects to build systems that continue scaling up performance and application developers to develop their applications with the necessary productivity. The Coherent Accelerator Processor Interface (CAPI) provides the basis for such a framework as integration point of accelerators into the POWER system architecture. CAPI accelerators can access application data directly using an integrated MMU. The CAPI MMU also provides partition isolation. Enabling accelerators to manage and pace their data access simplifies programming and prevents the CPUs from becoming serial bottlenecks. Finally, CAPI provides pointer identify, i.e., it enables the same address to be used in both CPU and accelerator to retrieve the same objects from memory. Pointer identity lays the foundation for high performance and high productivity accelerator programming models.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in three-dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures enable high-bandwidth and low-latency access to memory, which is beneficial for memory-intensive applications. We build a performance model of a representative 3D memory integrated FPGA architecture for matrix multiplication. We derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through-Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS. The 3D memory integrated FPGA model achieves a peak performance of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe a marginal improvement in both throughput and energy efficiency. Our analysis indicates that the bottleneck is the FPGA, which dominates the total computation time and energy consumption. In addition to matrix multiplication, which requires O(m³) computation work, we also analyzed the class of applications that require O(m²) work. In particular, for matrix transposition we found that the improvement is on the order of 3× in energy consumption and 7× in runtime. This indicates that the computation cost of the application must match the memory access time in order to exploit the large bandwidth of 3D memory.
{"title":"Performance Modeling of Matrix Multiplication on 3D Memory Integrated FPGA","authors":"Shreyas G. Singapura, A. Panangadan, V. Prasanna","doi":"10.1109/IPDPSW.2015.133","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.133","url":null,"abstract":"Recent advances in three dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures enable high bandwidth and low latency access to memory which is beneficial for memory-intensive applications. We build a performance model of a representative 3D Memory Integrated FPGA architecture for matrix multiplication. We derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS/s. The 3D Memory Integrated FPGA model achieves a peak performance of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe a marginal improvement in both throughput and energy efficiency. Our analysis indicates that the bottleneck is the FPGA which dominates the total computation time and energy consumption. In addition to matrix multiplication, which requires O (m3) amount of computation work to be done, we also analyzed the class of applications which require O (m2) work. In particular, for matrix transposition we found out that the improvement is of the order 3× for energy consumption and 7× in runtime. This indicates that the computation cost of the application must match the memory access time in order to exploit the large bandwidth of 3D memory.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132327401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous architectures, with their diverse architectural features, impose significant programmability challenges. Existing programming systems involve non-trivial learning, are not productive or portable, and are challenging to tune for performance. In this paper, we introduce Heterogeneous Habanero-C (H2C), which is an implementation of the Habanero execution model for modern heterogeneous (CPU + GPU) architectures. The H2C language provides high-level constructs to specify the computation, communication, and synchronization in a given application. H2C also implements novel constructs for task partitioning and locality. The H2C (source-to-source) compiler and runtime framework efficiently map these high-level constructs onto the underlying heterogeneous platform, which can include multiple CPU cores and multiple GPU devices, possibly from different vendors. Experimental evaluations of four applications show significant improvements in productivity, portability, and performance.
{"title":"Heterogeneous Habanero-C (H2C): A Portable Programming Model for Heterogeneous Processors","authors":"Deepak Majeti, Vivek Sarkar","doi":"10.1109/IPDPSW.2015.81","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.81","url":null,"abstract":"Heterogeneous architectures with their diverse architectural features impose significant programmability challenges. Existing programming systems involve non-trivial learning and are not productive, not portable, and are challenging to tune for performance. In this paper, we introduce Heterogeneous Habanero-C (H2C), which is an implementation of the Habanero execution model for modern heterogeneous (CPU + GPU) architectures. The H2C language provides high-level constructs to specify the computation, communication, and synchronization in a given application. H2C also implements novel constructs for task partitioning and locality. The H2C (source-to-source) compiler and runtime framework efficiently map these high-level constructs onto the underlying heterogeneous platform, which can include multiple CPU cores and multiple GPU devices, possibly from different vendors. Experimental evaluations of four applications show significant improvements in productivity, portability, and performance.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116384732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}