
Latest publications: 2020 IEEE International Symposium on Workload Characterization (IISWC)

Organizing Committee : IISWC 2020
Pub Date : 2020-10-01 DOI: 10.1109/iiswc50251.2020.00008
Citations: 0
MATCH: An MPI Fault Tolerance Benchmark Suite
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00015
Luanzheng Guo, G. Georgakoudis, K. Parasyris, I. Laguna, Dong Li
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on hundreds to thousands of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation yields useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
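The global-restart idea behind Reinit-style recovery — roll all processes back to the last checkpoint and re-execute from there — can be illustrated with a toy model. This is a sketch only; the function, the step-counting failure model, and all numbers are invented here for illustration, not taken from MATCH:

```python
def run_with_reinit(total_steps, checkpoint_every, fail_at=None):
    """Toy model of Reinit-style global-restart recovery: on a failure,
    execution rolls back to the last checkpoint and re-executes.
    Returns the total number of steps performed, including redone work."""
    last_checkpoint = 0   # step at which the last checkpoint was taken
    executed = 0          # total work performed, including re-execution
    step = 0
    failed = False
    while step < total_steps:
        step += 1
        executed += 1
        if fail_at is not None and step == fail_at and not failed:
            failed = True
            step = last_checkpoint   # global restart from the checkpoint
            continue
        if step % checkpoint_every == 0:
            last_checkpoint = step   # take a checkpoint
    return executed
```

The re-executed work between the last checkpoint and the failure point is the recovery cost that checkpoint frequency (e.g., via FTI) trades against checkpointing overhead.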
Citations: 4
Empirical Analysis and Modeling of Compute Times of CNN Operations on AWS Cloud
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00026
Ubaid Ullah Hafeez, Anshul Gandhi
Given the widespread use of Convolutional Neural Networks (CNNs) in image classification applications, cloud providers now routinely offer several GPU-equipped instances with varying price points and hardware specifications. From a practitioner's perspective, given an arbitrary CNN, it is not obvious which GPU instance should be employed to minimize the model training time and/or rental cost. This paper presents Ceer, a model-driven approach to determine the optimal GPU instance(s) for any given CNN. Based on an operation-level empirical analysis of various CNNs, we develop regression models for heavy GPU operations (where input size is a key feature) and employ the sample median estimator for light GPU and CPU operations. To estimate the communication overhead between CPU and GPU(s), especially in the case of multi-GPU training, we develop a model that relates this communication overhead to the number of model parameters in the CNN. Evaluation results on AWS Cloud show that Ceer can accurately predict training time and cost (less than 5% average prediction error) across CNNs, enabling 36%–44% cost savings over simpler strategies that employ the cheapest or the latest generation GPU instances.
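The modeling recipe described above — regression on input size for heavy GPU operations, a sample-median estimate for light operations, and a parameter-count term for multi-GPU communication — can be sketched as follows. All sample values and the communication coefficient are invented placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical measured (input_size, time_ms) samples for one heavy GPU op.
sizes = np.array([1e5, 2e5, 4e5, 8e5])
times = np.array([1.2, 2.1, 4.0, 7.9])

# Heavy ops: linear regression with input size as the key feature.
slope, intercept = np.polyfit(sizes, times, 1)

def predict_heavy(input_size):
    return slope * input_size + intercept

# Light GPU/CPU ops: sample median estimator over repeated measurements.
light_samples = [0.11, 0.10, 0.12, 0.10, 0.11]
light_est = float(np.median(light_samples))

# Multi-GPU communication overhead modeled via the parameter count.
COMM_MS_PER_PARAM = 1e-6   # assumed coefficient, fit from measurements

def predict_step(heavy_sizes, n_light_ops, n_params, n_gpus):
    """Estimated time (ms) of one training step under this toy model."""
    t = sum(predict_heavy(s) for s in heavy_sizes) + n_light_ops * light_est
    if n_gpus > 1:
        t += COMM_MS_PER_PARAM * n_params
    return t
```

Summing such per-operation estimates across instance types is what lets a model-driven approach rank GPU instances by predicted time and rental cost.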
Citations: 6
Reconfigurable Accelerator Compute Hierarchy: A Case Study using Content-Based Image Retrieval
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00034
Nazanin Farahpour, Y. Hao, Zhenman Fang, Glenn D. Reinman
The recent adoption of reconfigurable hardware accelerators in data centers has significantly improved their computational power and energy efficiency for compute-intensive applications. However, for common communication-bound analytics workloads, these benefits are limited by the efficiency of data movement in the IO stack. For this reason, server architects are proposing a more data-centric acceleration scheme by moving the compute elements closer to the data. While prior studies focus on the benefits of Near Data Processing (NDP) solely on one level of the memory hierarchy (one of cache, main memory or storage), we focus on the collaboration of NDP accelerators at all levels and their collective benefits in accelerating an application pipeline. In this paper, we present a Reconfigurable Accelerator Compute Hierarchy (ReACH) that combines on-chip, near-memory, and near-storage accelerators. Each memory level has a reconfigurable accelerator chip attached to it, which provides distinct compute and memory capabilities and offers a broad spectrum of acceleration options. To enable effective acceleration on various application pipelines, we propose a holistic approach to coordinate between the compute levels, reducing inter-level data access interference and achieving asynchronous task flow control. To minimize the programming efforts of using the compute hierarchy, a uniform programming interface is designed to decouple the ReACH configuration from the user application source code and allow runtime adjustments without modifying the deployed application. We experimentally deploy a billion-scale Content-Based Image Retrieval (CBIR) system on ReACH. Simulation results demonstrate that a proper application mapping eliminates unnecessary data movement, and ReACH achieves 4.5x throughput gain while reducing energy consumption by 52% compared to conventional on-chip acceleration.
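The data-movement argument can be made concrete with a toy cost model, entirely illustrative (the level names and costs below are invented, not ReACH's actual parameters): mapping each pipeline stage onto the accelerator nearest its data avoids transfer costs that an all-on-chip scheme must pay.

```python
# Toy cost model: each pipeline stage reads data resident at some level of
# the memory hierarchy; running a stage at a different level pays a transfer
# cost proportional to the data size and the level distance.
LEVELS = ["on-chip", "near-memory", "near-storage"]

def movement_cost(stages, mapping):
    """stages: list of (data_level, data_mb); mapping: per-stage exec level."""
    cost = 0.0
    for (data_level, data_mb), exec_level in zip(stages, mapping):
        hops = abs(LEVELS.index(data_level) - LEVELS.index(exec_level))
        cost += hops * data_mb
    return cost

stages = [("near-storage", 1000.0), ("near-memory", 100.0), ("on-chip", 1.0)]
near_data = ["near-storage", "near-memory", "on-chip"]  # compute follows data
all_on_chip = ["on-chip"] * 3                           # conventional scheme
```

Under this model the near-data mapping incurs zero transfer cost, while the all-on-chip mapping pays to move the large storage-resident and memory-resident inputs up the hierarchy.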
Citations: 0
Reliability Modeling of NISQ-Era Quantum Computers
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00018
Ji Liu, Huiyang Zhou
Recent developments in quantum computers have been pushing up the number of qubits. However, the state-of-the-art Noisy Intermediate Scale Quantum (NISQ) computers still do not have enough qubits to accommodate the error correction circuit. Noise in quantum gates limits the reliability of quantum circuits. To characterize the noise effects, prior methods such as process tomography, gate set tomography, and randomized benchmarking have been proposed. However, the challenge is that these methods do not scale well with the number of qubits. Noise models based on an understanding of the underlying physics have also been proposed to study different kinds of noise in quantum computers. The difficulty is that there is no widely accepted noise model that incorporates all the different kinds of errors. Real-world errors can be very complicated, and producing accurate noise models remains an active area of research. In this paper, instead of using noise models to estimate the reliability, which is measured with success rates or inference strength, we treat the NISQ quantum computer as a black box. We use several quantum circuit characteristics such as the number of qubits, circuit depth, the number of CNOT gates, and the connection topology of the quantum computer as inputs to the black box and derive a reliability estimation model using (1) polynomial fitting and (2) a shallow neural network. We propose randomized benchmarks with random numbers of qubits and basic gates to generate a large data set for neural network training. We show that the estimated reliability from our black-box model outperforms the noise models from Qiskit. We also showcase that our black-box model can be used to guide quantum circuit optimization at compile time.
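The black-box fitting step can be sketched as follows — circuit features in, estimated success rate out. The feature values and success rates below are invented for illustration; the paper fits real benchmark measurements and, for polynomial fitting, may use higher degrees as well as a shallow neural network:

```python
import numpy as np

# Hypothetical benchmark results: (n_qubits, depth, n_cnots) -> success rate.
X = np.array([
    [2, 10,  4], [2, 20,  8], [3, 20, 12],
    [3, 40, 20], [5, 40, 30], [5, 80, 60],
], dtype=float)
y = np.array([0.95, 0.90, 0.85, 0.75, 0.65, 0.45])

# Degree-1 polynomial (linear) least-squares fit on the circuit features.
A = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_success(n_qubits, depth, n_cnots):
    """Estimated success rate for a circuit with the given features."""
    return float(np.array([n_qubits, depth, n_cnots, 1.0]) @ coef)
```

A compiler can then compare `predict_success` for candidate transpilations of the same circuit and keep the variant with the higher estimated reliability.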
Citations: 21
Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00031
Reet Barik, Marco Minutoli, M. Halappanavar, Nathan R. Tallent, A. Kalyanaraman
Vertex reordering is a way to improve locality in graph computations. Given an input (or “natural”) order, reordering aims to compute an alternate permutation of the vertices that maximizes a locality-based objective. Given decades of research on this topic, there are tens of graph reordering schemes, and there are also several linear arrangement “gap” measures for treatment as objectives. However, a comprehensive empirical analysis of the efficacy of the ordering schemes against the different gap measures, and against real-world applications, is currently lacking. In this study, we present an extensive empirical evaluation of up to 11 ordering schemes, taken from different classes of approaches, on a set of 34 real-world graphs emerging from different application domains. Our study is presented in two parts: a) a thorough comparative evaluation of the different ordering schemes on their effectiveness to optimize different linear arrangement gap measures, relevant to preserving locality; and b) extensive evaluation of the impact of the ordering schemes on two real-world, parallel graph applications, namely, community detection and influence maximization. Our studies show a significant divergence among the ordering schemes (up to 40x between the best and the worst) in their effectiveness to reduce the gap measures, and a wide-ranging impact of the ordering schemes on various aspects including application runtime (up to 4x), memory and cache use, load balancing, and parallel work and efficiency. The comparative study also helps in revealing the nuances of a parallel environment (compared to serial) on the ordering schemes and their role in optimizing applications.
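One of the simplest linear-arrangement gap measures — the mean edge gap |pos(u) − pos(v)| over all edges — makes the objective concrete. A minimal sketch (the example graph and orderings are invented; the paper evaluates several such measures):

```python
def avg_gap(edges, order):
    """Mean linear-arrangement gap |pos(u) - pos(v)| over the edges:
    one locality objective that a vertex reordering tries to minimize."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(abs(pos[u] - pos[v]) for u, v in edges) / len(edges)

# A path graph 0-1-2-3-4 supplied in a scrambled "natural" order.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
natural = [0, 2, 4, 1, 3]
reordered = [0, 1, 2, 3, 4]   # what a BFS-style reordering would recover
```

Here the reordered layout places every edge's endpoints adjacently (gap 1), while the scrambled order scatters them, which is exactly the locality loss that degrades cache behavior.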
Citations: 13
Keynote #1, Keynote #2 IISWC 2020
Pub Date : 2020-10-01 DOI: 10.1109/iiswc50251.2020.00037
Citations: 0
CPI for Runtime Performance Measurement: The Good, the Bad, and the Ugly
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00019
Li Yi, Cong Li, Jianmei Guo
Originally used for micro-architectural performance characterization, the metric of cycles per instruction (CPI) is now emerging as a proxy for workload performance measurement in runtime cloud environments. It has been used to evaluate the performance per workload before and after applying a system configuration change and to detect contentions on the micro-architectural resources in workload colocation. In this paper, we re-examine the use of CPI on two representative cloud computing workloads. An alternative metric, reference cycles per instruction (RCPI), is defined for comparison. We show that CPI is more sensitive than RCPI in identifying micro-architectural performance change in some cases. However, in the other cases with a different frequency scaling, we observe a better CPI value given a worse performance. We conjecture that both the observations are due to the bias of CPI towards scenarios with a low core frequency. We next demonstrate that a significant change in either CPI or RCPI does not necessarily indicate a boost or loss in performance, since both CPI and RCPI are dependent on workload intensities. It implies that the use of CPI without referring to the workload intensity is probably inappropriate. This provokes the discussion of the right way to use CPI, e.g., modeling CPI as a dependent variable given other relevant factors as the independent variables.
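The CPI-versus-RCPI distinction comes down to which clock is counted. A schematic example with made-up counter values (it mirrors the frequency-scaling pitfall described above; the numbers are not measurements from the paper):

```python
def cpi(cycles, instructions):
    """Cycles per instruction: counts actual core cycles, so the value
    shifts with core frequency scaling."""
    return cycles / instructions

def rcpi(ref_cycles, instructions):
    """Reference cycles per instruction: counts cycles of a fixed-rate
    reference clock, so it tracks wall-clock time per instruction."""
    return ref_cycles / instructions

instructions = 1_000_000
# High core frequency: a fixed-latency memory stall burns many core cycles.
cycles_hi, ref_hi = 2_500_000, 2_000_000
# Throttled core: the same stalls cost fewer core cycles, but wall time
# (and hence the reference-cycle count) grows -> performance is worse.
cycles_lo, ref_lo = 1_800_000, 2_600_000
```

With these counts, CPI improves under throttling while RCPI correctly degrades, which is the bias toward low-frequency scenarios the abstract conjectures.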
Citations: 4
Program Committee : IISWC 2020
Pub Date : 2020-10-01 DOI: 10.1109/iiswc50251.2020.00007
Citations: 0
High Frequency Performance Monitoring via Architectural Event Measurement
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00020
Chutitep Woralert, James Bruska, Chen Liu, Lok K. Yan
Obtaining detailed software execution information via hardware performance counters is a powerful analysis technique. The performance counters provide an effective method to monitor program behaviors; hence performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated, and improved on. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs have been able to provide performance counter monitoring, but with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this paper, we present K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that can produce precise, non-intrusive, low-overhead, periodic performance counter data using a kernel module based design. Our proposed approach has been evaluated on three different case studies to demonstrate its effectiveness, correctness, and efficiency. By moving the responsibility of timing to kernel space, K-LEB can gather periodic data at a 100μs rate, which is 100 times faster than other comparable performance counter monitoring approaches. At the same time, it reduces the monitoring overhead by at least 58.8%, and the difference between the recorded performance counter readings and those of other tools is less than 0.3%.
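The periodic-sampling idea can be sketched in user space. This is an illustration only: K-LEB itself reads hardware counters from a kernel module driven by a high-resolution timer to reach 100μs periods, which a user-space Python thread like this cannot approach; the counter here is a plain variable standing in for a hardware event count.

```python
import threading
import time

class PeriodicSampler:
    """Timer thread that reads a counter every `period` seconds and logs
    the sequence of readings (a lineage of event behavior over time)."""
    def __init__(self, read_counter, period):
        self.read_counter = read_counter
        self.period = period
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self.read_counter())
            self._stop.wait(self.period)   # sleep until the next sample

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

counter = {"value": 0}                     # stand-in for a hardware counter
s = PeriodicSampler(lambda: counter["value"], period=0.01)
s.start()
for _ in range(5):
    counter["value"] += 100                # simulated event activity
    time.sleep(0.02)
s.stop()
```

After `stop()`, `s.samples` holds a monotonically non-decreasing trace of counter readings; the spacing of samples is what kernel-space timing makes precise.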
{"title":"High Frequency Performance Monitoring via Architectural Event Measurement","authors":"Chutitep Woralert, James Bruska, Chen Liu, Lok K. Yan","doi":"10.1109/IISWC50251.2020.00020","DOIUrl":"https://doi.org/10.1109/IISWC50251.2020.00020","url":null,"abstract":"Obtaining detailed software execution information via hardware performance counters is a powerful analysis technique. The performance counters provide an effective method to monitor program behaviors; hence performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated and improved on. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs have been able to provide performance counter monitoring with inherited drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this paper, we present K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that can produce precise, non-intrusive, low overhead, periodic performance counter data using a kernel module based design. Our proposed approach has been evaluated on three different case studies to demonstrate its effectiveness, correctness and efficiency. By moving the responsibility of timing to kernel space, K-LEB can gather periodic data at a 100μs rate, which is 100 times faster than other comparable performance counter monitoring approaches. At the same time, it reduces the monitoring overhead by at least 58.8%, and the difference between the recorded performance counter readings and those of other tools are less than 0.3%.","PeriodicalId":365983,"journal":{"name":"2020 IEEE International Symposium on Workload Characterization (IISWC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128686835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
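K-LEB itself runs its timer inside a kernel module, whose source is not reproduced in this abstract. As a rough user-space illustration of the periodic-sampling pattern the abstract describes (a timer fires every 100μs and a counter snapshot is recorded), here is a sketch in Python; the function names, the fake counter, and the busy-wait timing are all illustrative assumptions, not K-LEB's actual API, and a user-space loop like this has far more jitter than a kernel timer, which is precisely the trade-off the paper targets.

```python
import time

def sample_counters(read_counter, period_s=1e-4, n_samples=100):
    """Record (timestamp, counter_value) pairs at a fixed sampling period.

    period_s=1e-4 mirrors the 100 microsecond rate quoted in the abstract.
    """
    samples = []
    next_tick = time.perf_counter()
    for _ in range(n_samples):
        samples.append((time.perf_counter(), read_counter()))
        next_tick += period_s
        # Busy-wait until the next tick: time.sleep() granularity is far
        # coarser than 100 microseconds on most systems.
        while time.perf_counter() < next_tick:
            pass
    return samples

# Stand-in for a hardware performance counter (illustrative only):
# a monotonically increasing event count.
state = {"events": 0}
def fake_counter():
    state["events"] += 7  # pretend 7 events occurred since the last read
    return state["events"]

trace = sample_counters(fake_counter)
print(len(trace), trace[0][1], trace[-1][1])  # prints: 100 7 700
```

A real implementation would replace `fake_counter` with reads of actual hardware counters (e.g. via Linux's `perf_event_open` interface), and K-LEB's contribution is doing the timing step in kernel space so the 100μs period holds with low overhead.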