
2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools): Latest Papers

CHAMPVis: Comparative Hierarchical Analysis of Microarchitectural Performance
Lillian Pentecost, Udit Gupta, Elisa Ngan, J. Beyer, Gu-Yeon Wei, D. Brooks, M. Behrisch
Performance analysis and optimization are essential tasks for hardware and software engineers. In the age of datacenter-scale computing, it is particularly important to conduct comparative performance analysis to understand discrepancies and limitations among different hardware systems and applications. However, there is a distinct lack of productive visualization tools for these comparisons. We present CHAMPVis, a web-based, interactive visualization tool that leverages the hierarchical organization of hardware systems to enable productive performance analysis. With CHAMPVis, users can make definitive performance comparisons across applications or hardware platforms. In addition, CHAMPVis provides methods to rank and cluster based on performance metrics to identify common optimization opportunities. Our thorough task analysis reveals three types of datacenter-scale performance analysis tasks: summarization, detailed comparative analysis, and interactive performance bottleneck identification. We propose techniques for each class of tasks, including (1) 1-D feature space projection for similarity analysis; (2) hierarchical parallel coordinates for comparative analysis; and (3) user interactions for rapid diagnostic queries to identify optimization targets. We evaluate CHAMPVis by analyzing standard datacenter applications and machine learning benchmarks in two different case studies.
DOI: https://doi.org/10.1109/ProTools49597.2019.00013 (published 2019-11-01)
Citations: 1
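The CHAMPVis abstract mentions a 1-D feature space projection for similarity analysis but does not say which projection is used. As a rough illustration only, the sketch below assumes a projection onto the first principal component, computed by power iteration. The function name `project_1d`, the application names, and the metric values are all hypothetical and are not taken from the paper.

```python
import math

def project_1d(rows, iters=200):
    """Project metric vectors onto their first principal component (assumed method)."""
    n, d = len(rows), len(rows[0])
    # Center each metric (column) so the projection reflects variation, not offset.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Power iteration on the covariance matrix, applied implicitly as X^T (X v).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all rows identical: nothing to project
        v = [x / norm for x in w]
    # The 1-D coordinate of each row is its dot product with the component.
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

# Hypothetical per-application metrics: [IPC, cache miss rate, branch mispredict rate].
apps = {
    "web_a": [1.8, 0.05, 0.02],
    "web_b": [1.7, 0.06, 0.02],
    "ml_a": [0.6, 0.30, 0.01],
    "ml_b": [0.7, 0.28, 0.01],
}
coords = project_1d(list(apps.values()))
for name, x in sorted(zip(apps, coords), key=lambda pair: pair[1]):
    print(f"{name}: {x:+.3f}")
```

Applications with similar metric profiles land close together on the resulting axis, which is what makes ranking and clustering possible. The sign of the axis is arbitrary, and metrics on very different scales would normally be standardized before projecting.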
In Situ Visualization of Performance Metrics in Multiple Domains
Allen R. Sanderson, John A. Schmidt, A. Humphrey, M. Papka, R. Sisneros
As application scientists develop and deploy simulation codes onto leadership-class computing resources, there is a need to instrument these codes to better understand their performance and use these resources efficiently. This instrumentation may come from independent third-party tools that generate and store performance metrics or from custom instrumentation tools built directly into the application. The metrics collected are then available for visual analysis, typically in the domain in which they were collected. In this paper, we introduce an approach to visualize and analyze the performance metrics in situ in the context of the machine, application, and communication domains (MAC model) using a single visualization tool. This visualization model provides a holistic view of the application performance in the context of the resources where it is executing.
DOI: https://doi.org/10.1109/ProTools49597.2019.00014 (published 2019-11-01)
Citations: 1
[Copyright notice]
DOI: https://doi.org/10.1109/protools49597.2019.00002 (published 2019-11-01)
Citations: 0
[Title page]
DOI: https://doi.org/10.1109/protools49597.2019.00001 (published 2019-11-01)
Citations: 0
Multi-Level Performance Instrumentation for Kokkos Applications Using TAU
S. Shende, Nicholas Chaimov, A. Malony, N. Imam
The TAU Performance System® provides a multi-level instrumentation strategy for Kokkos applications. Kokkos provides a performance-portable API for expressing parallelism at the node level. TAU uses the Kokkos profiling system to expose performance factors using user-specified parallel kernel names for lambda functions or C++ functors. It can also use instrumentation at the OpenMP, CUDA, pthread, or other runtime levels to expose the implementation details, giving a dual focus on higher-level abstractions as well as low-level execution dynamics. This multi-level instrumentation strategy adopted by TAU can highlight performance problems across multiple layers of the runtime system without modifying the application binary.
DOI: https://doi.org/10.1109/ProTools49597.2019.00012 (published 2019-11-01)
Citations: 5
Automatic Instrumentation Refinement for Empirical Performance Modeling
Jan-Patrick Lehr, A. Calotoiu, C. Bischof, F. Wolf
The analysis of runtime performance is important during the development and throughout the life cycle of HPC applications. One important objective in performance analysis is to identify regions in the code that show significant runtime increase with larger problem sizes or more processes. One approach to identify such regions is to use empirical performance modeling, i.e., building performance models based on measurements. While the modeling itself has already been streamlined and automated, the generation of the required measurements is time-consuming and tedious. In this paper, we propose an approach to automatically adjust the instrumentation to reduce overhead and focus the measurements on relevant regions, i.e., those that show increasing runtime with larger input parameters or an increasing number of MPI ranks. Our approach employs Extra-P to generate performance models, which it then uses to extrapolate runtime and, finally, decide which functions should be kept for measurement. Also, the analysis expands the instrumentation by heuristically adding functions based on static source-code features. We evaluate our approach using benchmarks from SPEC CPU 2006, SU2, and parallel MILC. The evaluation shows that our approach can filter functions of little interest and generate profiles that contain mostly relevant regions. For example, the overhead for SU2 can be improved automatically from 200% to 11% compared to filtered Score-P measurements.
DOI: https://doi.org/10.1109/ProTools49597.2019.00011 (published 2019-11-01)
Citations: 4
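The refinement loop described in the abstract above (fit a model per function, extrapolate, keep only the functions that matter at scale) can be sketched roughly as follows. This does not use Extra-P: a single-term power-law fit stands in for its model generator, and the selection rule (share of extrapolated runtime), the function names, and the measurements are all assumptions made for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**a in log-log space; returns (c, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    a = sum((u - mx) * (w - my) for u, w in zip(lx, ly)) / sxx
    c = math.exp(my - a * mx)
    return c, a

def select_functions(measurements, target, min_share=0.05):
    """Keep functions whose extrapolated runtime at `target` is a notable share of the total."""
    predicted = {}
    for name, (xs, ys) in measurements.items():
        c, a = fit_power_law(xs, ys)
        predicted[name] = c * target ** a  # extrapolate beyond the measured range
    total = sum(predicted.values())
    return sorted(name for name, t in predicted.items() if t / total >= min_share)

# Hypothetical runtimes (seconds) per function at 2, 4, 8, and 16 MPI ranks.
ranks = [2, 4, 8, 16]
measurements = {
    "solve": (ranks, [1.0, 2.0, 4.0, 8.0]),         # grows linearly with ranks
    "exchange": (ranks, [0.1, 0.2, 0.4, 0.8]),      # grows linearly, smaller constant
    "init": (ranks, [0.5, 0.5, 0.5, 0.5]),          # constant
    "get_index": (ranks, [0.01, 0.01, 0.01, 0.01]), # constant and tiny
}
print(select_functions(measurements, target=1024))
```

With these made-up numbers, the constant-cost functions fall below the 5% threshold at 1024 ranks and would be dropped from the instrumentation, while the two functions whose runtime grows with the rank count are kept.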
Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools
Qinglei Cao, Yu Pei, T. Hérault, Kadir Akbudak, A. Mikhalev, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra
This paper highlights the necessary development of new instrumentation tools within the PaRSEC task-based runtime system to leverage the performance of low-rank matrix computations. In particular, the tile low-rank (TLR) Cholesky factorization represents one of the most critical matrix operations toward solving challenging large-scale scientific applications. The challenge resides in the heterogeneous arithmetic intensity of the various computational kernels, which stresses PaRSEC's dynamic engine when orchestrating the task executions at runtime. Such irregular workload imposes the deployment of new scheduling heuristics to privilege the critical path, while exposing task parallelism to maximize hardware occupancy. To measure the effectiveness of PaRSEC's engine and its various scheduling strategies for tackling such workloads, it becomes paramount to implement adequate performance analysis and profiling tools tailored to fine-grained and heterogeneous task execution. This permits us not only to provide insights from PaRSEC, but also to identify potential application performance bottlenecks. These instrumentation tools may actually foster synergism between application and PaRSEC developers for productivity as well as high-performance computing purposes. We demonstrate the benefits of these amenable tools, while assessing the performance of TLR Cholesky factorization from data distribution, communication-reducing and synchronization-reducing perspectives. This tool-assisted performance analysis results in three major contributions: a new hybrid data distribution, a new hierarchical TLR Cholesky algorithm, and a new performance model for tuning the tile size. The new TLR Cholesky factorization achieves an 8X performance speedup over existing implementations on massively parallel supercomputers, toward solving large-scale 3D climate and weather prediction applications.
DOI: https://doi.org/10.1109/ProTools49597.2019.00009 (published 2019-11-01)
Citations: 13
The Case for a Common Instrumentation Interface for HPC Codes
David Boehme, K. Huck, Jonathan Madsen, J. Weidendorfer
Lightweight timekeeping functionality for basic performance logging, regression testing, and anomaly detection is essential in HPC codes. We present the Caliper, TiMemory, and PerfStubs libraries that have recently been developed as common solutions for these tasks. Lightweight, always-on profiling solutions are typically built around user-defined instrumentation points, which can benefit a variety of use cases beyond application timekeeping. We argue for the creation of a tool-agnostic adapter layer to make these instrumentation points available to third-party tools, runtime systems, and system software.
DOI: https://doi.org/10.1109/ProTools49597.2019.00010 (published 2019-11-01)
Citations: 8
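The tool-agnostic adapter layer argued for in the abstract above might look something like the following sketch: application code marks regions once, and any number of registered backends receive the begin and end events. This is not the Caliper, TiMemory, or PerfStubs API. Every name here (`register_backend`, `region`, `TimerBackend`, `TraceBackend`) is hypothetical.

```python
import time
from contextlib import contextmanager

_backends = []  # registered tool adapters (hypothetical registry)

def register_backend(backend):
    """Attach a tool; it must provide begin(name) and end(name)."""
    _backends.append(backend)

@contextmanager
def region(name):
    """User-defined instrumentation point, forwarded to every registered backend."""
    for b in _backends:
        b.begin(name)
    try:
        yield
    finally:
        # Notify in reverse order so backends see properly nested events.
        for b in reversed(_backends):
            b.end(name)

class TimerBackend:
    """Accumulates wall-clock time per region name."""
    def __init__(self):
        self.totals = {}
        self._starts = {}
    def begin(self, name):
        self._starts.setdefault(name, []).append(time.perf_counter())
    def end(self, name):
        elapsed = time.perf_counter() - self._starts[name].pop()
        self.totals[name] = self.totals.get(name, 0.0) + elapsed

class TraceBackend:
    """Records the raw sequence of begin/end events."""
    def __init__(self):
        self.events = []
    def begin(self, name):
        self.events.append(("begin", name))
    def end(self, name):
        self.events.append(("end", name))

timer, trace = TimerBackend(), TraceBackend()
register_backend(timer)
register_backend(trace)
with region("solve"):
    with region("assemble"):
        sum(range(1000))
print(trace.events)
print(sorted(timer.totals))
```

The application only ever calls `region(...)`. Swapping the timer for a tracer, or attaching both, requires no change to the instrumented code, which is the property the abstract argues for.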
Asvie: A Timing-Agnostic SVE Optimization Methodology
M. T. Cruz, Daniel Ruiz, Roxana Rusitoru
As we are quickly approaching exascale and moving onwards towards the next challenge, we are exploring a wider range of technologies and architectures. The further out the timeframes considered, the less likely prototype hardware is available. A popular method of exploring new architectural extensions is to emulate them on existing platforms. The Arm Instruction Emulator (ArmIE) is such a tool, which we use on existing Armv8 platforms to run Arm's latest vector architecture, the Scalable Vector Extension (SVE). To aid with porting applications towards SVE, we developed an application optimization methodology based on ArmIE that uses timing-agnostic metrics to assess application quality. We show how we have successfully optimized the High Performance Conjugate Gradient (HPCG) High Performance Computing benchmark to SVE by using our methodology, resulting in a hand-optimized intrinsics-based version.
DOI: https://doi.org/10.1109/ProTools49597.2019.00007 (published 2019-11-01)
Citations: 4
Designing Efficient Parallel Software via Compositional Performance Modeling
A. Calotoiu, Thomas Höhl, H. Mantel, Toni Nguyen, F. Wolf
Performance models are powerful instruments for understanding the performance of parallel systems and uncovering their bottlenecks. Already during system design, performance models can help ponder alternatives. However, creating a performance model - whether theoretically or empirically - for an entire application that does not exist yet is challenging unless the interactions between all system components are well understood, which is often not the case during design. In this paper, we propose to generate performance models of full programs from performance models of their components using formal composition operators derived from parallel design patterns such as pipeline or task pool. As long as the design of the overall system follows such a pattern, its performance model can be predicted with reasonable accuracy without an actual implementation.
DOI: https://doi.org/10.1109/ProTools49597.2019.00008 (published 2019-11-01)
Citations: 1
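To make the idea of composing component models through pattern-specific operators more concrete, here is a small sketch for the two patterns the abstract names, pipeline and task pool. The formulas are textbook idealizations that ignore communication and scheduling overhead. They are assumptions for illustration, not the composition operators defined in the paper, and the function names are hypothetical.

```python
import math

def pipeline(*stage_costs):
    """Compose per-item stage costs into a model of total time for n items."""
    def model(n):
        if n <= 0:
            return 0.0
        fill = sum(stage_costs)              # the first item traverses every stage
        steady = (n - 1) * max(stage_costs)  # later items exit at the bottleneck rate
        return fill + steady
    return model

def task_pool(task_cost, workers):
    """Model n equal, independent tasks executed in rounds by a fixed set of workers."""
    def model(n):
        return math.ceil(n / workers) * task_cost
    return model

# A middle stage that processes each item as 8 sub-tasks on 4 workers.
middle_stage_cost = task_pool(task_cost=0.5, workers=4)(8)    # 2 rounds * 0.5 = 1.0
full_program = pipeline(0.25, middle_stage_cost, 0.5)
print(full_program(100))  # 1.75 to fill the pipeline + 99 * 1.0 at the bottleneck = 100.75
```

Because each operator returns an ordinary function of the input size, the output of one composition can feed another, so a design can be evaluated before any component is implemented.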