Modeling Gather and Scatter with Hardware Performance Counters for Xeon Phi

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | Pub Date: 2015-05-04 | DOI: 10.1109/CCGrid.2015.59

James Lin, Akira Nukada, S. Matsuoka

Pages: 713-716

Citations: 1

Abstract

Intel Initial Many-Core Instructions (IMCI) for Xeon Phi introduce hardware-implemented Gather and Scatter (G/S) instructions, which load and store the contents of SIMD registers from and to non-contiguous memory locations. However, G/S can be one of the key performance bottlenecks on Xeon Phi. Modeling G/S can provide insight into performance on Xeon Phi, but the existing solution requires a hand-written assembly implementation. We therefore modeled G/S with hardware performance counters, which can be profiled with tools such as PAPI. We profiled Address Generation Interlock (AGI) events as the number of G/S operations, estimated the average G/S latency with VPU_DATA_READ, and combined the two to model the total G/S latency. We applied the model to a 3D 7-point stencil, and the result showed that G/S accounted for nearly 40% of total kernel time. We also validated the model by implementing a G/S-free version with intrinsics. The contribution of this work is a performance model for G/S built with hardware counters. We believe the model is generally applicable to CPUs as well.
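The abstract describes the model only at a high level: a G/S count taken from AGI events, multiplied by an average per-operation latency, and compared against total kernel time. A minimal sketch of that arithmetic follows. It assumes the combination is a simple product, which the abstract suggests but does not spell out. The names `CounterSample`, `gs_total_cycles`, and `gs_time_fraction` are hypothetical, the derivation of the average latency from VPU_DATA_READ is not reproduced here (it is taken as an input), and the numbers in the usage example are made up for illustration and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical container for counter-derived values from one kernel run."""
    agi_events: int               # AGI event count, used as the number of G/S operations
    avg_gs_latency_cycles: float  # assumed average latency of one G/S, in cycles
    total_kernel_cycles: int      # total cycles spent in the kernel


def gs_total_cycles(sample: CounterSample) -> float:
    """Modeled total G/S latency: operation count times average latency (assumed form)."""
    return sample.agi_events * sample.avg_gs_latency_cycles


def gs_time_fraction(sample: CounterSample) -> float:
    """Share of kernel time attributed to G/S under the model above."""
    if sample.total_kernel_cycles <= 0:
        raise ValueError("total_kernel_cycles must be positive")
    return gs_total_cycles(sample) / sample.total_kernel_cycles


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    sample = CounterSample(
        agi_events=1_000_000,
        avg_gs_latency_cycles=8.0,
        total_kernel_cycles=20_000_000,
    )
    print(f"modeled G/S cycles: {gs_total_cycles(sample):.0f}")
    print(f"modeled G/S share of kernel time: {gs_time_fraction(sample):.1%}")
```

In practice the three inputs would come from a profiling run (for example through PAPI, as the abstract mentions); how the raw counter values map onto the average latency is the part of the model this sketch leaves open.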



