
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
K. Chang, Rachata Ausavarungnirun, Chris Fallin, O. Mutlu
The network-on-chip (NoC) is a primary shared resource in a chip multiprocessor (CMP) system. As core counts continue to increase and applications become increasingly data-intensive, the network load will also increase, leading to more congestion in the network. This network congestion can degrade system performance if the network load is not appropriately controlled. Prior works have proposed source-throttling congestion control, which limits the rate at which new network traffic (packets) enters the NoC in order to reduce congestion and improve performance. These prior congestion control mechanisms have shortcomings that significantly limit their performance: either 1) they are not application-aware, but rather throttle all applications equally regardless of applications' sensitivity to latency, or 2) they are not network-load-aware, throttling according to application characteristics but sometimes under- or over-throttling the cores. In this work, we propose Heterogeneous Adaptive Throttling, or HAT, a new source-throttling congestion control mechanism based on two key principles: application-aware throttling and network-load-aware throttling rate adjustment. First, we observe that only network-bandwidth-intensive applications (those which use the network most heavily) should be throttled, allowing the other latency-sensitive applications to make faster progress without as much interference. Second, we observe that the throttling rate which yields the best performance varies between workloads: a single, static throttling rate under-throttles some workloads while over-throttling others. Hence, the throttling mechanism should observe network load dynamically and adjust its throttling rate accordingly. While some past works have also used a closed-loop control approach, none have been application-aware. HAT is the first mechanism to combine application-awareness and network-load-aware throttling rate adjustment to address congestion in a NoC. We evaluate HAT using a wide variety of multiprogrammed workloads on several NoC-based CMP systems with 16, 64, and 144 cores and compare its performance to two state-of-the-art congestion control mechanisms. Our evaluations show that HAT consistently provides higher system performance and fairness than prior congestion control mechanisms.
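To make the two principles concrete, here is a minimal sketch of a HAT-style control epoch in Python. It assumes each core exposes a network-intensity metric (MPKI is used here as a stand-in) and the NoC reports its measured utilization; the epoch length, intensity threshold, target load, and 5% rate step are all illustrative values, not the paper's tuned parameters.

```python
# Sketch of one HAT-style control epoch: application-aware victim selection
# plus network-load-aware rate adjustment. All constants are hypothetical.

EPOCH_CYCLES = 100_000      # invoke hat_epoch() once per this many cycles
INTENSITY_THRESHOLD = 5.0   # assumed MPKI cutoff for "bandwidth-intensive"
TARGET_LOAD = 0.65          # assumed target network utilization

def hat_epoch(cores, noc, throttle_rate):
    """Run one closed-loop epoch and return the updated throttling rate."""
    # 1) Application-aware: throttle only heavy network users, so
    #    latency-sensitive applications keep injecting at full rate.
    for core in cores:
        core.throttled = core.mpki() > INTENSITY_THRESHOLD

    # 2) Network-load-aware: steer measured load toward the target instead
    #    of applying a single static throttling rate to every workload.
    if noc.measured_utilization() > TARGET_LOAD:
        throttle_rate = min(0.95, throttle_rate + 0.05)  # throttle harder
    else:
        throttle_rate = max(0.0, throttle_rate - 0.05)   # relax throttling

    for core in cores:
        core.injection_limit = 1.0 - throttle_rate if core.throttled else 1.0
    return throttle_rate
```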
{"title":"HAT: Heterogeneous Adaptive Throttling for On-Chip Networks","authors":"K. Chang, Rachata Ausavarungnirun, Chris Fallin, O. Mutlu","doi":"10.1109/SBAC-PAD.2012.44","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.44","url":null,"abstract":"The network-on-chip (NoC) is a primary shared resource in a chip multiprocessor (CMP) system. As core counts continue to increase and applications become increasingly data-intensive, the network load will also increase, leading to more congestion in the network. This network congestion can degrade system performance if the network load is not appropriately controlled. Prior works have proposed source-throttling congestion control, which limits the rate at which new network traffic (packets) enters the NoC in order to reduce congestion and improve performance. These prior congestion control mechanisms have shortcomings that significantly limit their performance: either 1) they are not application-aware, but rather throttle all applications equally regardless of applications' sensitivity to latency, or 2) they are not network-load-aware, throttling according to application characteristics but sometimes under- or over-throttling the cores. In this work, we propose Heterogeneous Adaptive Throttling, or HAT, a new source-throttling congestion control mechanism based on two key principles: application-aware throttling and network-load-aware throttling rate adjustment. First, we observe that only network-bandwidth-intensive applications(those which use the network most heavily) should be throttled, allowing the other latency-sensitive applications to make faster progress without as much interference. Second, we observe that the throttling rate which yields the best performance varies between workloads, a single, static, throttling rate under-throttles some workloads while over-throttling others. Hence, the throttling mechanism should observe network load dynamically and adjust its throttling rate accordingly. While some past works have also used a closed-loop control approach, none have been application-aware. HAT is the first mechanism to combine application-awareness and network-load-aware throttling rate adjustment to address congestion in a NoC. We evaluate HAT using a wide variety of multiprogrammed workloads on several NoC-based CMP systems with 16-, 64-, and 144-cores and compare its performance to two state-of-the-art congestion control mechanisms. Our evaluations show that HAT consistently provides higher system performance and fairness than prior congestion control mechanisms.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123856578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 78
A Parallel Implementation of Gomory-Hu's Cut Tree Algorithm
Jaime Cohen, L. A. Rodrigues, E. P. Duarte
Cut trees are a compact representation of the edge-connectivity between every pair of vertices of an undirected graph, and have a large number of applications. In this work a parallel version of the well-known Gomory-Hu cut tree algorithm is presented. The parallel strategy is based on the master/slave model. The strategy is optimistic in the sense that the master process manipulates the tree being constructed while the slaves solve minimum s-t-cuts independently. Another version is proposed that employs a heuristic that enumerates all (up to a limit) of the minimum s-t-cuts in order to choose the most balanced one. The algorithm was implemented and extensive experimental results are presented, including a comparison with Gusfield's cut tree algorithm. Parallel versions of these algorithms have achieved significant speedups on real and synthetic graphs. We discuss the trade-offs between the two alternatives, each of which performs better depending on the characteristics of the input graph. In particular, the existence of balanced cuts clearly gives an advantage to Gomory-Hu's algorithm.
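For reference, the skeleton being parallelized resembles the Gusfield-style construction below (the contraction-free variant the authors compare against). Each iteration's minimum s-t cut is independent work once a snapshot of the tree is fixed, which is what the optimistic master/slave scheme exploits; min_st_cut is an assumed helper returning the cut value and the set of vertices on s's side.

```python
# Gusfield-style cut tree construction: a sketch, not the paper's exact code.
# The master owns `parent`/`flow`; in the parallel version the min_st_cut
# calls below are what the slave processes solve independently.

def cut_tree(vertices, min_st_cut):
    root = vertices[0]
    parent = {v: root for v in vertices[1:]}   # start from a star tree
    flow = {}
    for i, s in enumerate(vertices[1:], start=1):
        t = parent[s]
        cut_value, source_side = min_st_cut(s, t)   # source_side contains s
        flow[s] = cut_value
        # Re-hang later vertices that fell on s's side of the cut.
        for v in vertices[i + 1:]:
            if parent[v] == t and v in source_side:
                parent[v] = s
    return parent, flow   # tree edges (s, parent[s]) with weight flow[s]
```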
{"title":"A Parallel Implementation of Gomory-Hu's Cut Tree Algorithm","authors":"Jaime Cohen, L. A. Rodrigues, E. P. Duarte","doi":"10.1109/SBAC-PAD.2012.37","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.37","url":null,"abstract":"Cut trees are a compact representation of the edge-connectivity between every pair of vertices of an undirected graph, and have a large number of applications. In this work a parallel version of the well known Gomory-Hu cut tree algorithm is presented. The parallel strategy is based on the master/slave model. The strategy is optimistic in the sense that the master process manipulates the tree being constructed and the slaves solve minimum s-t-cuts independently. Another version is proposed that employs a heuristic that enumerates all (up to a limit) of the minimum s-t-cuts in order to choose the most balanced one. The algorithm was implemented and extensive experimental results are presented, including a comparison with Gusfieldâs cut tree algorithm. Parallel versions of these algorithms have achieved significant speedups on real and synthetic graphs. We discuss the trade-offs between the two alternatives, each of which presents better results given the characteristics of the input graph. In particular, the existence of balanced cuts clearly gives an advantage to Gomory-Huâsalgorithm.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators
A. Pedram, A. Gerstlauer, R. V. D. Geijn
Reducing power consumption and increasing efficiency are key concerns for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power-hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast-bus-based architecture with conventional SIMD, 2D-SIMD and flat register file organizations for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75× better power and up to 10× better area efficiency compared to traditional SIMD architectures.
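As an illustration of the collective pattern involved, the toy model below expresses GEMM as a sequence of rank-1 updates in which one column of A is broadcast along row buses and one row of B along column buses, so each conceptual PE consumes its two operands from the broadcast rather than from a multi-ported register file. This is a functional sketch of the communication pattern only, not a model of the proposed hardware.

```python
# GEMM via repeated row/column broadcasts: each step broadcasts one column
# of A and one row of B to all "PEs", which update C locally.

import numpy as np

def gemm_by_broadcast(A, B):
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for p in range(k):
        col = A[:, p]              # broadcast along each row bus
        row = B[p, :]              # broadcast along each column bus
        C += np.outer(col, row)    # every PE (i, j) computes col[i] * row[j]
    return C

assert np.allclose(gemm_by_broadcast(np.eye(3), np.ones((3, 3))), np.ones((3, 3)))
```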
{"title":"On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators","authors":"A. Pedram, A. Gerstlauer, R. V. D. Geijn","doi":"10.1109/SBAC-PAD.2012.35","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.35","url":null,"abstract":"Reducing power consumption and increasing efficiency is a key concern for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast bus based architecture with conventional SIMD, 2D-SIMD and flat register file for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75× better power and up to 10× better area efficiency compared to traditional SIMD architectures.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"12 s2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120845388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Low Overhead Instruction-Cache Modeling Using Instruction Reuse Profiles
Muneeb Khan, Andreas Sembrant, Erik Hagersten
Performance loss caused by L1 instruction cache misses varies between different architectures and cache sizes. For processors employing power-efficient in-order execution with small caches, performance can be significantly affected by instruction cache misses. The growing use of low-power multi-threaded CPUs (with shared L1 caches) in general purpose computing platforms requires new efficient techniques for analyzing application instruction cache usage. Such insight can be achieved using traditional simulation technologies modeling several cache sizes, but the overhead of simulators may be prohibitive for practical optimization usage. In this paper we present a statistical method to quickly model application instruction cache performance. Most importantly we propose a very low-overhead sampling mechanism to collect runtime data from the application's instruction stream. This data is fed to the statistical model which accurately estimates the instruction cache miss ratio for the sampled execution. Our sampling method is about 10x faster than previously suggested sampling approaches, with average runtime overhead as low as 25% over native execution. The architecturally-independent data collected is used to accurately model miss ratio for several cache sizes simultaneously, with average absolute error of 0.2%. Finally, we show how our tool can be used to identify program phases with large instruction cache footprint. Such phases can then be targeted to optimize for reduced code footprint.
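A heavily simplified version of this kind of profile-driven estimation is sketched below: from a (possibly sampled) instruction address stream, compute the stack (reuse) distance of each cache-line access and estimate, for each cache size, the fraction of accesses whose distance reaches the cache's capacity in lines. This textbook fully associative LRU model only shows why one profile can serve several cache sizes at once; the paper's statistical estimator is considerably more sophisticated.

```python
# Toy miss-ratio model from a reuse profile: one pass over the address
# stream yields miss ratios for every cache size simultaneously.

from collections import OrderedDict

def miss_ratios(instr_addresses, line_size, cache_sizes_in_lines):
    stack = OrderedDict()             # distinct lines, most recently used last
    distances = []
    for addr in instr_addresses:
        line = addr // line_size
        if line in stack:
            keys = list(stack.keys())
            distances.append(len(keys) - 1 - keys.index(line))
            del stack[line]           # re-inserted below as most recent
        else:
            distances.append(float("inf"))   # cold miss at any size
        stack[line] = True
    total = len(distances)
    # A fully associative LRU cache of s lines misses iff distance >= s.
    return {s: sum(d >= s for d in distances) / total
            for s in cache_sizes_in_lines}
```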
{"title":"Low Overhead Instruction-Cache Modeling Using Instruction Reuse Profiles","authors":"Muneeb Khan, Andreas Sembrant, Erik Hagersten","doi":"10.1109/SBAC-PAD.2012.25","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.25","url":null,"abstract":"Performance loss caused by L1 instruction cache misses varies between different architectures and cache sizes. For processors employing power-efficient in-order execution with small caches, performance can be significantly affected by instruction cache misses. The growing use of low-power multi-threaded CPUs (with shared L1 caches) in general purpose computing platforms requires new efficient techniques for analyzing application instruction cache usage. Such insight can be achieved using traditional simulation technologies modeling several cache sizes, but the overhead of simulators may be prohibitive for practical optimization usage. In this paper we present a statistical method to quickly model application instruction cache performance. Most importantly we propose a very low-overhead sampling mechanism to collect runtime data from the application's instruction stream. This data is fed to the statistical model which accurately estimates the instruction cache miss ratio for the sampled execution. Our sampling method is about 10x faster than previously suggested sampling approaches, with average runtime overhead as low as 25% over native execution. The architecturally-independent data collected is used to accurately model miss ratio for several cache sizes simultaneously, with average absolute error of 0.2%. Finally, we show how our tool can be used to identify program phases with large instruction cache footprint. Such phases can then be targeted to optimize for reduced code footprint.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126294422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Runtime Procedure for Energy Savings in Applications with Point-to-Point Communications
Vaibhav Sundriyal, M. Sosonkina, A. Gaenko
Although high-performance computing has always been about efficient application execution, both energy and power consumption have become critical concerns owing to their effect on operating costs and failure rates of large-scale computing platforms. Modern microprocessors are equipped with the capabilities to reduce their power consumption using techniques such as dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (called throttling). Without careful application, however, DVFS and throttling may cause a significant performance loss due to system overhead. This work presents design considerations for a runtime procedure that dynamically analyzes blocking point-to-point communications, groups them according to the proposed criteria, and applies frequency scaling by analyzing both communication and architectural parameters without penalizing the performance much. Experiments performed on the NAS parallel benchmarks verify the proposed design by exhibiting energy savings of as much as 11% with a performance loss as low as 2%.
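The core decision such a procedure makes can be stated in a few lines: when a group of blocking point-to-point calls makes a phase communication-bound, only the compute portion stretches as the clock slows, so the runtime can pick the lowest frequency whose predicted slowdown stays within a slack budget. The sketch below is a schematic of that reasoning; the frequency list and the 5% budget are invented, and the paper's actual procedure also weighs architectural parameters and clock throttling.

```python
# Pick the lowest frequency whose predicted phase time stays within budget.
# FREQS_GHZ and MAX_SLOWDOWN are hypothetical, not the paper's values.

FREQS_GHZ = [2.6, 2.2, 1.8, 1.4]   # available P-states, highest first
MAX_SLOWDOWN = 0.05                # tolerated performance loss

def pick_frequency(comm_time, total_time):
    compute_time = total_time - comm_time
    best = FREQS_GHZ[0]
    for f in FREQS_GHZ[1:]:
        # Compute scales inversely with frequency; waiting on messages does not.
        predicted = compute_time * (FREQS_GHZ[0] / f) + comm_time
        if predicted <= total_time * (1 + MAX_SLOWDOWN):
            best = f               # a lower frequency still fits the budget
    return best
```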
{"title":"Runtime Procedure for Energy Savings in Applications with Point-to-Point Communications","authors":"Vaibhav Sundriyal, M. Sosonkina, A. Gaenko","doi":"10.1109/SBAC-PAD.2012.20","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.20","url":null,"abstract":"Although high-performance computing has always been about efficient application execution, both energy and power consumption have become critical concerns owing to their effect on operating costs and failure rates of large-scale computing platforms. Modern microprocessors are equipped with the capabilities to reduce their power consumption using techniques such as dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (called throttling). Without careful application, however, DVFS and throttling may cause a significant performance loss due to system overhead. This work presents design considerations for a runtime procedure that dynamically analyzes blocking point-to-point communications, groups them according to the proposed criteria, and applies frequency scaling by analyzing both communication and architectural parameters without penalizing the performance much. Experiments, performed on NAS parallel benchmarks verify the proposed design by exhibiting energy savings of as much as 11% with a performance loss as low as 2%.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131787313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Scalable Triadic Analysis of Large-Scale Graphs: Multi-core vs. Multi-processor vs. Multi-threaded Shared Memory Architectures
George Chin, A. Márquez, Sutanay Choudhury, J. Feo
Triadic analysis encompasses a useful set of graph mining methods that are centered on the concept of a triad, which is a subgraph of three nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to execute efficiently on shared memory architectures. We then conducted performance evaluations of the parallel triad census algorithm on three specific systems: a Cray XMT, an HP Superdome, and an AMD multi-core NUMA machine. These three systems have shared memory architectures but markedly different hardware capabilities for managing parallelism.
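For readers unfamiliar with the underlying operation, the brute-force sketch below counts an undirected census over the four undirected triad classes (empty, one edge, two-edge path, triangle); the full directed census used in social network analysis distinguishes 16 classes, and scalable codes avoid this O(n^3) triple enumeration entirely. The example only illustrates what a triad census tabulates.

```python
# Brute-force undirected triad census, indexed by edge count (0..3).

from itertools import combinations

def triad_census(n, edges):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    census = [0, 0, 0, 0]
    for a, b, c in combinations(range(n), 3):
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        census[k] += 1
    return census

print(triad_census(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))  # -> [0, 1, 2, 1]
```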
{"title":"Scalable Triadic Analysis of Large-Scale Graphs: Multi-core vs. Multi-processor vs. Multi-threaded Shared Memory Architectures","authors":"George Chin, A. Márquez, Sutanay Choudhury, J. Feo","doi":"10.1109/SBAC-PAD.2012.39","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.39","url":null,"abstract":"Triadic analysis encompasses a useful set of graph mining methods that are centered on the concept of a triad, which is a sub graph of three nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We then conducted performance evaluations of the parallel triad census algorithm on three specific systems: CrayXMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122491288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4