2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer
Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra
To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on the fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window, taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach improves the Energy Delay Product (EDP) by as much as 27.2%, and by 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.
{"title":"Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer","authors":"Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra","doi":"10.1109/PACT.2011.19","DOIUrl":"https://doi.org/10.1109/PACT.2011.19","url":null,"abstract":"To improve energy efficiency processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvement of the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130775310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
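To make the phase-prediction idea in the abstract above concrete, here is a minimal sketch of a table-based phase predictor driving per-phase frequency choices by remembered EDP. The phase signatures, frequency levels, and cost model are all illustrative assumptions, not the paper's actual SCC controller.

```python
# Sketch of a table-based phase predictor selecting a DVFS level per phase.
# Signatures, frequency levels, and the energy/delay model are assumptions.

FREQ_LEVELS = [0.4, 0.6, 0.8, 1.0]  # normalized frequency settings

class PhasePredictor:
    def __init__(self):
        self.best_freq = {}  # phase signature -> best frequency seen so far
        self.best_edp = {}

    def predict(self, signature):
        # Default to full speed for phases we have not seen yet.
        return self.best_freq.get(signature, 1.0)

    def update(self, signature, freq, energy, delay):
        edp = energy * delay
        if signature not in self.best_edp or edp < self.best_edp[signature]:
            self.best_edp[signature] = edp
            self.best_freq[signature] = freq

def run_phase(signature, freq):
    # Toy cost model: compute-bound phases scale with frequency, MPI wait
    # phases barely do; power has a static part plus a ~f^3 dynamic part.
    compute_bound = signature[0] == "compute"
    delay = 1.0 / freq if compute_bound else 1.0 / (0.3 + 0.7 * freq)
    energy = (0.2 + freq ** 3) * delay
    return energy, delay

predictor = PhasePredictor()
trace = [("compute", 64), ("mpi_wait", 1024)] * 8  # repeating phase pattern
for sig in trace:
    # Explore all levels on first sight of a phase, then exploit the table.
    candidates = FREQ_LEVELS if sig not in predictor.best_freq else [predictor.predict(sig)]
    for f in candidates:
        energy, delay = run_phase(sig, f)
        predictor.update(sig, f, energy, delay)

for sig, f in predictor.best_freq.items():
    print(f"phase {sig}: predicted best frequency {f}")
```

Under this toy model the communication-bound phase settles at a low frequency while the compute-bound phase stays near full speed, which is the behavior the hierarchical controller exploits at its DVFS scheduling points.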
Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs
N. Gulur, R. Manikantan, R. Govindarajan, M. Mehendale
In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a reorganization of the existing single large row buffer in a DRAM bank into multiple smaller row buffers. The proposed configuration helps improve row hit rates and also brings down the energy required for row activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing, widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5%, and 21.6% in quad-, eight-, and sixteen-core workloads, along with a 42%, 28%, and 31% reduction in DRAM energy, respectively. Additionally, we introduce a Need Based Allocation scheme for buffer management that shows additional performance improvement.
{"title":"Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs","authors":"N. Gulur, R. Manikantan, R. Govindarajan, M. Mehendale","doi":"10.1109/PACT.2011.34","DOIUrl":"https://doi.org/10.1109/PACT.2011.34","url":null,"abstract":"In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a re-organization of the existing single large row buffer in a DRAM bank into multiple smaller row-buffers. The proposed configuration helps improve the row hit rates and also brings down the energy required for row-activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5% and 21.6% in quad, eight and sixteen core workloads along with a 42%, 28% and 31% reduction in DRAM energy. Additionally, we introduce a Need Based Allocation scheme for buffer management that shows additional performance improvement.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"293 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131822126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
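The effect described in the abstract above can be shown with a toy bank model: a single open row thrashes under interleaved streams, while several smaller sub-row buffers retain both streams. The stream, segment sizes, and LRU policy are illustrative assumptions, not the paper's exact design.

```python
# Toy simulation contrasting one full-row buffer with several smaller
# sub-row buffers in a DRAM bank (sizes and policy are assumptions).
from collections import OrderedDict

ROW_BYTES = 8192
SEGMENTS = 4                      # split each row into 4 sub-row segments
SEG_BYTES = ROW_BYTES // SEGMENTS

def hit_rate_single(addrs):
    hits, open_row = 0, None
    for a in addrs:
        row = a // ROW_BYTES
        hits += (row == open_row)
        open_row = row            # an activate fetches the whole row
    return hits / len(addrs)

def hit_rate_multi(addrs, nbuf=SEGMENTS):
    hits, lru = 0, OrderedDict()  # keys: (row, segment), kept in LRU order
    for a in addrs:
        key = (a // ROW_BYTES, (a % ROW_BYTES) // SEG_BYTES)
        if key in lru:
            hits += 1
            lru.move_to_end(key)
        else:
            if len(lru) >= nbuf:
                lru.popitem(last=False)   # evict least recently used segment
            lru[key] = True
    return hits / len(addrs)

# Interleaved streams from two "cores" touching different rows, as in a
# multicore workload: the single open row thrashes, sub-buffers hold both.
streams = [[r * ROW_BYTES + 64 * i for i in range(32)] for r in (3, 7)]
mix = [s[i] for i in range(32) for s in streams]
print("single row buffer hit rate:", hit_rate_single(mix))
print("four sub-row buffers hit rate:", hit_rate_multi(mix))
```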
Decoupled Architectures as a Low-Complexity Alternative to Out-of-order Execution
N. Crago, Sanjay J. Patel
In this paper, we present OUTRIDERHP, a novel implementation of a decoupled architecture that approaches the performance of contemporary out-of-order processors on parallel benchmarks while maintaining low hardware complexity. OUTRIDERHP leverages the compiler to separate a single thread of execution into memory-accessing and memory-consuming streams that can be executed concurrently, which we call strands. We identify loss-of-decoupling events, which cripple performance on traditional decoupled architectures, and design OUTRIDERHP to enable extraction of multiple strands and control speculation, which provide superior memory and functional unit latency tolerance. OUTRIDERHP outperforms a baseline in-order architecture by 26-220% and Decoupled Access/Execute by 7-172% when executing parallel benchmarks on an 8-core CMP configuration. OUTRIDERHP performs within 15% of higher-complexity out-of-order cores despite not utilizing large physical register files, dynamic scheduling, and register renaming hardware.
{"title":"Decoupled Architectures as a Low-Complexity Alternative to Out-of-order Execution","authors":"N. Crago, Sanjay J. Patel","doi":"10.1109/PACT.2011.28","DOIUrl":"https://doi.org/10.1109/PACT.2011.28","url":null,"abstract":"In this paper we present OUTRIDERHP, a novel implementation of a decoupled architecture that approaches the performance of contemporary out-of-order processors on parallel benchmarks while maintaining low hardware complexity. OUTRIDERHP leverages the compiler to separate a single thread of execution into memory-accessing and memory-consuming streams that can be executed concurrently, which we call strands. We identify loss-of-decoupling events which cripple performance on traditional decoupled architectures, and design OUTRIDERHP to enable extraction of multiple strands and control speculation which provide superior memory and functional unit latency tolerance. OUTRIDERHP outperforms a baseline in-order architecture by 26-220% and Decoupled Access/Execute by 7-172% when executing parallel benchmarks on an 8-core CMP configuration. OUTRIDERHP performs within 15% of higher-complexity out-of-order cores despite not utilizing large physical register files, dynamic scheduling, and register renaming hardware.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130479663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
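The strand decomposition in the abstract above follows the classic decoupled access/execute pattern, sketched below with a software analogue: a memory-accessing strand runs ahead and pushes loaded values through a bounded queue to a memory-consuming strand. The loop body, queue depth, and threading are illustrative assumptions, not the paper's hardware.

```python
# Minimal sketch of splitting one loop into an access strand and an execute
# strand joined by a queue, in the spirit of decoupled access/execute.
import threading, queue

DATA = list(range(1000))          # stands in for memory
q = queue.Queue(maxsize=16)       # bounded queue models the decoupling FIFO

def access_strand(indices):
    # Runs ahead, issuing "loads" and forwarding values to the consumer.
    for i in indices:
        q.put(DATA[i])            # a load that may miss; the queue hides latency
    q.put(None)                   # end-of-stream marker

def execute_strand(result):
    # Consumes loaded values without ever computing addresses itself.
    total = 0
    while (v := q.get()) is not None:
        total += v * v            # the "compute" half of the original loop
    result.append(total)

result = []
t1 = threading.Thread(target=access_strand, args=(range(0, 1000, 3),))
t2 = threading.Thread(target=execute_strand, args=(result,))
t1.start(); t2.start(); t1.join(); t2.join()
print("decoupled result:", result[0],
      "reference:", sum(i * i for i in range(0, 1000, 3)))
```

A loss-of-decoupling event in this picture is anything that forces the execute strand's result back into the access strand's address computation, stalling the run-ahead.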
Memory Architecture for Integrating Emerging Memory Technologies
Kun Fang, Long Chen, Zhao Zhang, Zhichun Zhu
Current main memory system design is severely limited by the decades-old synchronous DRAM architecture, which requires the memory controller to track the internal status of memory devices (chips) and schedule the timing of all device operations. This rigidity has become an obstacle to integrating emerging memory technologies such as PCM into existing memory systems, because their timing requirements are vastly different. Furthermore, with the trend of embedding memory controllers into processors, it is crucial to have interoperability among general-purpose processors and diverse memory modules. To address this issue, we propose a new memory architecture framework called universal memory architecture (UniMA). It enables the interoperability by decoupling the scheduling of device operations from the memory controller, using a bridge chip at each memory module to perform local scheduling. The new architecture may also help improve memory scalability, power efficiency, and bandwidth, as do previously proposed decoupled memory organizations. A major focus of this study is to evaluate the performance impact of local scheduling of device operations. We present a prototype implementation of UniMA on top of the DDRx memory bus, and then evaluate its efficiency with different workloads. The simulation results show that UniMA actually improves memory system efficiency for memory-intensive workloads due to increased parallelism among memory modules. The overall performance improvement over the conventional DDRx memory architecture is 3.1% on average. The performance of other workloads is reduced slightly, by 1.0% on average, due to a small increase of memory idle latency. In short, the prototype and evaluation demonstrate that it is possible to integrate diverse memory technologies into a single memory architecture with virtually no loss of overall performance.
{"title":"Memory Architecture for Integrating Emerging Memory Technologies","authors":"Kun Fang, Long Chen, Zhao Zhang, Zhichun Zhu","doi":"10.1109/PACT.2011.71","DOIUrl":"https://doi.org/10.1109/PACT.2011.71","url":null,"abstract":"Current main memory system design is severely limited by the decades-old synchronous DRAM architecture, which requires the memory controller to track the internal status of memory devices (chips) and schedule the timing of all device operations. This rigidity has become an obstacle of integrating emerging memory technologies such as PCM into existing memory systems, because their timing requirements are vastly different. Furthermore, with the trend of embedding memory controllers into processors, it is crucial to have interoperability among general-purpose processors and diverse memory modules. To address this issue, we propose a new memory architecture framework called universal memory architecture (UniMA). It enables the interoperability by decoupling the scheduling of device operations from memory controller, using a bridge chip at each memory module to perform local scheduling. The new architecture may also help improve memory scalability, power efficiency, and bandwidth as previously proposed decoupled memory organizations. A major focus of this study is to evaluate the performance impact of local scheduling of device operations. We present a prototype implementation of UniMA on top of DDRx memory bus, and then evaluate its efficiency with different workloads. The simulation results show that UniMA actually improves memory system efficiency for memory-intensive workloads due to increased parallelism among memory modules. The overall performance improvement over the conventional DDRx memory architecture is 3.1% on average. The performance of other workloads is reduced slightly, by 1.0% on average, due to a small increase of memory idle latency. In short, the prototype and evaluation demonstrate that it is possible to integrate diverse memory technologies into a single memory architecture with virtually no loss of overall performance.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130687013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
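The core decoupling described in the abstract above can be sketched as follows: a generic controller routes abstract requests, and a per-module bridge owns all device-specific timing state. The two device types and their timing numbers are illustrative assumptions, not UniMA's actual protocol.

```python
# Sketch of the UniMA split: the controller issues abstract requests and a
# per-module bridge performs local, device-specific scheduling.
# Timing values and device types below are illustrative assumptions.

class Bridge:
    """Local scheduler on a memory module; knows its own device's timing."""
    def __init__(self, name, row_activate, col_access):
        self.name, self.t_act, self.t_col = name, row_activate, col_access
        self.open_row = None

    def serve(self, row):
        # Device-specific state (open row) is hidden from the controller.
        cycles = self.t_col
        if row != self.open_row:
            cycles += self.t_act  # activate a new row first
            self.open_row = row
        return cycles

class UniMAController:
    """Generic controller: routes abstract requests, tracks no device state."""
    def __init__(self, modules):
        self.modules = modules

    def read(self, module_id, row):
        return self.modules[module_id].serve(row)

# A DRAM-like module and a slower PCM-like module behind one interface.
ctrl = UniMAController([Bridge("DRAM", row_activate=15, col_access=5),
                        Bridge("PCM", row_activate=60, col_access=8)])
for mod, row in [(0, 1), (0, 1), (1, 4), (1, 4), (0, 2)]:
    print(f"module {mod} row {row}: {ctrl.read(mod, row)} cycles")
```

The design point this illustrates is interoperability: adding a third memory technology means adding one more `Bridge`, not changing the controller.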
Compiler Directed Data Locality Optimization for Multicore Architectures
W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy
This paper presents and evaluates a cache hierarchy-aware code parallelization/mapping and scheduling strategy for multicore architectures. Our proposed parallelization/mapping strategy determines a loop iteration-to-core mapping by taking into account the data access pattern of an application and the on-chip cache hierarchy of a target architecture. The goal of this step is to maximize data locality at each level of caches while minimizing the data dependences across the cores. Our scheduling strategy, on the other hand, determines a schedule for the iterations assigned to each core in the target architecture, with the goal of satisfying all the data dependences in the code (both intra-core and inter-core) and reducing data reuse distances across the cores that share data. We formulate both the parallelization/mapping problem and the scheduling problem in a linear algebraic framework and solve them using the Farkas Lemma and Integer Fourier-Motzkin Elimination. To measure the effectiveness of our schemes, we implemented them in a compiler and tested them using eight multithreaded application programs on a multicore machine. Our results show that the proposed mapping scheme reduces cache miss rates at all levels of the cache hierarchy and improves execution time of applications significantly, compared to alternate approaches, and when supported by scheduling, the improvements in cache miss rates and execution time become much larger.
{"title":"Compiler Directed Data Locality Optimization for Multicore Architectures","authors":"W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy","doi":"10.1109/PACT.2011.24","DOIUrl":"https://doi.org/10.1109/PACT.2011.24","url":null,"abstract":"This paper presents and evaluates a cache hierarchy-aware code parallelization/mapping and scheduling strategy for multicore architectures. Our proposed parallelization/mapping strategy determines a loop iteration-to-core mapping by taking into account the data access pattern of an application and the on-chip cache hierarchy of a target architecture. The goal of this step is to maximize data locality at each level of caches while minimizing the data dependences across the cores. Our scheduling strategy on the other hand determines a schedule for the iterations assigned to each core in the target architecture, with the goal of satisfying all the data dependences in the code (both intra-core and inter-core) and reducing data reuse distances across the cores that share data. We formulate both parallelization/mapping problem and scheduling problem in a linear algebraic framework and solve them using the Farkas Lemma and the Integer Fourier-Motzkin Elimination. To measure the effectiveness of our schemes, we implemented them in a compiler and tested them using eight multithreaded application programs on a multicore machine. Our results show that the proposed mapping scheme reduces cache miss rates at all levels of the cache hierarchy and improves execution time of applications significantly, compared to alternate approaches, and when supported by scheduling, the improvements in cache miss rates and execution time become much larger.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130690034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
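A much-simplified illustration of the iteration-to-core mapping goal in the abstract above: assign iterations that touch the same cache block to the same core, and compare the blocks each core must fetch against a round-robin mapping. The access function, block size, and core count are illustrative assumptions; the paper itself derives mappings algebraically, which this heuristic sketch does not attempt.

```python
# Toy locality-aware iteration-to-core mapping vs. round-robin mapping.
# The loop's access function and the cache geometry are assumptions.

BLOCK = 16                        # array elements per cache block
CORES = 4

def access(i):
    # Data element touched by iteration i of a hypothetical loop: a[i // 2].
    return i // 2

def locality_aware_mapping(n_iters):
    mapping = {}
    for i in range(n_iters):
        block = access(i) // BLOCK
        mapping[i] = block % CORES  # co-locate iterations sharing a block
    return mapping

def round_robin_mapping(n_iters):
    return {i: i % CORES for i in range(n_iters)}

def blocks_touched_per_core(mapping, n_iters):
    touched = {c: set() for c in range(CORES)}
    for i in range(n_iters):
        touched[mapping[i]].add(access(i) // BLOCK)
    return {c: len(bs) for c, bs in touched.items()}

n = 256
print("locality-aware:", blocks_touched_per_core(locality_aware_mapping(n), n))
print("round-robin:   ", blocks_touched_per_core(round_robin_mapping(n), n))
```

Each core touches far fewer distinct blocks under the locality-aware mapping, which is the miss-rate reduction the paper's formal framework optimizes for.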
Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications
Wenjing Ma, S. Krishnamoorthy, G. Agrawal
Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that precludes empirical auto-tuning. Here, we focus on the challenge to auto-tuning presented by applications that require auto-tuning of not just a small number of distinct kernels, but a large number of kernels that exhibit similar computation and memory access characteristics and require optimization over similar problem spaces. We propose an auto-tuning method for tensor contraction functions on GPUs, based on parameterized micro-benchmarks. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2x over the version that used default optimizations without auto-tuning.
{"title":"Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications","authors":"Wenjing Ma, S. Krishnamoorthy, G. Agrawal","doi":"10.1145/2212908.2212938","DOIUrl":"https://doi.org/10.1145/2212908.2212938","url":null,"abstract":"Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that preclude empirical auto-tuning. Here, we focus on the challenge to auto-tuning presented by applications that require auto-tuning of not just a small number of distinct kernels, but a large number of kernels that exhibit similar computation and memory access characteristics and require optimization over similar problem spaces. We propose an auto-tuning method for tensor contraction functions on GPUs, based on parameterized micro-benchmarks. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2 over the version that used default optimizations without auto-tuning.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133580702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
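The idea in the abstract above is to tune one parameterized proxy rather than every kernel. A minimal sketch: time a parameterized micro-benchmark over a small search space and keep the best configuration for reuse by similar kernels. The proxy workload, parameter names, and search grid are illustrative assumptions, not the paper's GPU tensor-contraction benchmark.

```python
# Sketch of auto-tuning via a parameterized micro-benchmark: time one proxy
# over the parameter grid, then reuse the winner for similar kernels.
import itertools, timeit

def microbenchmark(block, unroll, size=1 << 14):
    # CPU stand-in for a tensor-contraction inner loop; the parameters mimic
    # tiling and unrolling knobs a GPU code generator would expose.
    data = list(range(size))
    def run():
        acc = 0
        for start in range(0, size, block):
            chunk = data[start:start + block]
            for j in range(0, len(chunk) - unroll + 1, unroll):
                for k in range(unroll):
                    acc += chunk[j + k]
        return acc
    return timeit.timeit(run, number=3)

def autotune(blocks=(64, 256, 1024), unrolls=(1, 2, 4)):
    timings = {(b, u): microbenchmark(b, u)
               for b, u in itertools.product(blocks, unrolls)}
    best = min(timings, key=timings.get)
    return best, timings

best, timings = autotune()
print("best (block, unroll):", best)
# Kernels with similar access patterns would now reuse `best` directly
# instead of re-running the whole empirical search per kernel.
```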
Compiling Dynamic Data Structures in Python to Enable the Use of Multi-core and Many-core Libraries
Bin Ren, G. Agrawal
Programmer productivity considerations are increasing the popularity of interpreted languages like Python. At the same time, for applications where performance is important, these languages clearly fall short, even on uniprocessors. In addition, the use of dynamic data structures in a language like Python makes it very hard to use emerging libraries for enabling execution on multi-core and many-core architectures. This paper presents a framework for compiling Python to use multi-core and many-core libraries. The key component of our framework involves a suite of algorithms for replacing dynamic and/or nested data structures by arrays, while minimizing unnecessary data copying costs. This involves a novel use of an existing partial redundancy elimination algorithm, development of a new demand-driven interprocedural partial redundancy algorithm, a data flow formulation for determining that the contents of the data structure are of the same type, and a linearization algorithm. We have evaluated our framework using data mining and two linear algebra applications written in pure Python. The key observations were: 1) the code generated by our framework is only 10% to 20% slower compared to the hand-written C code that invokes the same libraries, 2) our optimizations turn out to be significant for improving the performance in most cases, and 3) we outperform interpreted Python and the C++ code generated by an existing tool by one to two orders of magnitude.
{"title":"Compiling Dynamic Data Structures in Python to Enable the Use of Multi-core and Many-core Libraries","authors":"Bin Ren, G. Agrawal","doi":"10.1109/PACT.2011.13","DOIUrl":"https://doi.org/10.1109/PACT.2011.13","url":null,"abstract":"Programmer productivity considerations are increasing the popularity of interpreted languages like Python. At the same time, for applications where performance is important, these languages clearly lack even on uniprocessors. In addition, the use of dynamic data structures in a language like Python makes it very hard to use emerging libraries for enabling the execution on multi-core and many-core architectures. This paper presents a framework for compiling Python to use multi-core and many-core libraries. The key component of our framework involves a suite of algorithms for replacing dynamic and/or nested data structures by arrays, while minimizing unnecessary data copying costs. This involves a novel use of an existing partial redundancy elimination algorithm, development of a new demand-driven interprocedural partial redundancy algorithm, a data flow formulation for determining that the contents of the data structure are of the same type, and a linearization algorithm. We have evaluated our framework using data mining and two linear algebra applications written in pure Python. The key observations were: 1) the code generated by our framework is only 10% to 20% slower compared to the hand-written C code that invokes the same libraries, 2) our optimizations turn out to be significant for improving the performance in most cases, and 3) we outperform interpreted Python and the C++ code generated by an existing tool by one to two orders of magnitude.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115518174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
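The linearization step named in the abstract above can be illustrated directly in Python: a ragged nested list becomes a contiguous value array plus an offsets array, the flat layout that multi-core and many-core libraries expect. The helper names and the values/offsets representation here are hypothetical, not the paper's generated code.

```python
# Minimal sketch of linearizing a nested (ragged) Python list into a flat
# value array plus an offsets array. Helper names are hypothetical.
from array import array

def linearize(nested):
    values, offsets = array("d"), array("q", [0])
    for row in nested:
        values.extend(float(x) for x in row)
        offsets.append(len(values))   # prefix sums delimit each row
    return values, offsets

def row(values, offsets, i):
    # Recover row i from the flat layout without rebuilding nested lists.
    return values[offsets[i]:offsets[i + 1]]

ragged = [[1, 2, 3], [4], [5, 6]]
vals, offs = linearize(ragged)
print(list(vals))                # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(list(offs))                # [0, 3, 4, 6]
print(list(row(vals, offs, 2)))  # [5.0, 6.0]
```

The paper's interprocedural analyses exist precisely to decide where such conversions are needed and to avoid copying back and forth more than necessary.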
An Alternative Memory Access Scheduling in Manycore Accelerators
Yonggon Kim, Hyunseok Lee, John Kim
Memory controllers in graphics processing units (GPU) often employ out-of-order scheduling to maximize row access locality. However, this requires complex logic to enable out-of-order scheduling compared with in-order scheduling. To provide low-cost and low-complexity memory scheduling, we propose an alternative memory scheduling where the scheduling is performed not at the destination (i.e., the memory controller) but at the source (i.e., the cores). We propose two complementary techniques in source-based memory scheduling -- network congestion-aware source throttling and super packets, where multiple request packets are grouped together to create a single super packet. By combining these techniques, the performance across a wide range of applications is within 95% of the complex FR-FCFS on average, at significantly lower cost and complexity.
{"title":"An Alternative Memory Access Scheduling in Manycore Accelerators","authors":"Yonggon Kim, Hyunseok Lee, John Kim","doi":"10.1109/PACT.2011.37","DOIUrl":"https://doi.org/10.1109/PACT.2011.37","url":null,"abstract":"Memory controllers in graphics processing units (GPU) often employ out-of-order scheduling to maximize row access locality. However, this requires complex logic to enable out-of-order scheduling compared with in-order scheduling. To provide a low-cost and low-complexity memory scheduling, we propose an alternative memory scheduling where the memory scheduling is performed not at the destination (i.e., memory controller) but is done at the source (i.e., the cores). We propose two complementary techniques in source-based memory scheduling -- network congestion-aware source throttling and super packets, where multiple request packets are grouped together to create a single super packet. By combing these techniques, the performance across a wide range of application is within 95% of the complex FR-FCFS on average and at significantly lower cost and complexity.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114477105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
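The two source-side techniques from the abstract above are easy to sketch together: a per-core scheduler groups requests into super packets and withholds injection when congestion is high. The packet size, congestion threshold, and congestion signal are illustrative assumptions, not the paper's hardware mechanism.

```python
# Sketch of source-based scheduling: super-packet grouping plus
# congestion-aware source throttling (thresholds are assumptions).

SUPER_PACKET_SIZE = 4       # requests grouped into one network packet
CONGESTION_LIMIT = 0.75     # stop injecting above this utilization

class SourceScheduler:
    def __init__(self):
        self.pending = []   # requests waiting to be grouped at the core

    def enqueue(self, request):
        self.pending.append(request)

    def inject(self, congestion):
        """Return a super packet to inject this cycle, or None."""
        if congestion > CONGESTION_LIMIT:
            return None     # source throttling: back off under congestion
        if len(self.pending) < SUPER_PACKET_SIZE:
            return None     # wait to amortize per-packet overhead
        packet, self.pending = (self.pending[:SUPER_PACKET_SIZE],
                                self.pending[SUPER_PACKET_SIZE:])
        return packet       # consecutive requests keep row locality at the MC

src = SourceScheduler()
for addr in range(0, 8 * 64, 64):
    src.enqueue(("read", addr))
for cycle, congestion in enumerate([0.2, 0.9, 0.5, 0.3]):
    print(f"cycle {cycle} (cong {congestion}):", src.inject(congestion))
```

Because a super packet carries consecutive requests from one core, a simple in-order controller still sees row-local bursts, which is how the scheme approaches FR-FCFS quality without its logic.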
Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments
S. M. Hassan, Dhruv Choudhary, M. Rasquinha, S. Yalamanchili
The presence of multiple memory controllers (MCs) and their integration into the on-chip network fabric creates a highly concurrent system that can support significant levels of memory-level parallelism (MLP) across cores. This work exposes the trade-off among DRAM parameters, bank-level parallelism (BLP), and row-buffer hit rate, and quantifies the amount of effective BLP necessary to approach a 100% hit rate. We further study how this trade-off can be controlled and propose a class of global (system-level) and local (within an MC) address mappings that can be tuned to optimize performance across a set of multiprogrammed benchmarks.
{"title":"Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments","authors":"S. M. Hassan, Dhruv Choudhary, M. Rasquinha, S. Yalamanchili","doi":"10.1109/PACT.2011.33","DOIUrl":"https://doi.org/10.1109/PACT.2011.33","url":null,"abstract":"The presence of multiple MCs and their integration into the on-chip network fabric creates a highly concurrent system that can support significant levels of memory level parallelism (MLP) across cores. This work exposes the trade-off between DRAM parameters, bank level parallelism (BLP), and row buffer hit rate that exposes the amount of effective BLP that is necessary to approximate a 100% hit rate. We further study how this trade-off can be controlled and propose a class of global (system) and local (within an MC) address mappings that can be tuned to optimize the performance across a set of multiprogrammed benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"14 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120904947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
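The locality-versus-parallelism knob in the abstract above lives in the address mapping. The sketch below compares two mappings on a streaming access pattern: row interleaving concentrates traffic on few banks with near-perfect row hits, while block interleaving spreads it across all banks (more BLP) at some cost in hits. The geometry and access stream are illustrative assumptions.

```python
# Toy comparison of two address mappings: row-interleaving (row locality)
# vs. block-interleaving (bank-level parallelism). Sizes are assumptions.

BANKS, ROW_BYTES, BLOCK = 4, 4096, 64

def map_row_interleaved(addr):
    row = addr // ROW_BYTES
    return row % BANKS, row // BANKS           # (bank, row-in-bank)

def map_block_interleaved(addr):
    blk = addr // BLOCK                        # consecutive blocks rotate banks
    return blk % BANKS, addr // (ROW_BYTES * BANKS)

def stats(mapping, addrs):
    open_rows, hits, banks = {}, 0, set()
    for a in addrs:
        bank, row = mapping(a)
        banks.add(bank)
        hits += (open_rows.get(bank) == row)
        open_rows[bank] = row
    return {"row_hit_rate": hits / len(addrs), "banks_used": len(banks)}

# Sequential streaming reads; richer mixes shift the balance further.
stream = list(range(0, 2 * ROW_BYTES, BLOCK))
print("row-interleaved:  ", stats(map_row_interleaved, stream))
print("block-interleaved:", stats(map_block_interleaved, stream))
```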
Exploiting Mutual Awareness between Prefetchers and On-chip Networks in Multi-cores
Junghoon Lee, Minjeong Shin, H. Kim, John Kim, Jaehyuk Huh
The unique characteristics of prefetch traffic have not been considered in on-chip network design for multicore architectures. Most prefetchers are oblivious to network congestion when generating prefetch requests. In this work, we investigate the interaction between prefetchers and on-chip networks and exploit the synergy of these two components in multi-core architectures. We explore prefetch-aware on-chip networks that differentiate between prefetch and demand traffic by prioritizing demand traffic. In addition, we propose a prefetch control mechanism based on network congestion. Our evaluations show that the combination of the proposed prefetch-aware router architecture and congestion-sensitive prefetch control improves the performance of benchmarks by 11-13% on average, and by up to 30% on some workloads.
{"title":"Exploiting Mutual Awareness between Prefetchers and On-chip Networks in Multi-cores","authors":"Junghoon Lee, Minjeong Shin, H. Kim, John Kim, Jaehyuk Huh","doi":"10.1109/PACT.2011.27","DOIUrl":"https://doi.org/10.1109/PACT.2011.27","url":null,"abstract":"The unique characteristics of prefetch traffic have not been considered in on-chip network design for multicore architectures. Most prefetchers are often oblivious to the network congestion when generating prefetech requests. In this work, we investigate the interaction between prefetchers and on-chip networks and exploit the synergy of these two components in multi-core architectures. We explore prefetchaware on-chip networks that differentiates between prefetch and demand traffic by prioritizing demand traffic. In addition, we propose prefetch control mechanism based on network congestion. Our evaluations show that the combination of the proposed prefetch-aware router architecture and congestion sensitive prefetch control improves the performance of benchmarks by 11-13% on average, up to 30% on some of the workloads.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
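The two mechanisms in the abstract above compose naturally: the router arbitrates demand packets ahead of prefetch packets, and the prefetcher suppresses requests when it observes congestion. The queue structure, congestion signal, and threshold below are illustrative assumptions, not the paper's router design.

```python
# Sketch of demand-over-prefetch arbitration plus congestion-aware
# prefetch control (queue sizes and threshold are assumptions).
from collections import deque

class PrefetchAwareRouter:
    def __init__(self):
        self.demand, self.prefetch = deque(), deque()

    def push(self, pkt):
        (self.demand if pkt["type"] == "demand" else self.prefetch).append(pkt)

    def arbitrate(self):
        # Demand traffic is latency-critical, so it always wins arbitration.
        if self.demand:
            return self.demand.popleft()
        return self.prefetch.popleft() if self.prefetch else None

    def congestion(self):
        return len(self.demand) + len(self.prefetch)

def maybe_prefetch(router, addr, threshold=4):
    # Congestion-aware prefetch control: drop the prefetch when the network
    # is already backed up, instead of adding speculative traffic.
    if router.congestion() < threshold:
        router.push({"type": "prefetch", "addr": addr})

router = PrefetchAwareRouter()
for a in range(0, 6 * 64, 64):
    router.push({"type": "demand", "addr": a})
    maybe_prefetch(router, a + 64)

order = []
while (pkt := router.arbitrate()) is not None:
    order.append(pkt["type"])
print(order)  # all demand packets drain before any surviving prefetches
```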