
2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT): Latest Publications

Analyzing cache performance bottlenecks of STM applications and addressing them with compiler's help
Sandya Mannarswamy, R. Govindarajan
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock-based synchronization. However, adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose compiler-driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads of STM. We show that LDC is effective in improving the cache behavior of STM applications, reducing the dcache miss latency and improving execution time performance.
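The abstract does not detail the LDC transformation itself, but the layout idea it targets can be sketched. The following hedged C++ sketch is illustrative only (Node, NodeLDC, and the orec table are invented names): in a word-based STM that keeps ownership records (orecs) in a side table, a transactional access touches two cache lines, one for the orec and one for the data, whereas colocating the orec with the data it guards keeps both in a single line.

```cpp
// Hedged sketch of the lock-data colocation idea; the paper's actual
// compiler pass is not shown in this abstract.
#include <atomic>
#include <cstdint>

// Conventional layout: orecs live in a hash-indexed side table, so a
// transactional access touches two cache lines (orec + data).
std::atomic<uint64_t> g_orec_table[1 << 16];

inline std::atomic<uint64_t>& orec_for(const void* addr) {
    return g_orec_table[(reinterpret_cast<uintptr_t>(addr) >> 4) & 0xFFFF];
}

struct Node {                      // field names are illustrative only
    int   key;
    Node* next;
};

// Colocated layout (what an LDC-style pass might emit): the orec is
// embedded next to the data it guards, padded to one 64-byte cache line.
struct alignas(64) NodeLDC {
    std::atomic<uint64_t> orec;    // version/lock word for this node
    int      key;
    NodeLDC* next;
};
```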
Citations: 0
Scalable thread scheduling and global power management for heterogeneous many-core architectures
Jonathan A. Winter, D. Albonesi, C. Shoemaker
Future many-core microprocessors are likely to be heterogeneous, by design or due to variability and defects. The latter type of heterogeneity is especially challenging due to its unpredictability. To minimize the performance and power impact of these hardware imperfections, the runtime thread scheduler and global power manager must be nimble enough to handle such random heterogeneity. With hundreds of cores expected on a single die in the future, these algorithms must provide high power-performance efficiency, yet remain scalable with low runtime overhead. This paper presents a range of scheduling and power management algorithms and performs a detailed evaluation of their effectiveness and scalability on heterogeneous many-core architectures with up to 256 cores. We also conduct a limit study on the potential benefits of coordinating scheduling and power management and demonstrate that coordination yields little benefit. We highlight the scalability limitations of previously proposed thread scheduling algorithms that were designed for small-scale chip multiprocessors and propose a Hierarchical Hungarian Scheduling Algorithm that dramatically reduces the scheduling overhead without loss of accuracy. Finally, we show that the high computational requirements of prior global power management algorithms based on linear programming make them infeasible for many-core chips, and that an algorithm that we call Steepest Drop achieves orders of magnitude lower execution time without sacrificing power-performance efficiency.
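The abstract names the Steepest Drop algorithm without defining it; the C++ sketch below shows one plausible greedy reading, under the assumption that each core exposes discrete DVFS levels with known power/performance points and the manager repeatedly takes the step that sheds the most power per unit of performance lost until the chip meets its budget. All names and the exact policy are assumptions, not the paper's algorithm.

```cpp
// Hedged sketch of a greedy "steepest drop" style global power manager.
#include <cstddef>
#include <vector>

struct Core {
    std::vector<double> power;  // power[l] at DVFS level l (descending)
    std::vector<double> perf;   // performance at the same levels
    std::size_t level = 0;      // current level, 0 = fastest
};

void steepest_drop(std::vector<Core>& cores, double budget) {
    auto total = [&] {
        double p = 0;
        for (const auto& c : cores) p += c.power[c.level];
        return p;
    };
    while (total() > budget) {
        std::size_t best = cores.size();
        double best_ratio = -1;
        for (std::size_t i = 0; i < cores.size(); ++i) {
            const auto& c = cores[i];
            if (c.level + 1 >= c.power.size()) continue;   // already lowest
            double dp = c.power[c.level] - c.power[c.level + 1];
            double dq = c.perf[c.level]  - c.perf[c.level + 1];
            double ratio = dp / (dq + 1e-9);  // power shed per perf lost
            if (ratio > best_ratio) { best_ratio = ratio; best = i; }
        }
        if (best == cores.size()) break;      // nothing left to lower
        ++cores[best].level;
    }
}
```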
Citations: 159
Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information
Georgios Tournavitis, Björn Franke
In recent years multi-core computer systems have left the realm of high-performance computing, and virtually all of today's desktop computers and embedded computing systems are equipped with several processing cores. Still, no single parallel programming model has found widespread support, and parallel programming remains an art for the majority of application programmers. In addition, there exists a plethora of sequential legacy applications for which automatic parallelization is the only hope of benefiting from the increased processing power of modern multi-core systems. In the past, automatic parallelization largely focused on data parallelism. In this paper we present a novel approach to extracting and exploiting pipeline parallelism from sequential applications. We use profiling to overcome the limitations of static data and control flow analysis, enabling more aggressive parallelization. Our approach is orthogonal to existing automatic parallelization approaches, and additional data parallelism may be exploited in the individual pipeline stages. The key contribution of this paper is a whole-program representation that supports profiling, parallelism extraction and exploitation. We demonstrate how this enhances conventional pipeline parallelization by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. We have evaluated our methodology on a set of multimedia and stream processing benchmarks and demonstrate speedups of up to 4.7 on an eight-core Intel Xeon machine.
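As a concrete picture of what pipeline parallelization produces, the hedged C++ sketch below shows the runtime structure such a tool targets: each loop stage becomes a thread, and stages communicate through bounded blocking queues. In the paper the stage boundaries and replication factors come from profile-driven analysis; here they are fixed by hand for illustration.

```cpp
// Minimal three-stage pipeline: produce -> transform -> consume.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class BoundedQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    const std::size_t cap_ = 64;       // bounds memory between stages
public:
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(v));
        cv_.notify_all();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        cv_.notify_all();
        return v;
    }
};

int main() {
    BoundedQueue<int> s1_to_s2, s2_to_s3;
    std::thread s1([&] { for (int i = 0; i < 100; ++i) s1_to_s2.push(i); });
    std::thread s2([&] { for (int i = 0; i < 100; ++i) s2_to_s3.push(s1_to_s2.pop() * 2); });
    std::thread s3([&] { long sum = 0;
                         for (int i = 0; i < 100; ++i) sum += s2_to_s3.pop();
                         (void)sum; });
    s1.join(); s2.join(); s3.join();
}
```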
近年来,多核计算机系统已经离开了高性能计算的领域,今天几乎所有的台式计算机和嵌入式计算系统都配备了几个处理核心。但是,没有任何一种并行编程模型得到广泛的支持,并行编程对于大多数应用程序程序员来说仍然是一门艺术。此外,存在大量的顺序遗留应用程序,对于这些应用程序,自动并行化是从现代多核系统不断增强的处理能力中获益的唯一希望。过去,自动并行化主要关注数据并行性。本文提出了一种从顺序应用程序中提取和利用管道并行性的新方法。我们使用剖析来克服静态数据和控制流分析的限制,从而实现更积极的并行化。我们的方法与现有的自动并行化方法是正交的,并且可以在各个管道阶段利用额外的数据并行性。本文的主要贡献是支持分析、并行提取和开发的整个程序表示。我们演示了这如何增强传统的管道并行化,包括以统一和自动的方式支持多级循环和管道阶段复制。我们在一组多媒体和流处理基准测试中评估了我们的方法,并在八核Intel Xeon机器上演示了高达4.7的加速。
{"title":"Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information","authors":"Georgios Tournavitis, Björn Franke","doi":"10.1145/1854273.1854321","DOIUrl":"https://doi.org/10.1145/1854273.1854321","url":null,"abstract":"In recent years multi-core computer systems have left the realm of high-performance computing and virtually all of today's desktop computers and embedded computing systems are equipped with several processing cores. Still, no single parallel programming model has found widespread support and parallel programming remains an art for the majority of application programmers. In addition, there exists a plethora of sequential legacy applications for which automatic parallelization is the only hope to benefit from the increased processing power of modern multi-core systems. In the past automatic parallelization largely focused on data parallelism. In this paper we present a novel approach to extracting and exploiting pipeline parallelism from sequential applications. We use profiling to overcome the limitations of static data and control flow analysis enabling more aggressive parallelization. Our approach is orthogonal to existing automatic parallelization approaches and additional data parallelism may be exploited in the individual pipeline stages. The key contribution of this paper is a whole-program representation that supports profiling, parallelism extraction and exploitation. We demonstrate how this enhances conventional pipeline parallelization by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. We have evaluated our methodology on a set of multimedia and stream processing benchmarks and demonstrate speedups of up to 4.7 on a eight-core Intel Xeon machine.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126765487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
Dynamically managed multithreaded reconfigurable architectures for chip multiprocessors
Matthew A. Watkins, D. Albonesi
Prior work has demonstrated that reconfigurable logic can significantly benefit certain applications. However, reconfigurable architectures have traditionally suffered from high area overhead and limited application coverage. We present a dynamically managed multithreaded reconfigurable architecture consisting of multiple clusters of shared reconfigurable fabrics that greatly reduces the area overhead of reconfigurability while still offering the same power efficiency and performance benefits. Like other shared SMT and CMP resources, the dynamic partitioning of the reconfigurable resource among sharing threads, along with the co-scheduling of threads among different reconfigurable clusters, must be intelligently managed for the full benefits of the shared fabrics to be realized. We propose a number of sophisticated dynamic management approaches, including the application of machine learning, multithreaded phase-based management, and stability detection. Overall, we show that, with our dynamic management policies, multithreaded reconfigurable fabrics can achieve better energy × delay², at far less area and power, than providing each core with a much larger private fabric. Moreover, our approach achieves dramatically higher performance and energy-efficiency for particular workloads compared to what can be ideally achieved by allocating the fabric area to additional cores.
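The abstract does not specify any of the management policies it evaluates; as a baseline illustration, the C++ sketch below implements one simple hypothetical policy: divide the fabric blocks of a cluster among co-scheduled threads in proportion to each thread's measured benefit per block during the last phase. Function and variable names are invented.

```cpp
// Hedged sketch of a proportional-share fabric partitioning policy;
// the paper's learned and phase-based policies are not reproduced here.
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> partition_fabric(const std::vector<double>& benefit, int blocks) {
    double total = std::accumulate(benefit.begin(), benefit.end(), 0.0);
    std::vector<int> alloc(benefit.size(), 0);
    if (total <= 0) return alloc;                 // nobody profits: give nothing
    int given = 0;
    for (std::size_t i = 0; i < benefit.size(); ++i) {
        alloc[i] = static_cast<int>(blocks * benefit[i] / total);
        given += alloc[i];
    }
    // hand leftover blocks (from rounding down) to the highest-benefit thread
    std::size_t best = 0;
    for (std::size_t i = 1; i < benefit.size(); ++i)
        if (benefit[i] > benefit[best]) best = i;
    alloc[best] += blocks - given;
    return alloc;
}
```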
Citations: 9
Design and implementation of the PLUG architecture for programmable and efficient network lookups
Amit Kumar, Lorenzo De Carli, Sung Jin Kim, M. Kruijf, K. Sankaralingam, Cristian Estan, S. Jha
This paper proposes a new architecture called Pipelined LookUp Grid (PLUG) that can perform data structure lookups in network processing. PLUGs are programmable and, through simplicity, achieve power efficiency. We draw upon the insight that data structure lookups have natural structure that can be statically determined and exploited. The PLUG execution model transforms data-structure lookups into pipelined stages of computation and associates small code-blocks with data. The PLUG architecture is a tiled architecture with each tile consisting predominantly of SRAMs, a lightweight no-buffering router, and an array of lightweight computation cores. Using a principle of fixed delays in the execution model, the architecture is contention-free and completely statically scheduled, thus achieving high energy efficiency. The architecture enables rapid deployment of new network protocols and generalizes as a data-structure accelerator. This paper describes the PLUG architecture and the compiler, and evaluates our RTL prototype PLUG chip synthesized on a 55nm technology library. We evaluate six diverse high-end network processing workloads including IPv4, IPv6, and Ethernet forwarding. We show that at 55nm, a 16-tile PLUG occupies 58 mm², provides 4MB of on-chip storage, and sustains a clock frequency of 1 GHz. This translates to 1 billion lookups per second, a latency of 18ns to 219ns, and average power of less than 1 watt.
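To make the execution model concrete, the hedged C++ sketch below splits a lookup into pipelined stages, each owning one level of the data structure together with the small code-block that processes it. The 3-level, 8-bit-stride IPv4 trie is an illustrative stand-in, not the paper's implementation, and real PLUG stages map onto SRAM tiles with fixed delays rather than plain functions.

```cpp
// Each trie level = one stage's data plus its code-block. Tables are
// assumed sized so every computed index is in range.
#include <array>
#include <cstdint>
#include <vector>

struct Stage {
    std::vector<uint32_t> next_hop;   // 0 = no entry at this level
    std::vector<int32_t>  child;      // block index into next stage, -1 = leaf
};

uint32_t lookup(const std::array<Stage, 3>& stages, uint32_t ip) {
    uint32_t hop = 0;
    int32_t idx = (ip >> 24) & 0xFF;  // stage 0 indexed by the top byte
    for (int s = 0; s < 3; ++s) {
        if (idx < 0) break;           // left the trie at an earlier level
        const Stage& st = stages[s];
        if (st.next_hop[idx]) hop = st.next_hop[idx];   // longest match so far
        int32_t c = st.child[idx];
        // next stage's index: child block base plus the next 8-bit stride
        idx = (c < 0) ? -1 : (c << 8) | ((ip >> (16 - 8 * s)) & 0xFF);
    }
    return hop;
}
```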
Citations: 9
Power and thermal characterization of POWER6 system
Víctor Jiménez, F. Cazorla, R. Gioiosa, M. Valero, C. Boneti, E. Kursun, Chen-Yong Cher, C. Isci, A. Buyuktosunoglu, P. Bose
Controlling power consumption and temperature is a major concern for modern computing systems. In this work we characterize the thermal behavior and power consumption of an IBM POWER6™-based system. We perform the characterization at several levels: application, operating system, and hardware, both when the system is idle and under load. At the hardware level, we report a 25% reduction in total system power consumption by using the processor's low-power mode. We also study the effect of the hardware thread prioritization mechanism provided by POWER6 on different workloads and how this mechanism can be used to limit power consumption. At the OS level, we analyze the power reduction techniques implemented in the Linux kernel, such as the tickless kernel and the CPU idle power manager. At the application level, we characterize the power consumption and temperature of two sets of benchmarks (METbench and SPEC CPU2006) and study the effect of workload characteristics on power consumption and core temperature. From this characterization we derive a model based on performance counters that allows us to predict the total power consumption of the POWER6 system with an average error under 3% for CMP and 5% for SMT. To the best of our knowledge, this is the first power model of a system including CMP+SMT processors. Finally, we show that the static decision on whether to consolidate tasks onto the same core/chip, as currently done in Linux, can be improved by dynamically considering the low-power capabilities of the underlying architecture and the characteristics of the application (up to a 5× improvement in ED²P).
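The abstract does not list the counters or weights of the derived model; the C++ sketch below shows the general shape such a counter-based model usually takes, assuming predicted power is an idle baseline plus a weighted sum of per-interval counter rates, with the weights fit offline by regression against measured power.

```cpp
// Hedged sketch of a linear performance-counter power model:
//   P ≈ P_idle + Σ w_i · rate_i
#include <cassert>
#include <cstddef>
#include <vector>

double predict_power(double idle_watts,
                     const std::vector<double>& weights,   // fit by regression
                     const std::vector<double>& rates) {   // counter events/sec
    assert(weights.size() == rates.size());
    double p = idle_watts;
    for (std::size_t i = 0; i < weights.size(); ++i)
        p += weights[i] * rates[i];
    return p;
}
```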
Citations: 27
The potential of using dynamic information flow analysis in data value prediction
Walid J. Ghandour, Haitham Akkary, Wes Masri
Value prediction is a technique to increase parallelism by attempting to overcome serialization constraints caused by true data dependences. By predicting the outcome of an instruction before it executes, value prediction allows data dependent instructions to issue and execute speculatively, hence increasing parallelism when the prediction is correct. In case of a misprediction, the execution is redone with the corrected value. If the benefit from increased parallelism outweighs the misprediction recovery penalty, overall performance could be improved. Enhancing performance with value prediction therefore requires highly accurate prediction methods. Most existing general value prediction techniques are local and future outputs of an instruction are predicted based on outputs from previous executions of the same instruction.
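For contrast with the paper's information-flow-based approach, the C++ sketch below shows the kind of local predictor the abstract describes: a classic last-value/stride predictor that guesses an instruction's next output purely from that instruction's previous outputs, indexed by PC.

```cpp
// Hedged sketch of a conventional local (stride) value predictor.
#include <cstdint>
#include <unordered_map>

struct StrideEntry {
    uint64_t last   = 0;
    int64_t  stride = 0;
    bool     seen   = false;
};

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table_;   // keyed by instruction PC
public:
    uint64_t predict(uint64_t pc) {
        auto& e = table_[pc];
        return e.seen ? e.last + e.stride : 0;          // 0 = no confident guess
    }
    void train(uint64_t pc, uint64_t actual) {          // called at commit
        auto& e = table_[pc];
        if (e.seen) e.stride = static_cast<int64_t>(actual - e.last);
        e.last = actual;
        e.seen = true;
    }
};
```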
Citations: 13
Raising the level of many-core programming with compiler technology - meeting a grand challenge
Wen-mei W. Hwu
Modern GPUs and CPUs are massively parallel, many-core processors. While application developers for these many-core chips are reporting 10X-100X speedup over sequential code on traditional microprocessors, the current practice of many-core programming based on OpenCL, CUDA, and OpenMP puts strain on software development, testing and support teams. According to the semiconductor industry roadmap, these processors could scale up to over 1,000X speedup over single cores by the end of the year 2016. Such a dramatic performance difference between parallel and sequential execution will motivate an increasing number of developers to parallelize their applications. Today, an application programmer has to understand the desirable parallel programming idioms, manually work around potential hardware performance pitfalls, and restructure their application design in order to achieve their performance objectives on many-core processors. In this presentation, I will discuss why advanced compiler functionalities have not found traction with the developer communities, what the industry is doing today to try to address the challenges, and how the academic community can contribute to this exciting revolution.
Citations: 0
Simple and fast biased locks
N. Vasudevan, Kedar S. Namjoshi, S. Edwards
Locks are used to ensure exclusive access to shared memory locations. Unfortunately, lock operations are expensive, so much work has been done on optimizing their performance for common access patterns. One such pattern is found in networking applications, where there is a single thread dominating lock accesses. An important special case arises when a single-threaded program calls a thread-safe library that uses locks. An effective way to optimize the dominant-thread pattern is to “bias” the lock implementation so that accesses by the dominant thread have negligible overhead. We take this approach in this work: we simplify and generalize existing techniques for biased locks, producing a large design space with many trade-offs. For example, if we assume the dominant process acquires the lock infinitely often (a reasonable assumption for packet processing), it is possible to make the dominant process perform a lock operation without expensive fence or compare-and-swap instructions. This gives a very low overhead solution; we confirm its efficacy by experiments. We show how these constructions can be extended for lock reservation, re-reservation, and to reader-writer situations.
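A minimal C++ sketch of the asymmetry the abstract describes appears below. The dominant thread's fast path uses only plain atomic loads and stores (no compare-and-swap), while other threads serialize on a mutex and perform a Dekker-style handshake. For simplicity the sketch keeps sequentially consistent atomics, which still imply fences; eliding even those, under assumptions such as the dominant thread acquiring the lock infinitely often, is the paper's contribution and is not reproduced here.

```cpp
// Hedged sketch of a biased (asymmetric) lock, assuming one known
// dominant thread and rare accesses from all others.
#include <atomic>
#include <mutex>
#include <thread>

class BiasedLock {
    std::atomic<bool> dominant_wants_{false};
    std::atomic<bool> other_holds_{false};
    std::mutex slow_;                       // serializes non-dominant threads
public:
    void lock_dominant() {                  // fast path: store + load, no CAS
        dominant_wants_.store(true);
        while (other_holds_.load())         // wait out a slow-path holder
            std::this_thread::yield();
    }
    void unlock_dominant() { dominant_wants_.store(false); }

    void lock_other() {                     // slow path: Dekker-style handshake
        slow_.lock();
        other_holds_.store(true);
        while (dominant_wants_.load()) {    // back off while the owner wants it
            other_holds_.store(false);
            while (dominant_wants_.load()) std::this_thread::yield();
            other_holds_.store(true);
        }
    }
    void unlock_other() {
        other_holds_.store(false);
        slow_.unlock();
    }
};
```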
Citations: 35
MapCG: Writing parallel program portable between CPU and GPU
Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, Haibo Lin
Graphics Processing Units (GPUs) have recently been playing an important role in the general-purpose computing market. The common approach to programming GPUs today is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to write a specific version of the code for each potential target architecture. This results in high development and maintenance cost.
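MapCG's actual API is not shown in this truncated abstract; the hedged C++ sketch below illustrates the general portability idea behind MapReduce-style frameworks: the programmer writes map and reduce once against a neutral interface, and the framework supplies per-architecture backends. All names are invented, and only a sequential CPU backend is shown; a CUDA backend would compile the same user code for the device.

```cpp
// Hedged sketch of a portable map/reduce interface (word count example).
#include <map>
#include <string>
#include <vector>

struct WordCount {
    using Key = std::string;
    using Value = int;
    // emit one (word, 1) pair per input word
    static void map(const std::string& word, std::multimap<Key, Value>& out) {
        out.emplace(word, 1);
    }
    static Value reduce(Value a, Value b) { return a + b; }
};

template <typename Job>
std::map<typename Job::Key, typename Job::Value>
run_cpu(const std::vector<std::string>& inputs) {   // one of several backends
    std::multimap<typename Job::Key, typename Job::Value> inter;
    for (const auto& in : inputs) Job::map(in, inter);
    std::map<typename Job::Key, typename Job::Value> result;
    for (const auto& [k, v] : inter) {
        auto it = result.find(k);
        if (it == result.end()) result.emplace(k, v);
        else it->second = Job::reduce(it->second, v);
    }
    return result;
}
```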
Citations: 152