
Latest publications: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)

A case for NUMA-aware contention management on multicore systems
S. Blagodurov, Sergey Zhuravlev, Mohammad Dashti, Alexandra Fedorova
On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, meaning they feature non-uniform memory access latencies and multiple memory controllers. We discovered that contention management is much more difficult on NUMA systems, because the scheduler must consider not only the placement of threads, but also the placement of their memory. This is mostly required to eliminate contention for memory controllers, contrary to the popular belief that remote access latency is the dominant concern. In this work we quantify the performance effects of resource contention and remote access latency. This analysis inspires the design of a contention-aware scheduling algorithm for NUMA systems. This algorithm significantly outperforms a previously proposed NUMA-unaware algorithm as well as the default Linux scheduler. We also investigate memory migration strategies, which are a necessary part of the NUMA contention-aware scheduling algorithm. Finally, we propose and evaluate a new contention management algorithm that is priority-aware.
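The placement heuristic at the heart of such a scheduler can be illustrated with a small sketch. The code below is a minimal illustration, not the paper's algorithm: the thread descriptor, the miss-rate metric, and the round-robin spreading rule are all invented for the example; the real scheduler additionally migrates each thread's memory to its new domain.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy thread descriptor: higher missRate ~ more memory-intensive.
struct Thread {
    int id;
    double missRate;   // last-level cache misses per 1000 instructions
    int domain = -1;   // assigned NUMA domain
};

// Spread the most memory-intensive threads round-robin across domains,
// so no single memory controller absorbs all the pressure. The paper's
// scheduler also migrates each thread's pages to its new domain; here we
// only note that a migration would be issued.
void placeThreads(std::vector<Thread>& threads, int numDomains) {
    std::sort(threads.begin(), threads.end(),
              [](const Thread& a, const Thread& b) {
                  return a.missRate > b.missRate;
              });
    for (std::size_t i = 0; i < threads.size(); ++i) {
        threads[i].domain = static_cast<int>(i) % numDomains;
        std::printf("thread %d -> domain %d (migrate its pages too)\n",
                    threads[i].id, threads[i].domain);
    }
}

int main() {
    std::vector<Thread> threads = {
        {0, 42.0}, {1, 3.5}, {2, 38.0}, {3, 1.2}};
    placeThreads(threads, /*numDomains=*/2);
}
```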
{"title":"A case for NUMA-aware contention management on multicore systems","authors":"S. Blagodurov, Sergey Zhuravlev, Mohammad Dashti, Alexandra Fedorova","doi":"10.1145/1854273.1854350","DOIUrl":"https://doi.org/10.1145/1854273.1854350","url":null,"abstract":"On multicore systems contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as lastlevel caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers. We discovered that contention management is a lot more difficult on NUMA systems, because the scheduler must not only consider the placement of threads, but also the placement of their memory. This is mostly required to eliminate contention for memory controllers contrary to the popular belief that remote access latency is the dominant concern. In this work we quantify the effects on performance imposed by resource contention and remote access latency. This analysis inspires the design of a contention-aware scheduling algorithm for NUMA systems. This algorithm significantly outperforms a NUMA-unaware algorithm proposed before as well as the default Linux scheduler. We also investigate memory migration strategies, which are the necessary part of the NUMA contention-aware scheduling algorithm. Finally, we propose and evaluate a new contention management algorithm that is priority-aware.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128178104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 291
Scaling of the PARSEC benchmark inputs
Christian Bienia, Kai Li
A good benchmark suite should provide users with inputs that have multiple levels of fidelity. We present a framework that takes the novel view that benchmark inputs should be considered approximations of their original, full-sized inputs. The paper demonstrates how to use the proposed methodology to create several simulation input sets for the PARSEC benchmarks and how to quantify and measure their approximation error. We offer guidelines that PARSEC users can use to choose suitable simulation inputs for their scientific studies in a way that maximizes the accuracy of the simulation subject to a time constraint.
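The selection problem the guidelines address — maximizing simulation accuracy under a time budget — can be sketched as a simple search over candidate input sets. Everything below (the input-set names, error values, and cost estimates) is hypothetical; the paper derives real approximation errors from program characteristics.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical input sets: approximation error vs. simulation cost.
struct InputSet {
    const char* name;
    double error;     // approximation error vs. the full-sized input
    double simHours;  // estimated simulation time
};

// Pick the most accurate (lowest-error) input that fits the time budget,
// mirroring "maximize accuracy subject to a time constraint".
const InputSet* choose(const std::vector<InputSet>& sets, double budgetHours) {
    const InputSet* best = nullptr;
    for (const auto& s : sets)
        if (s.simHours <= budgetHours && (!best || s.error < best->error))
            best = &s;
    return best;
}

int main() {
    std::vector<InputSet> sets = {
        {"simsmall", 0.20, 2.0}, {"simmedium", 0.08, 8.0},
        {"simlarge", 0.03, 30.0}};
    if (const InputSet* s = choose(sets, /*budgetHours=*/10.0))
        std::printf("use %s (error %.2f)\n", s->name, s->error);
}
```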
{"title":"Scaling of the PARSEC benchmark inputs","authors":"Christian Bienia, Kai Li","doi":"10.1145/1854273.1854352","DOIUrl":"https://doi.org/10.1145/1854273.1854352","url":null,"abstract":"A good benchmark suite should provide users with inputs that have multiple levels of fidelity. We present a framework that takes the novel view that benchmark inputs should be considered approximations of their original, full-sized inputs. The paper demonstrates how to use the proposed methodology to create several simulation input sets for the PARSEC benchmarks and how to quantify and measure their approximation error. We offer guidelines that PARSEC users can use to choose suitable simulation inputs for their scientific studies in a way that maximizes the accuracy of the simulation subject to a time constraint.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133456964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
AM++: A generalized active message framework
Jeremiah Willcock, T. Hoefler, N. Edmonds, A. Lumsdaine
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-level transport layers. However, most active message frameworks have low-level programming interfaces that require significant programming effort to use directly in applications and that also prevent optimization opportunities. In this paper we present AM++, a new user-level library for active messages based on generic programming techniques. Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures. Runtime optimizations, such as message combining and filtering, are also provided by the library, removing the need to implement that functionality at the application level. Evaluation of AM++ with distributed-memory graph algorithms shows the usability benefits provided by these library features, as well as their performance advantages.
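A hedged sketch of the central idea — typed message handlers executed over a buffered batch in an explicit loop — is shown below. The class and method names are invented for illustration and are not the AM++ API.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// Minimal active-message sketch: a typed handler runs over a buffered
// batch of received messages in an explicit loop.
template <typename Msg>
class MessageQueue {
public:
    explicit MessageQueue(std::function<void(const Msg&)> handler)
        : handler_(std::move(handler)) {}

    void send(const Msg& m) { inbox_.push_back(m); }  // stand-in for transport

    // The handler runs in a plain loop, so the compiler can inline,
    // vectorize, or parallelize it; AM++ exploits exactly this structure,
    // and batching also enables message combining and filtering.
    void poll() {
        for (const Msg& m : inbox_) handler_(m);
        inbox_.clear();
    }

private:
    std::function<void(const Msg&)> handler_;
    std::vector<Msg> inbox_;
};

int main() {
    MessageQueue<int> q([](const int& v) { std::printf("visit vertex %d\n", v); });
    q.send(7);
    q.send(42);
    q.poll();
}
```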
{"title":"AM++: A generalized active message framework","authors":"Jeremiah Willcock, T. Hoefler, N. Edmonds, A. Lumsdaine","doi":"10.1145/1854273.1854323","DOIUrl":"https://doi.org/10.1145/1854273.1854323","url":null,"abstract":"Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-level transport layers. However, most active message frameworks have low-level programming interfaces that require significant programming effort to use directly in applications and that also prevent optimization opportunities. In this paper we present AM++, a new user-level library for active messages based on generic programming techniques. Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures. Runtime optimizations, such as message combining and filtering, are also provided by the library, removing the need to implement that functionality at the application level. Evaluation of AM++ with distributed-memory graph algorithms shows the usability benefits provided by these library features, as well as their performance advantages.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130521232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 70
Partitioning streaming parallelism for multi-cores: A machine learning based approach
Zheng Wang, M. O’Boyle
Stream based languages are a popular approach to expressing parallelism in modern applications. The efficient mapping of streaming parallelism to multi-core processors is, however, highly dependent on the program and underlying architecture. We address this by developing a portable, automatic, compiler-based approach to partitioning streaming programs using machine learning. Our technique predicts the ideal partition structure for a given streaming application using prior knowledge learned off-line. Using the predictor we rapidly search the program space (without executing any code) to generate and select a good partition. We applied this technique to standard StreamIt applications and compared against existing approaches. On a 4-core platform, our approach achieves 60% of the best performance found by iteratively compiling and executing over 3000 different partitions per program. We obtain, on average, a 1.90× speedup over the already tuned partitioning scheme of the StreamIt compiler. When compared against a state-of-the-art analytical, model-based approach, we achieve, on average, a 1.77× performance improvement. By porting our approach to an 8-core platform, we obtain a 1.8× improvement over the StreamIt default scheme, demonstrating the portability of our approach.
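The search strategy can be sketched as: extract features from each candidate partition, score it with a previously trained predictor, and keep the best — all without running the program. The feature set, the linear predictor, and its weights below are invented stand-ins for the learned model.

```cpp
#include <cstdio>
#include <vector>

// Candidate partition of a stream graph, summarized by invented features.
struct Partition {
    int id;
    double features[2];  // e.g., {computation balance, communication volume}
};

// Stand-in for the trained predictor: a linear model with made-up weights.
// The paper learns this mapping off-line from prior programs.
double predictSpeedup(const Partition& p) {
    const double w[2] = {3.0, -1.5}, bias = 1.0;
    return bias + w[0] * p.features[0] + w[1] * p.features[1];
}

// Search the partition space without executing any code: score every
// candidate with the predictor and keep the best.
const Partition* pickBest(const std::vector<Partition>& space) {
    const Partition* best = nullptr;
    double bestScore = -1e300;
    for (const auto& p : space) {
        double s = predictSpeedup(p);
        if (s > bestScore) { bestScore = s; best = &p; }
    }
    return best;
}

int main() {
    std::vector<Partition> space = {
        {0, {0.9, 0.4}}, {1, {0.7, 0.1}}, {2, {0.95, 0.8}}};
    std::printf("best partition: %d\n", pickBest(space)->id);
}
```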
{"title":"Partitioning streaming parallelism for multi-cores: A machine learning based approach","authors":"Zheng Wang, M. O’Boyle","doi":"10.1145/1854273.1854313","DOIUrl":"https://doi.org/10.1145/1854273.1854313","url":null,"abstract":"Stream based languages are a popular approach to expressing parallelism in modern applications. The efficient mapping of streaming parallelism to multi-core processors is, however, highly dependent on the program and underlying architecture. We address this by developing a portable and automatic compiler-based approach to partitioning streaming programs using machine learning. Our technique predicts the ideal partition structure for a given streaming application using prior knowledge learned off-line. Using the predictor we rapidly search the program space (without executing any code) to generate and select a good partition. We applied this technique to standard StreamIt applications and compared against existing approaches. On a 4-core platform, our approach achieves 60% of the best performance found by iteratively compiling and executing over 3000 different partitions per program. We obtain, on average, a 1.90x speedup over the already tuned partitioning scheme of the StreamIt compiler. When compared against a state-of-the-art analytical, model-based approach, we achieve, on average, a 1.77x performance improvement. By porting our approach to a 8-core platform, we are able to obtain 1.8x improvement over the StreamIt default scheme, demonstrating the portability of our approach.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122404802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 118
A model for fusion and code motion in an automatic parallelizing compiler
Uday Bondhugula, O. Günlük, S. Dash, Lakshminarayanan Renganarayanan
Loop fusion has been studied extensively, but in a manner isolated from other transformations. This was mainly due to the lack of a powerful intermediate representation for applying compositions of high-level transformations. Fusion interacts strongly with parallelism and locality. Currently, no models exist to determine good fusion structures integrated with all components of an auto-parallelizing compiler. This is also one of the reasons why the full benefits of optimization and automatic parallelization of long sequences of loop nests spanning hundreds of lines of code have never been explored. We present a fusion model in an integrated automatic parallelization framework that simultaneously optimizes for hardware prefetch stream buffer utilization, locality, and parallelism. Characterizing the legal space of fusion structures in the polyhedral compiler framework is not difficult. However, incorporating useful optimization criteria into such a legal space to pick good fusion structures is very hard. The model we propose captures utilization of hardware prefetch streams, loss of parallelism, and the constraints imposed by privatization and code expansion in a single convex optimization space. The model scales very well to program sections spanning hundreds of lines of code. It has been implemented in the polyhedral pass of the IBM XL optimizing compiler. Experimental results demonstrate its effectiveness in finding good fusion structures for codes including SPEC benchmarks and large applications. An improvement ranging from 5% to nearly 2.75× is obtained over the current production compiler optimizer on these benchmarks.
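A toy version of such a cost model might weigh prefetch-stream overflow, lost parallelism, and code expansion for each candidate fusion structure, as in the sketch below. The terms and coefficients are invented; the paper formulates the real model as a single convex optimization problem solved within the polyhedral framework.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy cost for one candidate fusion structure. The three terms only
// mirror the abstract's ingredients (prefetch stream pressure, lost
// parallelism, code expansion); the weights are made up.
struct Fusion {
    const char* name;
    int prefetchStreams;    // streams the fused loop nest would use
    double parallelismLoss;
    double codeExpansion;
};

double cost(const Fusion& f, int hwStreams) {
    // Exceeding the hardware's prefetch stream budget is penalized hard.
    double overflow = std::max(0, f.prefetchStreams - hwStreams);
    return 4.0 * overflow + 2.0 * f.parallelismLoss + 1.0 * f.codeExpansion;
}

int main() {
    std::vector<Fusion> candidates = {
        {"fuse-all", 12, 0.0, 1.5}, {"fuse-pairs", 6, 0.3, 1.1},
        {"no-fusion", 3, 1.0, 1.0}};
    const int hwStreams = 8;  // invented hardware prefetch stream budget
    const Fusion* best = &candidates[0];
    for (const auto& f : candidates)
        if (cost(f, hwStreams) < cost(*best, hwStreams)) best = &f;
    std::printf("cheapest fusion structure: %s\n", best->name);
}
```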
{"title":"A model for fusion and code motion in an automatic parallelizing compiler","authors":"Uday Bondhugula, O. Günlük, S. Dash, Lakshminarayanan Renganarayanan","doi":"10.1145/1854273.1854317","DOIUrl":"https://doi.org/10.1145/1854273.1854317","url":null,"abstract":"Loop fusion has been studied extensively, but in a manner isolated from other transformations. This was mainly due to the lack of a powerful intermediate representation for application of compositions of high-level transformations. Fusion presents strong interactions with parallelism and locality. Currently, there exist no models to determine good fusion structures integrated with all components of an auto-parallelizing compiler. This is also one of the reasons why all the benefits of optimization and automatic parallelization of long sequences of loop nests spanning hundreds of lines of code have never been explored. We present a fusion model in an integrated automatic parallelization framework that simultaneously optimizes for hardware prefetch stream buffer utilization, locality, and parallelism. Characterizing the legal space of fusion structures in the polyhedral compiler framework is not difficult. However, incorporating useful optimization criteria into such a legal space to pick good fusion structures is very hard. The model we propose captures utilization of hardware prefetch streams, loss of parallelism, as well as constraints imposed by privatization and code expansion into a single convex optimization space. The model scales very well to program sections spanning hundreds of lines of code. It has been implemented into the polyhedral pass of the IBM XL optimizing compiler. Experimental results demonstrate its effectiveness in finding good fusion structures for codes including SPEC benchmarks and large applications. An improvement ranging from 5% to nearly a factor of 2.75× is obtained over the current production compiler optimizer on these benchmarks.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127501845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 64
DMATiler: Revisiting loop tiling for direct memory access
Haibo Lin, Tao Liu, Huoding Li, Tong Chen, Lakshminarayanan Renganarayanan, K. O'Brien, Ling Shao
In this paper we present the design and implementation of DMATiler, which combines compiler analysis and runtime management to optimize local memory performance. In traditional cache-model-based loop tiling optimizations, the compiler approximates runtime cache misses as the number of distinct cache lines touched by a loop nest. In contrast, DMATiler has full control of the addresses, sizes, and sequences of data transfers. DMATiler uses a simplified DMA performance model to formulate the cost model for DMA-tiled loop nests, then solves it using a custom gradient descent algorithm with heuristics guided by DMA characteristics. Given a loop nest, DMATiler uses loop interchange to make the loop order friendlier for data movement. Moreover, DMATiler applies compressed data buffers and advanced DMA commands to further optimize data transfers. We have implemented DMATiler in IBM XL C/C++ for Multi-core Acceleration for Linux, and have conducted experiments with a set of loop nest benchmarks. The results show DMATiler is much more efficient than software-controlled cache (average speedup of 9.8×) and single-level loop blocking (average speedup of 6.2×) on the Cell BE processor.
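The shape of the cost model and its solution can be sketched as follows: each DMA transfer pays a fixed setup latency plus bytes over bandwidth, so larger tiles amortize setup until the local store's capacity is hit. All constants below are invented, and the simple scan over power-of-two tile sizes stands in for the paper's custom gradient descent.

```cpp
#include <cstdio>

// Simplified DMA cost model in the spirit of the abstract: each transfer
// pays a fixed setup cost, plus bytes moved over a fixed bandwidth.
// All constants are invented for illustration.
const double kSetupUs = 0.5;          // per-DMA setup cost (us)
const double kBytesPerUs = 25600.0;   // sustained DMA bandwidth
const int    kLocalMemBytes = 64 * 1024;

double dmaCost(int totalBytes, int tileBytes) {
    int transfers = (totalBytes + tileBytes - 1) / tileBytes;
    return transfers * kSetupUs + totalBytes / kBytesPerUs;
}

int main() {
    const int totalBytes = 1 << 20;
    // Discrete "descent" over power-of-two tile sizes, subject to the
    // local-store capacity constraint; the paper uses a custom gradient
    // descent guided by DMA-specific heuristics.
    int bestTile = 1024;
    for (int tile = 1024; tile <= kLocalMemBytes; tile *= 2)
        if (dmaCost(totalBytes, tile) < dmaCost(totalBytes, bestTile))
            bestTile = tile;
    std::printf("tile = %d bytes, cost = %.1f us\n",
                bestTile, dmaCost(totalBytes, bestTile));
}
```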
{"title":"DMATiler: Revisiting loop tiling for direct memory access","authors":"Haibo Lin, Tao Liu, Huoding Li, Tong Chen, Lakshminarayanan Renganarayanan, K. O'Brien, Ling Shao","doi":"10.1145/1854273.1854351","DOIUrl":"https://doi.org/10.1145/1854273.1854351","url":null,"abstract":"In this paper we present the design and implementation of a DMATiler which combines compiler analysis and runtime management to optimize local memory performance. In traditional cache model based loop tiling optimizations, the compiler approximates runtime cache misses as the number of distinct cache lines touched by a loop nest. In contrast, the DMATiler has the full control of the addresses, sizes, and sequences of data transfers. DMATiler uses a simplified DMA performance model to formulate the cost model for DMA-tiled loop nests, then solves it using a custom gradient descent algorithm with heuristics guided by DMA characteristics. Given a loop nest, DMATiler uses loop interchange to make the loop order more friendlier for data movements. Moreover, DMATiler applies compressed data buffer and advanced DMA command to further optimize data transfers. We have implemented the DMATiler in the IBM XL C/C++ for Multi-core Acceleration for Linux, and have conducted experiments with a set of loop nest benchmarks. The results show DMATiler is much more efficient than software controlled cache (average speedup of 9.8x) and single level loop blocking (average speedup of 6.2x) on the Cell BE processor.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127505836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Using dead blocks as a virtual victim cache
S. Khan, Daniel A. Jiménez, D. Burger, B. Falsafi
Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing the miss rate, or alternatively, improve power and energy by allowing a smaller cache with the same miss rate.
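The notion of dead time can be made concrete with a tiny LRU cache simulation that measures, for each block, the interval between its last access and its eviction. The trace, associativity, and accounting below are invented for illustration.

```cpp
#include <cstdio>
#include <list>
#include <vector>

// Measure "dead time" in a tiny fully-associative LRU cache: the fraction
// of a block's residency that elapses after its final access, while it is
// merely waiting to be evicted.
struct Line { int tag; long lastUse, fillTime; };

int main() {
    const int kWays = 4;
    std::vector<int> trace = {1, 2, 3, 1, 4, 5, 1, 6, 7, 8, 9, 1};
    std::list<Line> cache;  // front = MRU, back = LRU
    long deadTime = 0, residency = 0, t = 0;

    auto evict = [&](const Line& l) {
        deadTime += t - l.lastUse;      // time spent dead
        residency += t - l.fillTime;    // total time in cache
    };

    for (int tag : trace) {
        ++t;
        auto it = cache.begin();
        for (; it != cache.end(); ++it) if (it->tag == tag) break;
        if (it != cache.end()) {        // hit: refresh lastUse, move to MRU
            it->lastUse = t;
            cache.splice(cache.begin(), cache, it);
        } else {                        // miss: evict LRU if full, then fill
            if ((int)cache.size() == kWays) { evict(cache.back()); cache.pop_back(); }
            cache.push_front({tag, t, t});
        }
    }
    for (const Line& l : cache) evict(l);  // flush remaining lines
    std::printf("dead fraction: %.0f%%\n", 100.0 * deadTime / residency);
}
```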
{"title":"Using dead blocks as a virtual victim cache","authors":"S. Khan, Daniel A. Jiménez, D. Burger, B. Falsafi","doi":"10.1145/1854273.1854333","DOIUrl":"https://doi.org/10.1145/1854273.1854333","url":null,"abstract":"Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing miss rate, or alternately, improve power and energy by allowing a smaller cache with the same miss rate.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116844969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 84
StatCC: A statistical cache contention model
David Eklov, D. Black-Schaffer, Erik Hagersten
Chip multiprocessor (CMP) architectures sharing on-chip resources, such as last-level caches, have recently become a mainstream computing platform. The performance of such systems can vary greatly depending on how co-scheduled applications compete for these shared resources. This work presents StatCC, a simple and efficient model for estimating the contention for shared cache resources between co-scheduled applications on chip multiprocessor architectures.
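One way to picture such a model: compose the co-scheduled applications' solo miss-ratio curves through a sharing rule and iterate to a fixed point. The curves and the miss-rate-proportional sharing rule below are invented stand-ins; StatCC itself builds its estimate from reuse-distance profiles and a CPI model.

```cpp
#include <cmath>
#include <cstdio>

// Invented solo miss-ratio curves: miss ratio as a function of the
// effective cache share (in MB) an application receives.
double missRatioA(double mb) { return 0.30 * std::exp(-mb / 1.0); }
double missRatioB(double mb) { return 0.10 * std::exp(-mb / 4.0); }

int main() {
    const double cacheMB = 2.0;
    double shareA = cacheMB / 2;  // initial guess: even split
    for (int i = 0; i < 50; ++i) {
        double mA = missRatioA(shareA), mB = missRatioB(cacheMB - shareA);
        // Heuristic sharing rule: an app's share grows with its relative
        // insertion (miss) rate, iterated to a fixed point.
        shareA = cacheMB * mA / (mA + mB);
    }
    std::printf("A: %.2f MB (miss %.3f), B: %.2f MB (miss %.3f)\n",
                shareA, missRatioA(shareA),
                cacheMB - shareA, missRatioB(cacheMB - shareA));
}
```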
{"title":"StatCC: A statistical cache contention model","authors":"David Eklov, D. Black-Schaffer, Erik Hagersten","doi":"10.1145/1854273.1854347","DOIUrl":"https://doi.org/10.1145/1854273.1854347","url":null,"abstract":"Chip multiprocessor (CMP) architectures sharing on chip resources, such as last-level caches, have recently become a mainstream computing platform. The performance of such systems can vary greatly depending on how co-scheduled applications compete for these shared resources. This work presents StatCC, a simple and efficient model for estimating the contention for shared cache resources between co-scheduled applications on chip multiprocessor architectures.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127652402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Reducing task creation and termination overhead in explicitly parallel programs
Jisheng Zhao, J. Shirako, V. K. Nandivada, Vivek Sarkar
There has been a proliferation of task-parallel programming systems to address the requirements of multicore programmers. Current production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .Net Task Parallel Library, and OpenMP 3.0, and current research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). It is desirable for the programmer to express all the parallelism intrinsic to their algorithm in their code for forward scalability and portability, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total overhead incurred by a program due to excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall and next statements. Our approach includes elimination of redundant task creation and termination operations as well as strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP, and a quad-socket 32-way Power7 SMP. The results showed maximum speedups of 66.7×, 11.25×, and 23.1× respectively on the three platforms, and geometric-mean performance improvements of 4.6×, 2.1×, and 6.4× respectively relative to non-optimized parallel codes. The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. However, even for the medium-grained parallel benchmarks studied in this paper, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations introduced in this paper.
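The finish-to-next strength reduction has a natural C++ analog: instead of spawning and joining worker tasks on every step of an outer loop, spawn once and synchronize steps with a lightweight barrier. The sketch below (requiring C++20 for std::barrier) illustrates the idea only; the paper's transformations operate on HJ's finish/forall/next constructs.

```cpp
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

// Analog of the paper's finish -> next strength reduction: rather than
// spawning and joining T tasks on every outer iteration (task creation +
// termination each step), spawn once and synchronize each step with a
// lightweight barrier. Sizes are invented for the example.
int main() {
    const int T = 4, steps = 3;
    std::barrier sync(T);
    std::vector<std::thread> workers;
    for (int t = 0; t < T; ++t)
        workers.emplace_back([&, t] {
            for (int s = 0; s < steps; ++s) {
                std::printf("worker %d, step %d\n", t, s);  // per-step work
                sync.arrive_and_wait();  // "next": barrier, not join/respawn
            }
        });
    for (auto& w : workers) w.join();  // single "finish" at the very end
}
```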
{"title":"Reducing task creation and termination overhead in explicitly parallel programs","authors":"Jisheng Zhao, J. Shirako, V. K. Nandivada, Vivek Sarkar","doi":"10.1145/1854273.1854298","DOIUrl":"https://doi.org/10.1145/1854273.1854298","url":null,"abstract":"There has been a proliferation of task-parallel programming systems to address the requirements of multicore programmers. Current production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .Net Task Parallel Library, OpenMP 3.0, and current research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). It is desirable for the programmer to express all the parallelism intrinsic to their algorithm in their code for forward scalability and portability, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total amount of overhead incurred by a program due to excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall and next statements. Our approach includes elimination of redundant task creation and termination operations as well as strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP and a quad-socket 32-way Power7 SMP. The results showed maximum speedup of 66.7×, 11.25× and 23.1× respectively on each platform and 4.6×, 2.1× and 6.4×performance improvements respectively in geometric mean related to non-optimized parallel codes. The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. However, even for the medium-grained parallel benchmarks studied in this paper, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations introduced in this paper.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114216014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
G. Diamos, Andrew Kerr, S. Yalamanchili, Nathan Clark
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain specific applications. This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
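The execution-model mapping at the core of such a translator — turning a data-parallel kernel into explicit loops over blocks and threads — can be sketched as below. This illustrates the idea only; Ocelot performs the equivalent transformation on PTX through the LLVM code generator, not on C++ source, and the kernel shown is a generic SAXPY example.

```cpp
#include <cstdio>
#include <vector>

// Sketch of the "explicit loop over threads" a PTX-to-CPU translator
// effectively produces: a data-parallel kernel body wrapped in serial
// loops over blocks and threads.
struct Dim { int x; };

void saxpyKernel(Dim blockIdx, Dim threadIdx, Dim blockDim,
                 float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 8;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    Dim grid{2}, block{4};
    for (int b = 0; b < grid.x; ++b)          // serialized block loop
        for (int t = 0; t < block.x; ++t)     // serialized thread loop
            saxpyKernel({b}, {t}, block, 3.0f, x.data(), y.data(), n);
    std::printf("y[0] = %.1f\n", y[0]);       // expect 5.0
}
```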
{"title":"Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems","authors":"G. Diamos, Andrew Kerr, S. Yalamanchili, Nathan Clark","doi":"10.1145/1854273.1854318","DOIUrl":"https://doi.org/10.1145/1854273.1854318","url":null,"abstract":"Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain specific applications. This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122078186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 229