
Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation: Latest Publications

Data transformations for eliminating conflict misses
Gabriel Rivera, C. Tseng
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occurring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PADLITE only uses array and column dimension sizes, relying on assumptions about common array reference patterns. PAD analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PADLITE can eliminate conflicts for benchmarks, but PAD is more effective over a range of cache and problem sizes. Padding reduces cache miss rates by 16% on average for a 16K direct-mapped cache. Execution times are reduced by 6% on average, with some SPEC95 programs improving up to 15%.
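As a rough illustration of the kind of conflict the paper targets (not the PAD or PADLITE algorithms themselves), the C sketch below maps addresses onto a hypothetical 16K direct-mapped cache. In this row-major C sketch the conflicting references are one row apart, the analogue of adjacent columns in the Fortran codes the paper studies; the cache geometry and the pad amount are illustrative assumptions.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache geometry, chosen only for illustration. */
#define CACHE_BYTES (16 * 1024)              /* 16K direct-mapped */
#define LINE_BYTES  32
#define NUM_SETS    (CACHE_BYTES / LINE_BYTES)

/* Direct-mapped placement: which set does a byte address fall into? */
static int cache_set(const void *p) {
    return (int)(((uintptr_t)p / LINE_BYTES) % NUM_SETS);
}

int main(void) {
    enum { N = 2048, PAD = 8 };              /* 2048 doubles = exactly 16K */
    static double plain[4][N];               /* row stride == cache size */
    static double padded[4][N + PAD];        /* intra-variable padding */

    /* plain[0][0] and plain[1][0] are 16K apart: same set on every iteration. */
    printf("plain : set %d vs set %d\n",
           cache_set(&plain[0][0]), cache_set(&plain[1][0]));
    /* The pad shifts the second row by 64 bytes: the conflict disappears. */
    printf("padded: set %d vs set %d\n",
           cache_set(&padded[0][0]), cache_set(&padded[1][0]));
    return 0;
}
```

A compiler pass like PAD would choose the pad amount automatically from the detected conflict distances; here it is fixed by hand.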
{"title":"Data transformations for eliminating conflict misses","authors":"Gabriel Rivera, C. Tseng","doi":"10.1145/277650.277661","DOIUrl":"https://doi.org/10.1145/277650.277661","url":null,"abstract":"Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PADLITE only uses array and column dimension sizes, relying on assumptions about common array reference patterns. PAD analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PADLITE can eliminate conflicts for benchmarks, but PAD is more effective over a range of cache and problem sizes. Padding reduces cache miss rates by 16% on average for a 16K direct-mapped cache. Execution times are reduced by 6% on average, with some SPEC95 programs improving up to 15%.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"86 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127991351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 264
Using integer sets for data-parallel program analysis and optimization
Vikram S. Adve, J. Mellor-Crummey
In this paper, we describe our experience with using an abstract integer-set framework to develop the Rice dHPF compiler, a compiler for High Performance Fortran. We present simple, yet general formulations of the major computation partitioning and communication analysis tasks as well as a number of important optimizations in terms of abstract operations on sets of integer tuples. This approach has made it possible to implement a comprehensive collection of advanced optimizations in dHPF, and to do so in the context of a more general computation partitioning model than previous compilers. One potential limitation of the approach is that the underlying class of integer set problems is fundamentally unable to represent HPF data distributions on a symbolic number of processors. We describe how we extend the approach to compile codes for a symbolic number of processors, without requiring any changes to the set formulations for the above optimizations. We show experimentally that the set representation is not a dominant factor in compile times on both small and large codes. Finally, we present preliminary performance measurements to show that the generated code achieves good speedups for a few benchmarks. Overall, we believe we are the first to demonstrate by implementation experience that it is practical to build a compiler for HPF using a general and powerful integer-set framework.
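As a loose illustration of the flavor of these integer-set computations (this is not the dHPF formulation; the CYCLIC distribution and owner-computes rule are assumptions made for the example), the sketch below derives the iterations a processor owns as a strided integer set {i in [lb, ub] : i mod P == p}.

```c
#include <stdio.h>

/* A strided integer set: { lo, lo+step, ... } intersected with [lo, hi];
   the set is empty when lo > hi.                                        */
typedef struct { long lo, hi, step; } iset_t;

/* Owner-computes partition for a CYCLIC(1) distribution over P processors:
   processor p owns { i in [lb, ub] : i mod P == p }.                      */
static iset_t local_iterations(long lb, long ub, long P, long p) {
    long r = ((lb % P) + P) % P;                   /* owner of iteration lb      */
    long first = lb + (((p - r) % P) + P) % P;     /* first iteration owned by p */
    iset_t s = { first, ub, P };
    return s;
}

int main(void) {
    /* Loop i = 0..9 distributed cyclically over 4 processors. */
    for (long p = 0; p < 4; p++) {
        iset_t s = local_iterations(0, 9, 4, p);
        printf("processor %ld:", p);
        for (long i = s.lo; i <= s.hi; i += s.step)
            printf(" %ld", i);
        printf("\n");
    }
    return 0;
}
```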
{"title":"Using integer sets for data-parallel program analysis and optimization","authors":"Vikram S. Adve, J. Mellor-Crummey","doi":"10.1145/277650.277721","DOIUrl":"https://doi.org/10.1145/277650.277721","url":null,"abstract":"In this paper, we describe our experience with using an abstract integer-set framework to develop the Rice dHPF compiler, a compiler for High Performance Fortran. We present simple, yet general formulations of the major computation partitioning and communication analysis tasks as well as a number of important optimizations in terms of abstract operations on sets of integer tuples. This approach has made it possible to implement a comprehensive collection of advanced optimizations in dHPF, and to do so in the context of a more general computation partitioning model than previous compilers. One potential limitation of the approach is that the underlying class of integer set problems is fundamentally unable to represent HPF data distributions on a symbolic number of processors. We describe how we extend the approach to compile codes for a symbolic number of processors, without requiring any changes to the set formulations for the above optimizations. We show experimentally that the set representation is not a dominant factor in compile times on both small and large codes. Finally, we present preliminary performance measurements to show that the generated code achieves good speedups for a few benchmarks. Overall, we believe we are the first to demonstrate by implementation experience that it is practical to build a compiler for HPF using a general and powerful integer-set framework.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129487057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 75
Improving data-flow analysis with path profiles
Glenn Ammons, J. Larus
Data-flow analysis computes its solutions over the paths in a control-flow graph. These paths---whether feasible or infeasible, heavily or rarely executed---contribute equally to a solution. However, programs execute only a small fraction of their potential paths and, moreover, programs' execution time and cost are concentrated in a far smaller subset of hot paths. This paper describes a new approach to analyzing and optimizing programs, which improves the precision of data flow analysis along hot paths. Our technique identifies and duplicates hot paths, creating a hot path graph in which these paths are isolated. After flow analysis, the graph is reduced to eliminate unnecessary duplicates of unprofitable paths. In experiments on SPEC95 benchmarks, path qualification identified 2--112 times more non-local constants (weighted dynamically) than the Wegman-Zadek conditional constant algorithm, which translated into 1--7% more dynamic instructions with constant results.
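A hand-worked example of the effect (the function, the profile numbers, and the names are invented for illustration): along the hot path `x` is the constant 4, but because the cold path also reaches the multiply, an ordinary conditional-constant-propagation pass cannot fold it; duplicating the hot path gives it a private copy of the merge point where the constant survives.

```c
/* Hypothetical example: mode FAST is taken on ~95% of executions (per profile). */
enum mode { FAST, SLOW };
struct ctx { enum mode mode; int factor; };

/* Before: x is constant only along the hot path, so x * y cannot be folded. */
int scale(const struct ctx *c, int y) {
    int x;
    if (c->mode == FAST)
        x = 4;                 /* hot path */
    else
        x = c->factor;         /* cold path kills the constant at the merge */
    return x * y;
}

/* After duplicating the hot path: its copy of the merge point sees x == 4, */
/* so the multiply has a constant operand; the cold path is unchanged.      */
int scale_dup(const struct ctx *c, int y) {
    if (c->mode == FAST)
        return 4 * y;          /* non-local constant exposed on the hot path */
    return c->factor * y;
}
```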
{"title":"Improving data-flow analysis with path profiles","authors":"Glenn Ammons, J. Larus","doi":"10.1145/277650.277665","DOIUrl":"https://doi.org/10.1145/277650.277665","url":null,"abstract":"Data-flow analysis computes its solutions over the paths in a control-flow graph. These paths---whether feasible or infeasible, heavily or rarely executed---contribute equally to a solution. However, programs execute only a small fraction of their potential paths and, moreover, programs' execution time and cost is concentrated in a far smaller subset of hot paths.This paper describes a new approach to analyzing and optimizing programs, which improves the precision of data flow analysis along hot paths. Our technique identifies and duplicates hot paths, creating a hot path graph in which these paths are isolated. After flow analysis, the graph is reduced to eliminate unnecessary duplicates of unprofitable paths. In experiments on SPEC95 benchmarks, path qualification identified 2--112 times more non-local constants (weighted dynamically) than the Wegman-Zadek conditional constant algorithm, which translated into 1--7% more dynamic instructions with constant results.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132085760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 112
Automatically closing open reactive programs
Christopher Colby, Patrice Godefroid, L. Jagadeesan
We study in this paper the problem of analyzing implementations of open systems --- systems in which only some of the components are present. We present an algorithm for automatically closing an open concurrent reactive system with its most general environment, i.e., the environment that can provide any input at any time to the system. The result is a nondeterministic closed (i.e., self-executable) system which can exhibit all the possible reactive behaviors of the original open system. These behaviors can then be analyzed using VeriSoft, an existing tool for systematically exploring the state spaces of closed systems composed of multiple (possibly nondeterministic) processes executing arbitrary code. We have implemented the techniques introduced in this paper in a prototype tool for automatically closing open programs written in the C programming language. We discuss preliminary experimental results obtained with a large telephone-switching software application developed at Lucent Technologies.
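A minimal sketch of the closing idea, with everything here hypothetical except the concept: the open component consumes events, and the generated environment may hand it any event (or none) at every step. The `choose()` primitive stands in for the nondeterministic choice a systematic explorer such as VeriSoft would enumerate; here it is stubbed with `rand()` so the sketch runs.

```c
#include <stdio.h>
#include <stdlib.h>

/* The open component: a toy reactive system that consumes one event per step. */
enum event { EV_OFFHOOK, EV_DIGIT, EV_ONHOOK, EV_NONE };

static void component_step(enum event e) {       /* stand-in for the real code */
    static int offhook = 0;
    if (e == EV_OFFHOOK) offhook = 1;
    if (e == EV_ONHOOK)  offhook = 0;
    if (e == EV_DIGIT && !offhook)
        printf("digit while on-hook: a behavior worth checking\n");
}

/* Nondeterministic choice in [0, n). A state-space explorer would enumerate
   every outcome; this stub samples one so the sketch is self-executable.    */
static int choose(int n) { return rand() % n; }

/* The closing transformation: the most general environment may supply any
   input, or no input, at every step, yielding a closed nondeterministic    */
/* system that exhibits all reactive behaviors of the open component.       */
static void closed_system(int steps) {
    static const enum event any[] = { EV_OFFHOOK, EV_DIGIT, EV_ONHOOK, EV_NONE };
    for (int i = 0; i < steps; i++)
        component_step(any[choose(4)]);
}

int main(void) { closed_system(100); return 0; }
```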
{"title":"Automatically closing open reactive programs","authors":"Christopher Colby, Patrice Godefroid, L. Jagadeesan","doi":"10.1145/277650.277754","DOIUrl":"https://doi.org/10.1145/277650.277754","url":null,"abstract":"We study in this paper the problem of analyzing implementations of open systems --- systems in which only some of the components are present. We present an algorithm for automatically closing an open concurrent reactive system with its most general environment, i.e., the environment that can provide any input at any time to the system. The result is a nondeterministic closed (i.e., self-executable) system which can exhibit all the possible reactive behaviors of the original open system. These behaviors can then be analyzed using VeriSoft, an existing tool for systematically exploring the state spaces of closed systems composed of multiple (possibly nondeterministic) processes executing arbitrary code. We have implemented the techniques introduced in this paper in a prototype tool for automatically closing open programs written in the C programming language. We discuss preliminary experimental results obtained with a large telephone-switching software application developed at Lucent Technologies.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121474495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Complete removal of redundant expressions
R. Bodík, R. Gupta, M. Soffa
Partial redundancy elimination (PRE), the most important component of global optimizers, generalizes the removal of common subexpressions and loop-invariant computations. Because existing PRE implementations are based on code motion, they fail to completely remove the redundancies. In fact, we observed that 73% of loop-invariant statements cannot be eliminated from loops by code motion alone. In dynamic terms, traditional PRE eliminates only half of redundancies that are strictly partial. To achieve a complete PRE, control flow restructuring must be applied. However, the resulting code duplication may cause code size explosion. This paper focuses on achieving a complete PRE while incurring an acceptable code growth. First, we present an algorithm for complete removal of partial redundancies, based on the integration of code motion and control flow restructuring. In contrast to existing complete techniques, we resort to restructuring merely to remove obstacles to code motion, rather than to carry out the actual optimization. Guiding the optimization with a profile enables additional code growth reduction through selecting those duplications whose cost is justified by sufficient execution-time gains. The paper develops two methods for determining the optimization benefit of restructuring a program region, one based on path-profiles and the other on data-flow frequency analysis. Furthermore, the abstraction underlying the new PRE algorithm enables a simple formulation of speculative code motion guaranteed to have positive dynamic improvements. Finally, we show how to balance the three transformations (code motion, restructuring, and speculation) to achieve a near-complete PRE with very little code growth. We also present algorithms for efficiently computing dynamic benefits. In particular, using an elimination-style data-flow framework, we derive a demand-driven frequency analyzer whose cost can be controlled by permitting a bounded degree of conservative imprecision in the solution.
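A hand-worked illustration (not the paper's algorithm) of why code motion alone gets stuck and what restructuring buys: in `before`, `a + b` is loop-invariant but only needed on iterations where the flag holds, so hoisting it above the loop would compute it speculatively on executions that never need it; splitting the loop at the first iteration that needs the value lets the expression be evaluated exactly once.

```c
/* Before: a + b is redundant across iterations, but code motion cannot hoist
   it out of the loop without computing it on paths that never need it.      */
int before(int a, int b, const int *flag, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (flag[i]) s += a + b;     /* strictly partial redundancy */
        else         s += 1;
    }
    return s;
}

/* After restructuring: the loop is split at the first iteration that needs
   a + b, so the expression is evaluated at most once and never speculatively. */
int after(int a, int b, const int *flag, int n) {
    int s = 0, i = 0;
    for (; i < n && !flag[i]; i++)   /* region before the first use */
        s += 1;
    if (i < n) {
        int t = a + b;               /* single evaluation */
        for (; i < n; i++)
            s += flag[i] ? t : 1;
    }
    return s;
}
```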
{"title":"Complete removal of redundant expressions","authors":"R. Bodík, R. Gupta, M. Soffa","doi":"10.1145/277650.277653","DOIUrl":"https://doi.org/10.1145/277650.277653","url":null,"abstract":"Partial redundancy elimination (PRE), the most important component of global optimizers, generalizes the removal of common subexpressions and loop-invariant computations. Because existing PRE implementations are based on code motion, they fail to completely remove the redundancies. In fact, we observed that 73% of loop-invariant statements cannot be eliminated from loops by code motion alone. In dynamic terms, traditional PRE eliminates only half of redundancies that are strictly partial. To achieve a complete PRE, control flow restructuring must be applied. However, the resulting code duplication may cause code size explosion.This paper focuses on achieving a complete PRE while incurring an acceptable code growth. First, we present an algorithm for complete removal of partial redundancies, based on the integration of code motion and control flow restructuring. In contrast to existing complete techniques, we resort to restructuring merely to remove obstacles to code motion, rather than to carry out the actual optimization.Guiding the optimization with a profile enables additional code growth reduction through selecting those duplications whose cost is justified by sufficient execution-time gains. The paper develops two methods for determining the optimization benefit of restructuring a program region, one based on path-profiles and the other on data-flow frequency analysis. Furthermore, the abstraction underlying the new PRE algorithm enables a simple formulation of speculative code motion guaranteed to have positive dynamic improvements. Finally, we show how to balance the three transformations (code motion, restructuring, and speculation) to achieve a near-complete PRE with very little code growth.We also present algorithms for efficiently computing dynamic benefits. In particular, using an elimination-style data-flow framework, we derive a demand-driven frequency analyzer whose cost can be controlled by permitting a bounded degree of conservative imprecision in the solution.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121732026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
The implementation and evaluation of fusion and contraction in array languages
Christopher M. Lewis, C. Lin, L. Snyder
Array languages such as Fortran 90, HPF and ZPL have many benefits in simplifying array-based computations and expressing data parallelism. However, they can suffer large performance penalties because they introduce intermediate arrays---both at the source level and during the compilation process---which increase memory usage and pollute the cache. Most compilers address this problem by simply scalarizing the array language and relying on a scalar language compiler to perform loop fusion and array contraction. We instead show that there are advantages to performing a form of loop fusion and array contraction at the array level. This paper describes this approach and explains its advantages. Experimental results show that our scheme typically yields runtime improvements of greater than 20% and sometimes up to 400%. In addition, it yields superior memory use when compared against commercial compilers and exhibits comparable memory use when compared with scalar languages. We also explore the interaction between these transformations and communication optimizations.
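A small worked example of the two transformations (the array-language statements and the C shapes are illustrative): a statement-by-statement scalarization of `B = A + 1; C = B * 2;` materializes the intermediate array B, while fusing the loops and contracting B to a scalar removes the allocation and the extra pass over memory.

```c
#include <stdlib.h>

/* Naive scalarization: one loop per array statement, with B materialized. */
void unfused(const double *A, double *C, int n) {
    double *B = malloc((size_t)n * sizeof *B);   /* intermediate array */
    if (!B) return;
    for (int i = 0; i < n; i++) B[i] = A[i] + 1.0;
    for (int i = 0; i < n; i++) C[i] = B[i] * 2.0;
    free(B);
}

/* After loop fusion and array contraction: B shrinks to a scalar, the
   allocation disappears, and each A[i] is consumed while still in cache. */
void fused(const double *A, double *C, int n) {
    for (int i = 0; i < n; i++) {
        double b = A[i] + 1.0;                   /* contracted intermediate */
        C[i] = b * 2.0;
    }
}
```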
{"title":"The implementation and evaluation of fusion and contraction in array languages","authors":"Christopher M. Lewis, C. Liny, L. Snyder","doi":"10.1145/277650.277663","DOIUrl":"https://doi.org/10.1145/277650.277663","url":null,"abstract":"Array languages such as Fortran 90, HPF and ZPL have many benefits in simplifying array-based computations and expressing data parallelism. However, they can suffer large performance penalties because they introduce intermediate arrays---both at the source level and during the compilation process---which increase memory usage and pollute the cache. Most compilers address this problem by simply scalarizing the array language and relying on a scalar language compiler to perform loop fusion and array contraction. We instead show that there are advantages to performing a form of loop fusion and array contraction at the array level. This paper describes this approach and explains its advantages. Experimental results show that our scheme typically yields runtime improvements of greater than 20% and sometimes up to 400%. In addition, it yields superior memory use when compared against commercial compilers and exhibits comparable memory use when compared with scalar languages. We also explore the interaction between these transformations and communication optimizations.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115947480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 65
Communication optimizations for parallel C programs
Yingchun Zhu, L. Hendren
This paper presents algorithms for reducing the communication overhead for parallel C programs that use dynamically-allocated data structures. The framework consists of an analysis phase called possible-placement analysis, and a transformation phase called communication selection. The fundamental idea of possible-placement analysis is to find all possible points for insertion of remote memory operations. Remote reads are propagated upwards, whereas remote writes are propagated downwards. Based on the results of the possible-placement analysis, the communication selection transformation selects the "best" place for inserting the communication, and determines if pipelining or blocking of communication should be performed. The framework has been implemented in the EARTH-McCAT optimizing/parallelizing C compiler, and experimental results are presented for five pointer-intensive benchmarks running on the EARTH-MANNA distributed-memory parallel architecture. These experiments show that the communication optimization can provide performance improvements of up to 16% over the unoptimized benchmarks.
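A sketch of the pipelining effect that communication selection aims for; the split-phase primitives are hypothetical stand-ins (stubbed locally so the code compiles), not the EARTH-McCAT API. Hoisting each remote read one iteration earlier lets its latency overlap the local work of the current iteration.

```c
/* Hypothetical split-phase remote read, stubbed so the sketch is runnable;
   on a real distributed-memory machine the request would travel off-node.  */
typedef struct { double value; int ready; } rr_handle;

static void remote_read_issue(const double *remote, rr_handle *h) {
    h->value = *remote;                 /* stub: completes immediately */
    h->ready = 1;
}
static double remote_read_wait(rr_handle *h) {
    while (!h->ready) { /* spin until the reply arrives */ }
    return h->value;
}

/* Blocking placement: one round-trip of latency on every iteration. */
void blocking(const double *remote, double *local, int n) {
    for (int i = 0; i < n; i++) {
        rr_handle h;
        remote_read_issue(&remote[i], &h);
        local[i] = 2.0 * remote_read_wait(&h);
    }
}

/* Pipelined placement: the read for iteration i+1 is issued before the value
   for iteration i is consumed, overlapping communication with computation.   */
void pipelined(const double *remote, double *local, int n) {
    if (n == 0) return;
    rr_handle h[2];
    remote_read_issue(&remote[0], &h[0]);
    for (int i = 0; i < n; i++) {
        if (i + 1 < n) remote_read_issue(&remote[i + 1], &h[(i + 1) & 1]);
        local[i] = 2.0 * remote_read_wait(&h[i & 1]);
    }
}
```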
{"title":"Communication optimizations for parallel C programs","authors":"Yingchun Zhu, L. Hendren","doi":"10.1145/277650.277723","DOIUrl":"https://doi.org/10.1145/277650.277723","url":null,"abstract":"This paper presents algorithms for reducing the communication overhead for parallel C programs that use dynamically-allocated data structures. The framework consists of an analysis phase called possible-placement analysis, and a transformation phase called communication selection.The fundamental idea of possible-placement analysis is to find all possible points for insertion of remote memory operations. Remote reads are propagated upwards, whereas remote writes are propagated downwards. Based on the results of the possible-placement analysis, the communication selection transformation selects the \"best\" place for inserting the communication, and determines if pipelining or blocking of communication should be performed.The framework has been implemented in the EARTH-McCAT optimizing/parallelizing C compiler, and experimental results are presented for five pointer-intensive benchmarks running on the EARTH-MANNA distributed-memory parallel architecture. These experiments show that the communication optimization can provide performance improvements of up to 16% over the unoptimized benchmarks.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126217580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
Exploiting idle floating-point resources for integer execution
S. Sastry, Subbarao Palacharla, James E. Smith
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these idle floating-point resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.
{"title":"Exploiting idle floating-point resources for integer execution","authors":"S. Sastry, Subbarao Palacharla, James E. Smith","doi":"10.1145/277650.277709","DOIUrl":"https://doi.org/10.1145/277650.277709","url":null,"abstract":"In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal.To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127054319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
The implementation of the Cilk-5 multithreaded language
Matteo Frigo, C. Leiserson, K. H. Randall
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
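Below is a much-simplified sketch of the Dijkstra-style ready-deque protocol the abstract mentions, written with C11 atomics; the real Cilk-5 runtime also handles frames, exceptions, and the two-clone fast/slow split, all omitted here. In the spirit of the work-first principle, the worker's common-case pop costs a decrement, a fence, and a compare, and the lock is taken only when the deque looks nearly empty or a thief intrudes.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

#define DEQ_CAP 4096
typedef void *task_t;

/* Simplified THE-style ready deque: the worker pushes/pops at the tail T,
   thieves steal at the head H; valid entries occupy slots [H, T).         */
typedef struct {
    _Atomic long H, T;        /* seq_cst accesses give the write-then-read,
                                 Dekker-style guarantee both sides rely on  */
    pthread_mutex_t lock;     /* taken by thieves, and by the worker only
                                 when the deque looks (nearly) empty        */
    task_t slot[DEQ_CAP];
} deque_t;

void push(deque_t *d, task_t t) {                    /* worker fast path */
    long tail = atomic_load(&d->T);
    d->slot[tail % DEQ_CAP] = t;
    atomic_store(&d->T, tail + 1);
}

task_t pop(deque_t *d) {                             /* worker */
    long tail = atomic_load(&d->T) - 1;
    atomic_store(&d->T, tail);                       /* optimistically claim */
    if (atomic_load(&d->H) > tail) {                 /* possible thief conflict */
        atomic_store(&d->T, tail + 1);               /* restore, then arbitrate */
        pthread_mutex_lock(&d->lock);
        atomic_store(&d->T, tail);
        if (atomic_load(&d->H) > tail) {             /* deque really is empty */
            atomic_store(&d->T, tail + 1);
            pthread_mutex_unlock(&d->lock);
            return NULL;
        }
        pthread_mutex_unlock(&d->lock);
    }
    return d->slot[tail % DEQ_CAP];                  /* common case: no lock */
}

task_t steal(deque_t *d) {                           /* thieves */
    pthread_mutex_lock(&d->lock);
    long head = atomic_load(&d->H);
    atomic_store(&d->H, head + 1);                   /* claim before checking */
    if (head + 1 > atomic_load(&d->T)) {
        atomic_store(&d->H, head);                   /* lost the race: back off */
        pthread_mutex_unlock(&d->lock);
        return NULL;
    }
    task_t t = d->slot[head % DEQ_CAP];
    pthread_mutex_unlock(&d->lock);
    return t;
}
```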
{"title":"The implementation of the Cilk-5 multithreaded language","authors":"Matteo Frigo, C. Leiserson, K. H. Randall","doi":"10.1145/277650.277725","DOIUrl":"https://doi.org/10.1145/277650.277725","url":null,"abstract":"The fifth release of the multithreaded language Cilk uses a provably good \"work-stealing\" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this \"work-first\" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel \"two-clone\" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125491955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1447
Thin locks: featherweight synchronization for Java
D. F. Bacon, Ravi B. Konuru, Chet Murthy, M. Serrano
Language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs.
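A simplified sketch of the thin-lock fast path (the 8-bit count, the field layout, and the inflation stub are assumptions, not the paper's exact encoding): an unlocked object is acquired with a single compare-and-swap, a nested re-acquire by the owner is a plain store, and only contention or count overflow would fall back to a heavyweight ("fat") lock.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* One lock word per object; illustrative layout: owner id in the high bits,
   nesting count in the low 8 bits, 0 meaning "unlocked".                    */
typedef struct { _Atomic uint32_t lockword; } object_t;

#define COUNT_BITS  8u
#define COUNT_MASK  ((1u << COUNT_BITS) - 1u)

static bool thin_lock(object_t *o, uint32_t tid) {    /* tid must be nonzero */
    uint32_t unlocked = 0;
    uint32_t owned    = tid << COUNT_BITS;             /* owner = tid, count = 0 */
    if (atomic_compare_exchange_strong(&o->lockword, &unlocked, owned))
        return true;                                   /* common case: one CAS  */
    uint32_t w = atomic_load(&o->lockword);
    if ((w >> COUNT_BITS) == tid && (w & COUNT_MASK) < COUNT_MASK) {
        atomic_store(&o->lockword, w + 1);             /* nested lock: no CAS   */
        return true;
    }
    /* Contended, or the count overflowed: a real runtime would inflate the
       thin lock into a fat lock here; this sketch just reports failure.      */
    return false;
}

static void thin_unlock(object_t *o, uint32_t tid) {
    uint32_t w = atomic_load(&o->lockword);
    if ((w >> COUNT_BITS) != tid)
        return;                                        /* not the owner: ignore */
    if (w & COUNT_MASK)
        atomic_store(&o->lockword, w - 1);             /* drop one nesting level */
    else
        atomic_store(&o->lockword, 0);                 /* release entirely */
}
```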
{"title":"Thin locks: featherweight synchronization for Java","authors":"D. F. Bacon, Ravi B. Konuru, Chet Murthy, M. Serrano","doi":"10.1145/277650.277734","DOIUrl":"https://doi.org/10.1145/277650.277734","url":null,"abstract":"Language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs.","PeriodicalId":365404,"journal":{"name":"Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation","volume":"271 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133977375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 89