Run-Time Support for Optimizations Based on Escape Analysis (doi:10.1109/CGO.2007.34)
Thomas Kotzmann, H. Mössenböck
We implemented a new escape analysis algorithm for Sun Microsystems' Java HotSpot™ VM. The results are used to replace objects with scalar variables, to allocate objects on the stack, and to remove synchronization. This paper deals with the representation of optimized objects in the debugging information and with reallocation and garbage collection support for the safe execution of optimized methods. Assignments to fields of parameters that can refer to both stack and heap objects are associated with an extended write barrier that skips card marking for stack objects. The traversal of objects during garbage collection uses a wrapper that abstracts from stack objects and presents their pointer fields as root pointers to the garbage collector.
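The extended write barrier can be pictured as a guard around ordinary card marking. The following is a minimal Python sketch under simplifying assumptions: addresses are plain integers, the heap is a flat dictionary, and the stack is a contiguous address range. All names here (card_table, CARD_SHIFT, write_field) are hypothetical; the real barrier is emitted as machine code by the HotSpot compiler.

```python
CARD_SHIFT = 9                    # 512-byte cards, a common card-table granularity
card_table = bytearray(1 << 20)   # one dirty byte per card of the (toy) heap

def stack_allocated(addr, stack_base, stack_top):
    # An object is treated as stack-allocated if its address lies in the frame.
    return stack_base <= addr < stack_top

def write_field(heap, obj_addr, field_offset, value, stack_base, stack_top):
    heap[obj_addr + field_offset] = value
    # Extended barrier: skip card marking for stack objects, whose pointer
    # fields the collector instead sees as root pointers via the wrapper.
    if not stack_allocated(obj_addr, stack_base, stack_top):
        card_table[obj_addr >> CARD_SHIFT] = 1   # dirty the card as usual

heap = {}
write_field(heap, 0x10000, 8, "ref", stack_base=0xF0000, stack_top=0xF8000)
```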
Heterogeneous Clustered VLIW Microarchitectures (doi:10.1109/CGO.2007.15)
Alex Aletà, J. M. Codina, Antonio González, D. Kaeli
Increasing performance while at the same time reducing power consumption is a major design tradeoff in current microprocessors. In this paper, we investigate the potential of a heterogeneous clustered VLIW microarchitecture. In the proposed microarchitecture, each cluster, the interconnection network, and the supporting memory hierarchy can run at different frequencies and voltages. Some of the clusters can then be configured to be performance-oriented and run at high frequency, while the other clusters can be configured to be low-power-oriented and run at lower frequencies, thus reducing overall consumption. For this heterogeneous design to be effective, we need to select the most suitable frequencies and voltages for each component. We propose a scheme that chooses these parameters based on a model that estimates the energy consumption and the execution time of floating-point codes at compile time. Finally, we present a modulo scheduling technique based on graph partitioning that exploits the opportunities presented by heterogeneous clustered microarchitectures. Results show that the energy-delay-squared product (ED2) can be reduced significantly: by 15% on average for a microarchitecture with 4 clusters, and by as much as 35% for selected programs.
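As an illustration of the selection step, the sketch below enumerates per-cluster (frequency, voltage) levels and keeps the configuration minimizing E·D². The LEVELS table and the cost model are stand-ins of my own; the paper derives compile-time estimates of energy and execution time for floating-point codes.

```python
from itertools import product

LEVELS = [(1.0, 1.2), (0.8, 1.0), (0.5, 0.8)]   # (GHz, volts): hypothetical operating points

def estimate(config, work_per_cluster):
    # Toy model: delay is set by the slowest cluster; energy scales with V^2
    # per unit of work. A stand-in for the paper's compile-time model.
    delay = max(w / f for (f, v), w in zip(config, work_per_cluster))
    energy = sum(v * v * w for (f, v), w in zip(config, work_per_cluster))
    return energy, delay

def best_config(num_clusters, work_per_cluster):
    best, best_ed2 = None, float("inf")
    for config in product(LEVELS, repeat=num_clusters):   # 3^n candidate configurations
        e, d = estimate(config, work_per_cluster)
        if e * d * d < best_ed2:
            best, best_ed2 = config, e * d * d
    return best, best_ed2

# A 4-cluster machine where one cluster carries most of the work: the busy
# cluster stays fast while lightly loaded clusters drop to low-power points.
print(best_config(4, [10.0, 2.0, 1.0, 1.0]))
```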
Rapidly Selecting Good Compiler Optimizations using Performance Counters (doi:10.1109/CGO.2007.32)
John Cavazos, G. Fursin, F. Agakov, Edwin V. Bonilla, M. O’Boyle, O. Temam
Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is non-trivial. Several techniques have been proposed that search the space of compiler options for good solutions; however, such approaches can be expensive. This paper proposes a different approach that uses performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line which can then be used to determine good settings for any new program. We show that such an approach outperforms the state of the art and is two orders of magnitude faster on average. Furthermore, we show that our performance-counter-based approach outperforms techniques based on static code features. Using our technique we achieve a 17% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on an AMD Athlon 64 3700+ platform.
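To make the off-line-model idea concrete, here is a deliberately simplified sketch in which the learned model is replaced by a 1-nearest-neighbor lookup over normalized counter vectors; the paper itself trains a richer probabilistic model over many programs and flag settings, so everything below is assumption.

```python
import numpy as np

class CounterModel:
    """Predict promising flag settings from performance-counter features."""
    def __init__(self, feature_vectors, good_flag_settings):
        self.X = np.asarray(feature_vectors, dtype=float)  # one row per training program
        self.flags = good_flag_settings                    # best flags found for each

    def predict(self, counters):
        # Characterize the new program by one baseline run's counter readings,
        # then return the flags of the most similar training program.
        x = np.asarray(counters, dtype=float)
        nearest = int(np.argmin(np.linalg.norm(self.X - x, axis=1)))
        return self.flags[nearest]

# Usage with made-up two-feature vectors (e.g., cache-miss and branch rates):
model = CounterModel([[0.12, 3.4], [0.80, 1.1]],
                     [["-O3", "-funroll-loops"], ["-O2"]])
print(model.predict([0.75, 1.0]))   # -> ['-O2']
```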
Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors (doi:10.1109/CGO.2007.31)
Hyesoon Kim, José A. Joao, O. Mutlu, Y. Patt
Dynamic predication has been proposed to reduce the branch misprediction penalty due to hard-to-predict branch instructions. A proposed dynamic predication architecture, the diverge-merge processor (DMP), provides large performance improvements by dynamically predicating a large set of complex control-flow graphs that result in branch mispredictions. DMP requires significant support from a profiling compiler to determine which branch instructions and control-flow structures can be dynamically predicated. However, previous work on dynamic predication did not extensively examine the tradeoffs involved in profiling and code generation for dynamic predication architectures. This paper describes compiler support for obtaining high performance in the diverge-merge processor. We describe new profile-driven algorithms and heuristics to select branch instructions that are suitable and profitable for dynamic predication. We also develop a new profile-based analytical cost-benefit model to estimate, at compile time, the performance benefits of the dynamic predication of different types of control-flow structures, including complex hammocks and loops. Our evaluations show that DMP can provide 20.4% average performance improvement over a conventional processor on SPEC integer benchmarks with our optimized compiler algorithms, whereas the average performance improvement of the best-performing alternative simple compiler algorithm is 4.5%. We also find that, with the proposed algorithms, DMP performance is not significantly affected by the differences in profile- and run-time input data sets.
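A per-branch cost-benefit test of the kind such a compiler applies might look like the following sketch; the machine parameters and the linear cost model are illustrative stand-ins for the paper's analytical model, not its actual formulas.

```python
def should_predicate(mispred_rate, pipeline_penalty,
                     taken_path_len, fallthru_path_len, exec_width=4):
    # Cost of dynamic predication: instructions from both paths are fetched
    # and executed up to the control-flow merge point.
    both_paths = (taken_path_len + fallthru_path_len) / exec_width
    one_path = max(taken_path_len, fallthru_path_len) / exec_width
    extra_cycles = both_paths - one_path
    # Benefit: the fraction of executions that would have mispredicted,
    # weighted by the pipeline flush penalty.
    avoided_cycles = mispred_rate * pipeline_penalty
    return avoided_cycles > extra_cycles

# A branch that mispredicts 15% of the time on a 20-cycle pipeline, with
# short hammock arms, is a profitable candidate:
print(should_predicate(0.15, 20, taken_path_len=6, fallthru_path_len=4))  # True
```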
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time (doi:10.1109/CGO.2007.21)
L. Pouchet, C. Bastoul, Albert Cohen, Nicolas Vasilache
Emerging microprocessors offer unprecedented parallel computing capabilities and deeper memory hierarchies, increasing the importance of loop transformations in optimizing compilers. Because compiler heuristics rely on simplistic performance models, and because they are bound to a limited set of transformation sequences, they only uncover a fraction of the peak performance on typical benchmarks. Iterative optimization is a maturing framework to address these limitations, but so far it has not been successfully applied to complex loop transformation sequences because of the combinatorics of the optimization search space. We focus on the class of loop transformations that can be expressed as one-dimensional affine schedules. We define a systematic exploration method to enumerate the space of all legal, distinct transformations in this class. This method is based on an upstream characterization, as opposed to state-of-the-art downstream filtering approaches. Our results demonstrate orders-of-magnitude improvements in the size of the search space and in the convergence speed of a dedicated iterative optimization heuristic.
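The upstream characterization can be pictured as enumerating only schedules that are legal by construction. The sketch below does this for bounded integer coefficients, testing legality against a few sampled dependence pairs; the actual method works on exact polyhedral dependence representations, so this is only a toy rendering of the idea.

```python
from itertools import product

def legal_schedules(dependences, dim, bound=2):
    """Yield coefficient vectors theta with theta.(t - s) >= 1 for all deps,
    i.e., one-dimensional affine schedules that strictly satisfy every
    dependence from source iteration s to target iteration t."""
    for theta in product(range(-bound, bound + 1), repeat=dim):
        if all(sum(c * (t - s) for c, s, t in zip(theta, src, tgt)) >= 1
               for src, tgt in dependences):
            yield theta

# Two sample dependences of a 2-deep loop nest:
# (i, j) -> (i+1, j) and (i, j) -> (i, j+1).
deps = [((0, 0), (1, 0)), ((0, 0), (0, 1))]
print(list(legal_schedules(deps, dim=2)))   # [(1, 1), (1, 2), (2, 1), (2, 2)]
```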
Virtual Cluster Scheduling Through the Scheduling Graph (doi:10.1109/CGO.2007.39)
J. M. Codina, Jesús Sánchez, Antonio González
This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique makes use of a novel representation named the scheduling graph, which describes all possible schedules. A powerful deduction process is applied to this graph, reducing at each step the set of possible schedules. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment of instructions is performed using another novel concept called virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until the scheduling of the instructions is finalized. The advantages of this novel approach include (1) accurate scheduling information when assigning instructions to clusters, and (2) accurate information about the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from SpecInt95 and MediaBench. The results show that this approach produces better schedules than the previous state of the art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2 clusters) to 9.5% (4 clusters).
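Virtual clusters behave like equivalence classes over instructions, which suggests union-find as one plausible bookkeeping structure. The sketch below is a hypothetical rendering of that idea, with the physical-cluster mapping deferred as the paper proposes; the class and method names are my own.

```python
class VirtualClusters:
    """Track sets of instructions deduced to require the same cluster."""
    def __init__(self, n_instructions):
        self.parent = list(range(n_instructions))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def must_colocate(self, a, b):
        # A deduction step concluded that a and b share a cluster: merge.
        self.parent[self.find(a)] = self.find(b)

    def same_virtual_cluster(self, a, b):
        return self.find(a) == self.find(b)

vc = VirtualClusters(5)
vc.must_colocate(0, 1)
vc.must_colocate(1, 2)
print(vc.same_virtual_cluster(0, 2))   # True: merged transitively
```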
SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance (doi:10.1109/CGO.2007.37)
S. Wallace, K. Hazelwood
Dynamic instrumentation systems have proven to be extremely valuable for program introspection, architectural simulation, and bug detection. Yet a major drawback of modern instrumentation systems is that the instrumented applications often execute several orders of magnitude slower than native application performance. In this paper, we present a novel approach to dynamic instrumentation where several non-overlapping slices of an application are launched as separate instrumentation threads and executed in parallel in order to approach real-time performance. A direct implementation of our technique in the Pin dynamic instrumentation system results in dramatic speedups for various instrumentation tasks, often resulting in order-of-magnitude performance improvements. Our implementation is available as part of the Pin distribution, which has been downloaded over 10,000 times since its release.
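The slicing idea can be approximated with process-level parallelism: at each checkpoint, fork a copy that executes the next slice under instrumentation while the original continues at native speed. The Unix-only sketch below is purely conceptual; SuperPin does this inside Pin itself, not with Python-level forks, and the slice/checkpoint machinery here is hypothetical.

```python
import os

def run_slice_instrumented(slice_fn):
    pid = os.fork()
    if pid == 0:                       # child: execute this slice with analysis code
        slice_fn(instrumented=True)
        os._exit(0)
    return pid                         # parent: continue natively, reap later

def application(slices):
    children = []
    for s in slices:
        # Checkpoint: launch an instrumented copy of the upcoming slice...
        children.append(run_slice_instrumented(s))
        # ...while the uninstrumented application runs the same slice natively.
        s(instrumented=False)
    for pid in children:
        os.waitpid(pid, 0)             # merge analysis results once slices finish
```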
Compiler-Directed Variable Latency Aware SPM Management to Cope With Timing Problems (doi:10.1109/CGO.2007.6)
O. Ozturk, Guilin Chen, M. Kandemir, Mustafa Karaköy
This paper proposes and experimentally evaluates a compiler-driven approach that manages an on-chip scratch-pad memory (SPM) assuming different latencies for different SPM lines. Our goal is to reduce execution cycles without creating any reliability problems due to variations in access latencies. The proposed scheme achieves its goal by evaluating the reuse of different data items and adopting a reuse- and latency-aware data-to-SPM placement. It also employs data migration within the SPM when doing so helps to cut down the number of execution cycles further. We also discuss an alternate scheme that can reduce the latency of selected SPM locations by controlling a circuit-level mechanism in software to further improve performance. We implemented our approach within an optimizing compiler and tested its effectiveness through extensive simulations. Our experiments with twelve embedded application codes show that the proposed approach performs much better than the worst-case-based design paradigm (16.2% improvement on average) and comes close (within 5.7%) to a hypothetical best-case design (i.e., one with no process variation) in which all SPM locations uniformly have low latency.
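The placement policy can be sketched as a greedy matching of hot data to fast lines. The per-line latencies and access counts below are hypothetical inputs of the kind the compiler would derive from reuse analysis; the real scheme also handles capacity and migration, which this sketch omits.

```python
def place_in_spm(items, line_latencies):
    """items: list of (name, access_count); line_latencies: cycles per SPM line.
    Maps the most frequently reused data to the lowest-latency lines."""
    hot_first = sorted(items, key=lambda it: it[1], reverse=True)
    fast_first = sorted(range(len(line_latencies)), key=lambda i: line_latencies[i])
    placement = {}
    for (name, _), line in zip(hot_first, fast_first):
        placement[name] = line        # hottest data -> fastest available line
    return placement

lat = [1, 1, 3, 2]                    # per-line latencies under process variation
print(place_in_spm([("A", 900), ("B", 40), ("C", 300), ("D", 120)], lat))
# {'A': 0, 'C': 1, 'D': 3, 'B': 2}
```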
Code Generation and Optimization for Transactional Memory Constructs in an Unmanaged Language (doi:10.1109/CGO.2007.4)
Cheng Wang, Weiyu Chen, Youfeng Wu, Bratin Saha, Ali-Reza Adl-Tabatabai
Transactional memory offers significant advantages for concurrency control compared to locks. This paper presents the design and implementation of transactional memory constructs in an unmanaged language. Unmanaged languages pose a unique set of challenges for transactional memory constructs: for example, lack of type and memory safety, use of function pointers, and aliasing of local variables. This paper describes novel compiler and runtime mechanisms that address these challenges and optimize the performance of transactions in an unmanaged environment. We have implemented these mechanisms in a production-quality C compiler and a high-performance software transactional memory runtime. We measure the effectiveness of these optimizations and compare the performance of lock-based versus transaction-based programming on a set of concurrent data structures and the SPLASH-2 benchmark suite. On a 16-processor SMP system, the transaction-based version of the SPLASH-2 benchmarks scales much better than the coarse-grain locking version and performs comparably to the fine-grain locking version. Compiler optimizations significantly reduce the overheads of transactional memory so that, on a single thread, the transaction-based version incurs only about 6.4% overhead compared to the lock-based version for the SPLASH-2 benchmark suite. Thus, our system is the first to demonstrate that transactions integrate well with an unmanaged language, and can perform as well as fine-grain locking while providing the programming ease of coarse-grain locking, even in an unmanaged environment.
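The runtime's job can be illustrated with a toy word-granularity STM that buffers writes and validates a versioned read set at commit. Locking, contention management, and the unmanaged-language issues the paper actually addresses (function pointers, stack aliasing) are all omitted; every name below is a placeholder.

```python
class Transaction:
    def __init__(self, memory, versions):
        self.memory, self.versions = memory, versions   # shared dicts: addr -> value/version
        self.read_set, self.write_buf = {}, {}

    def read(self, addr):
        if addr in self.write_buf:                      # read-own-write
            return self.write_buf[addr]
        self.read_set.setdefault(addr, self.versions[addr])  # record version first seen
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buf[addr] = value                    # buffer until commit

    def commit(self):
        # Validate: abort if any location read has since changed version.
        if any(self.versions[a] != v for a, v in self.read_set.items()):
            return False                                # caller re-executes the transaction
        for addr, value in self.write_buf.items():
            self.memory[addr] = value
            self.versions[addr] += 1
        return True

mem, ver = {"x": 0}, {"x": 0}
t = Transaction(mem, ver)
t.write("x", t.read("x") + 1)
print(t.commit(), mem["x"])   # True 1
```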
Isla Vista Heap Sizing: Using Feedback to Avoid Paging (doi:10.1109/CGO.2007.20)
Chris Grzegorczyk, Sunil Soman, C. Krintz, R. Wolski
Managed runtime environments (MREs) employ garbage collection (GC) for automatic memory management. However, GC induces pressure on the virtual memory (VM) manager, since it may touch pages that are not related to the working set of the application. Paging due to GC can significantly hurt performance, even when the application's working set fits into physical memory. We present a feedback-directed heap-resizing mechanism that avoids GC-induced paging using information from the operating system (OS). We avoid costly GCs when physical memory is available, and trade GC for paging when memory is constrained. Our mechanism is simple and uses allocation stall events during GC alone to trigger heap resizing, without user participation or OS kernel modification. Our system enables significant performance improvements when real memory is restricted, and performance similar to or better than the current state-of-the-art MRE when memory is unconstrained.
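The feedback rule can be summarized in a few lines: allocation stalls observed during a collection signal memory pressure and shrink the heap, while stall-free collections let it grow back. The constants and the multiplicative policy below are hypothetical, a sketch of the mechanism rather than the system's actual tuning.

```python
GROW_FACTOR, SHRINK_FACTOR = 1.1, 0.7   # illustrative constants

def resize_heap(current_size, stalls_during_gc, min_size, max_size):
    if stalls_during_gc > 0:
        # Paging is costlier than extra GCs: back off aggressively.
        new_size = current_size * SHRINK_FACTOR
    else:
        # No pressure observed: grow to make collections less frequent.
        new_size = current_size * GROW_FACTOR
    return max(min_size, min(max_size, new_size))

size = 512  # MB
for stalls in [0, 0, 3, 0]:             # stall counts reported after each GC
    size = resize_heap(size, stalls, min_size=64, max_size=2048)
    print(round(size))                  # 563, 620, 434, 477
```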