
International Symposium on Code Generation and Optimization, 2003. CGO 2003. Latest Publications

Coupling on-line and off-line profile information to improve program performance
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191534
C. Krintz
In this paper we describe a novel execution environment for Java programs that substantially improves execution performance by incorporating both on-line and off-line profile information to guide dynamic optimization. By using both types of profile collection techniques, we are able to exploit the strengths of each constituent approach: profile accuracy and low overhead. Such coupling also reduces the negative impact of these approaches when each is used in isolation. On-line profiling introduces overhead for dynamic instrumentation, measurement, and decision making. Off-line profile information can be inaccurate when program inputs for execution and optimization differ from those used for profiling. To combat these drawbacks and to achieve the benefits of both on-line and off-line profiling, we developed a dynamic compilation system (based on JikesRVM) that makes use of both. As a result, we are able to improve Java program performance by 9% on average for the programs studied.
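The coupling described above can be sketched as a toy policy: seed the optimizer's hotness estimates from an off-line profile, then correct them with cheap on-line counters so the system reacts when the current input diverges from the training input. All names, weights, and thresholds below are invented for illustration; this is not the paper's JikesRVM-based system.

```python
# Hypothetical sketch: blend an off-line profile with on-line counters
# to rank recompilation candidates. Not the paper's actual algorithm.

def blend_profiles(offline, online, weight_online=0.5):
    """Combine two {method: count} profiles into normalized hotness scores."""
    def normalize(profile):
        total = sum(profile.values()) or 1
        return {m: c / total for m, c in profile.items()}
    off, on = normalize(offline), normalize(online)
    return {m: (1 - weight_online) * off.get(m, 0.0)
               + weight_online * on.get(m, 0.0)
            for m in set(off) | set(on)}

def pick_optimization_candidates(hotness, threshold=0.05):
    """Methods whose blended hotness crosses the recompilation threshold."""
    return sorted((m for m, h in hotness.items() if h >= threshold),
                  key=lambda m: -hotness[m])

offline = {"parse": 900, "eval": 80, "gc": 20}        # from a training run
online = {"parse": 100, "eval": 700, "render": 200}   # current execution
hot = blend_profiles(offline, online)
print(pick_optimization_candidates(hot))
```

Note how `eval` and `render`, cold in the training run, surface as candidates once the on-line counters are blended in.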
Citations: 52
An infrastructure for adaptive dynamic optimization
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191551
Derek Bruening, Timothy Garnett, Saman P. Amarasinghe
Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate the usefulness and effectiveness of our framework, we implemented several optimizations. These improve the performance of some applications by as much as 40% relative to native execution. The average speedup relative to base DynamoRIO performance is 12%.
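As a toy analogue of such a client interface (this is not DynamoRIO's real API; all names are invented), a framework can hand each linear instruction stream to registered client callbacks, which may rewrite it before execution:

```python
# Toy illustration of a client-based code-modification framework:
# clients register a basic-block callback and transform the linear
# instruction stream they are handed. Invented names throughout.

class Framework:
    def __init__(self):
        self.clients = []

    def register_bb_event(self, callback):
        """Register a client transformation over basic blocks."""
        self.clients.append(callback)

    def execute(self, basic_block):
        """Run every client's transform over the block, then 'execute' it."""
        for transform in self.clients:
            basic_block = transform(basic_block)
        return basic_block

def redundant_load_removal(bb):
    """Client: drop a load that exactly repeats the previous instruction."""
    out = []
    for insn in bb:
        if out and insn == out[-1] and insn[0] == "load":
            continue
        out.append(insn)
    return out

fw = Framework()
fw.register_bb_event(redundant_load_removal)
bb = [("load", "r1", "[x]"), ("load", "r1", "[x]"), ("add", "r2", "r1")]
print(fw.execute(bb))
```

Restricting clients to one linear stream at a time, as in the paper, is what keeps the interface simple and the per-block work cheap.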
Citations: 571
Improving quasi-dynamic schedules through region slip
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191541
Francesco Spadini, Brian Fahs, Sanjay J. Patel, S. Lumetta
Modern processors perform dynamic scheduling to achieve better utilization of execution resources. A schedule created at run-time is often better than one created at compile-time as it can dynamically adapt to specific events encountered at execution-time. In this paper, we examine some fundamental impediments to effective static scheduling. More specifically, we examine the question of why schedules generated quasi-dynamically by a low-level runtime optimizer and executed on a statically scheduled machine perform worse than using a dynamically-scheduled approach. We observe that such schedules suffer because of region boundaries and a skewed distribution of parallelism towards the beginning of a region. To overcome these limitations, we investigate a new concept, region slip, in which the schedules of different statically-scheduled regions can be interleaved in the processor issue queue to reduce the region boundary effects that cause empty issue slots.
Citations: 10
Compiler optimization-space exploration
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191546
Spyridon Triantafyllis, Manish Vachharajani, David I. August
To meet the demands of modern architectures, optimizing compilers must incorporate an ever larger number of increasingly complex transformation algorithms. Since code transformations may often degrade performance or interfere with subsequent transformations, compilers employ predictive heuristics to guide optimizations by predicting their effects a priori. Unfortunately, the unpredictability of optimization interaction and the irregularity of today's wide-issue machines severely limit the accuracy of these heuristics. As a result, compiler writers may temper high variance optimizations with overly conservative heuristics or may exclude these optimizations entirely. While this process results in a compiler capable of generating good average code quality across the target benchmark set, it is at the cost of missed optimization opportunities in individual code segments. To replace predictive heuristics, researchers have proposed compilers which explore many optimization options, selecting the best one a posteriori. Unfortunately, these existing iterative compilation techniques are not practical for reasons of compile time and applicability. We present the Optimization-Space Exploration (OSE) compiler organization, the first practical iterative compilation strategy applicable to optimizations in general-purpose compilers. Instead of replacing predictive heuristics, OSE uses the compiler writer's knowledge encoded in the heuristics to select a small number of promising optimization alternatives for a given code segment. Compile time is limited by evaluating only these alternatives for hot code segments using a general compile-time performance estimator. An OSE-enhanced version of Intel's highly-tuned, aggressively optimizing production compiler for IA-64 yields a significant performance improvement, more than 20% in some cases, on Itanium for SPEC codes.
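The OSE selection step can be caricatured in a few lines: enumerate a heuristic-pruned set of option vectors for a hot segment and keep the one a compile-time estimator ranks best. The option names and the estimator formula below are purely illustrative assumptions, not the Intel compiler's actual options or cost model.

```python
# Hedged sketch of optimization-space exploration: compile a hot segment
# under a handful of heuristic-chosen configurations and pick the one a
# performance estimator ranks best. Options and costs are invented.

import itertools

def candidate_configs():
    """A small, heuristic-pruned slice of the full optimization space."""
    unroll_factors = [1, 4]
    if_convert = [False, True]
    return [dict(unroll=u, if_convert=p)
            for u, p in itertools.product(unroll_factors, if_convert)]

def estimate_cycles(segment, cfg):
    """Stand-in compile-time performance estimator (purely illustrative)."""
    cycles = segment["insns"] / cfg["unroll"]          # ILP from unrolling
    cycles += segment["branches"] * (0.3 if cfg["if_convert"] else 1.0)
    cycles += 5 * (cfg["unroll"] - 1)                  # code-size penalty
    return cycles

def explore(segment):
    """Select the a-posteriori best configuration for one hot segment."""
    return min(candidate_configs(), key=lambda c: estimate_cycles(segment, c))

hot_loop = {"insns": 400, "branches": 60}
print(explore(hot_loop))
```

The point of OSE is that this search runs only over the few alternatives the existing heuristics already consider promising, and only for hot segments, which is what keeps compile time bounded.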
Citations: 269
The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191529
James C. Dehnert, Brian K. Grant, John Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, Jim Mattson
Transmeta's Crusoe microprocessor is a full, system-level implementation of the x86 architecture, comprising a native VLIW microprocessor with a software layer, the Code Morphing Software (CMS), that combines an interpreter, dynamic binary translator, optimizer, and run-time system. In its general structure, CMS resembles other binary translation systems described in the literature, but it is unique in several respects. The wide range of PC workloads that CMS must handle gracefully in real-life operation, plus the need for full system-level x86 compatibility, expose several issues that have received little or no attention in previous literature, such as exceptions and interrupts, I/O, DMA, and self-modifying code. In this paper we discuss some of the challenges raised by these issues, and present the techniques developed in Crusoe and CMS to meet those challenges. The key to these solutions is the Crusoe paradigm of aggressive speculation, recovery to a consistent x86 state using unique hardware commit-and-rollback support, and adaptive retranslation when exceptions occur too often to be handled efficiently by interpretation.
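The commit-and-rollback discipline can be illustrated with a toy model (not CMS itself; state layout and names are invented): speculate on a shadow copy of the committed state, commit on success, and fall back to precise interpretation when the translated region faults.

```python
# Toy model of speculation with commit-and-rollback: a translated region
# runs against a shadow copy of the last consistent state; a fault discards
# the shadow copy and re-executes via a precise interpreter. Illustrative only.

class SpeculativeVM:
    def __init__(self, state):
        self.committed = dict(state)   # last known-consistent guest state
        self.failures = {}             # region -> fault count

    def run_region(self, name, translated, interpret):
        work = dict(self.committed)    # speculate on a shadow copy
        try:
            translated(work)
            self.committed = work      # commit point: speculation held
        except Exception:
            self.failures[name] = self.failures.get(name, 0) + 1
            interpret(self.committed)  # precise, in-order fallback
        return self.committed

def fast(state):                       # aggressive translation that faults
    state["eax"] += 10
    raise RuntimeError("simulated page fault")

def slow(state):                       # conservative interpretation
    state["eax"] += 1

vm = SpeculativeVM({"eax": 0})
print(vm.run_region("r1", fast, slow))   # rolled back, then interpreted
```

A per-region fault count like `failures` is the hook for the adaptive part: a region that faults repeatedly would be retranslated more conservatively rather than interpreted forever.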
Citations: 306
Speculative register promotion using advanced load address table (ALAT)
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191539
Jin Lin, Tong Chen, W. Hsu, P. Yew
The pervasive use of pointers with complicated patterns in C programs often constrains compiler alias analysis to yield conservative register allocation and promotion. Speculative register promotion with hardware support has the potential to more aggressively promote memory references into registers in the presence of aliases. This paper studies the use of the advanced load address table (ALAT), a data speculation feature defined in the IA-64 architecture, for speculative register promotion. An algorithm for speculative register promotion based on partial redundancy elimination is presented. The algorithm is implemented in Intel's open research compiler (ORC). Experiments on SPEC CPU2000 benchmark programs are conducted to show that speculative register promotion can improve performance of some benchmarks by 1% to 7%.
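A conceptual model of how an ALAT-style check enables the promotion (table mechanics simplified and names invented; not exact IA-64 semantics): an advanced load records its address, any store to that address invalidates the entry, and a later check tells compiler-inserted recovery code whether the promoted register value is still valid.

```python
# Simplified software model of ALAT-checked speculative register promotion.
# Real IA-64 uses ld.a / chk.a instructions and a hardware table; this is
# only a conceptual sketch.

class ALAT:
    def __init__(self):
        self.entries = {}                 # register -> watched address

    def advanced_load(self, reg, addr, memory):
        """Promote memory[addr] into reg and start watching addr."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        """Any store invalidates entries whose watched address it hits."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        """True if no conflicting store occurred since the advanced load."""
        return reg in self.entries

mem = {0x100: 7}
alat = ALAT()
r1 = alat.advanced_load("r1", 0x100, mem)  # promote *p into r1
alat.store(0x100, 9, mem)                  # possibly-aliasing store
if not alat.check("r1"):                   # compiler-inserted check
    r1 = mem[0x100]                        # recovery code: re-load
print(r1)
```

Without the check, keeping the stale `7` in `r1` would be unsafe whenever the compiler cannot prove the store does not alias; the ALAT makes the promotion safe to attempt speculatively.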
Citations: 30
Optimization for the Intel® Itanium® architecture register stack
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191538
A. Settle, D. Connors, Gerolf Hoflehner, Daniel M. Lavery
The Intel® Itanium® architecture contains a number of innovative compiler-controllable features designed to exploit instruction level parallelism. New code generation and optimization techniques are critical to the application of these features to improve processor performance. For instance, the Itanium® architecture provides a compiler-controllable virtual register stack to reduce the penalty of memory accesses associated with procedure calls. The Itanium® Register Stack Engine (RSE) transparently manages the register stack and saves and restores physical registers to and from memory as needed. Existing code generation techniques for the register stack aggressively allocate virtual registers without regard to the register pressure on different control-flow paths. As such, applications with large data sets may stress the RSE, and cause substantial execution delays due to the high number of register saves and restores. Since the Itanium® architecture is developed around Explicitly Parallel Instruction Computing (EPIC) concepts, solutions to increasing the register stack efficiency favor code generation techniques rather than hardware approaches.
Citations: 7
Adaptive online context-sensitive inlining
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191550
K. Hazelwood, D. Grove
As current trends in software development move toward more complex object-oriented programming, inlining has become a vital optimization that provides substantial performance improvements to C++ and Java programs. Yet, the aggressiveness of the inlining algorithm must be carefully monitored to effectively balance performance and code size. The state-of-the-art is to use profile information (associated with call edges) to guide inlining decisions. In the presence of virtual method calls, profile information for one call edge may not be sufficient for making effectual inlining decisions. Therefore, we explore the use of profiling data with additional levels of context sensitivity. In addition to exploring fixed levels of context sensitivity, we explore several adaptive schemes that attempt to find the ideal degree of context sensitivity for each call site. Our techniques are evaluated on the basis of runtime performance, code size and dynamic compilation time. On average, we found that with minimal impact on performance (+/-1%) context sensitivity can enable 10% reductions in compiled code space and compile time. Performance on individual programs varied from -4.2% to 5.3% while reductions in compile time and code space of up to 33.0% and 56.7% respectively were obtained.
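The effect of adding context sensitivity can be sketched with an invented profile structure keyed by the last k callers: a deeper context can expose a monomorphic receiver that a merged, context-free profile hides. Names and the 90% bias threshold are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: virtual-call profiles keyed by the last k callers.
# Splitting a merged profile by calling context can reveal a dominant
# receiver at one call site, making inlining worthwhile there.

from collections import Counter, defaultdict

class ContextProfile:
    def __init__(self, k):
        self.k = k                               # levels of context
        self.targets = defaultdict(Counter)      # context -> receiver counts

    def _context(self, call_stack):
        return tuple(call_stack[-self.k:]) if self.k else ()

    def record(self, call_stack, receiver):
        self.targets[self._context(call_stack)][receiver] += 1

    def should_inline(self, call_stack, min_bias=0.9):
        """Return the receiver to inline if one dominates, else None."""
        counts = self.targets[self._context(call_stack)]
        total = sum(counts.values())
        if not total:
            return None
        target, n = counts.most_common(1)[0]
        return target if n / total >= min_bias else None

p0, p2 = ContextProfile(0), ContextProfile(2)    # context-free vs. 2-level
samples = ([(["main", "draw"], "Circle.area")] * 9
           + [(["main", "report"], "Square.area")] * 9)
for stack, receiver in samples:
    p0.record(stack, receiver)
    p2.record(stack, receiver)
print(p0.should_inline(["main", "draw"]))        # merged: no clear winner
print(p2.should_inline(["main", "draw"]))        # split: monomorphic
```

The adaptive schemes in the paper amount to choosing `k` per call site, since deeper contexts cost profile space and compile time where they do not change the decision.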
Citations: 58
Integrated prepass scheduling for a Java just-in-time compiler on the IA-64
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191542
T. Inagaki, H. Komatsu, T. Nakatani
We present a new integrated prepass scheduling (IPS) algorithm for a Java just-in-time (JIT) compiler which integrates register minimization into list scheduling. We use backtracking in the list scheduling when we have used up all the available registers. To reduce the overhead of backtracking, we incrementally maintain a set of candidate instructions for undoing scheduling. To maximize the ILP after undoing scheduling, we select an instruction chain with the smallest increase in the total execution time. We implemented our new algorithm in a production-level Java JIT compiler for the Intel Itanium processor. The experiment showed that, compared to the best known algorithm by Govindarajan et al., our IPS algorithm improved the performance by up to +1.8% while it reduced the compilation time for IPS by 58% on average.
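A much-simplified, register-pressure-aware list scheduler gives the flavor of the integration (this toy defers instructions rather than backtracking over already-issued chains, which is the harder part of the paper's IPS algorithm; the DAG encoding is invented):

```python
# Toy pressure-limited list scheduler: prefer the highest-priority ready
# instruction whose result still fits in the register budget; if none
# fits, take the ready instruction that frees the most registers.
# Assumes each value is consumed exactly once. Illustrative only.

def schedule(insns, max_live):
    """insns: ordered [(name, uses)] where uses are producer names.
    Returns the issue order."""
    order, live, remaining = [], set(), dict(insns)
    while remaining:
        issued = set(order)
        ready = [(n, u) for n, u in remaining.items() if u <= issued]

        def pressure_after(item):
            name, uses = item
            return len((live - uses) | {name})   # uses die, name is born

        fitting = [r for r in ready if pressure_after(r) <= max_live]
        name, uses = (fitting or sorted(ready, key=pressure_after))[0]
        live = (live - uses) | {name}
        order.append(name)
        del remaining[name]
    return order

dag = [("a", set()), ("b", set()), ("c", set()),
       ("x", {"a", "b"}), ("y", {"c", "x"})]
print(schedule(dag, max_live=2))
```

With a budget of two registers the scheduler defers `c` and issues `x` first, so `a` and `b` die before a third value goes live; an unconstrained scheduler would issue `c` early and exceed the budget, which is exactly the situation where IPS undoes part of the schedule.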
Citations: 10
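The scheduling policy the abstract describes — list scheduling that tracks register pressure and reacts when the register budget is exhausted — can be illustrated with a simplified sketch. This is a hypothetical toy model, not the paper's IPS algorithm: instead of undoing already-scheduled instruction chains, it merely defers a ready instruction whose result would not fit in the register budget, and the `deps`/`latency` inputs are invented for the example.

```python
from collections import defaultdict

def list_schedule(deps, latency, num_regs):
    """Greedy list scheduling over a dependence DAG with a register budget.

    deps: dict mapping instruction -> list of instructions it depends on.
    latency: dict mapping instruction -> latency in cycles (used as the
    scheduling priority, approximating a critical-path heuristic).
    num_regs: register budget; an instruction's result occupies one
    register until its last consumer has been scheduled.
    Returns the schedule as an ordered list of instructions.
    """
    consumers = defaultdict(list)
    for insn, preds in deps.items():
        for p in preds:
            consumers[p].append(insn)

    unscheduled = set(deps)
    remaining_uses = {i: len(consumers[i]) for i in deps}
    live = set()          # results currently occupying a register
    schedule = []

    while unscheduled:
        ready = [i for i in unscheduled
                 if all(p not in unscheduled for p in deps[i])]
        # Prefer long-latency instructions first.
        ready.sort(key=lambda i: -latency[i])
        for insn in ready:
            # Issuing insn frees producers at their last use and, if insn
            # has consumers, allocates one register for its result.
            freed = {p for p in deps[insn] if remaining_uses[p] == 1}
            needs = 1 if consumers[insn] else 0
            if len(live) - len(freed) + needs <= num_regs:
                break
        else:
            insn = ready[0]   # nothing fits: accept the pressure (spill)
        for p in deps[insn]:
            remaining_uses[p] -= 1
            if remaining_uses[p] == 0:
                live.discard(p)
        if consumers[insn]:
            live.add(insn)
        unscheduled.remove(insn)
        schedule.append(insn)
    return schedule
```

The real IPS algorithm goes further: when the budget is exceeded it backtracks over a maintained candidate set and undoes the scheduled chain with the smallest increase in total execution time, rather than simply deferring one instruction.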
TEST: a Tracer for Extracting Speculative Threads
Pub Date : 2003-03-23 DOI: 10.1109/CGO.2003.1191554
M. Chen, K. Olukotun
Thread-level speculation (TLS) allows sequential programs to be arbitrarily decomposed into threads that can be safely executed in parallel. A key challenge for TLS processors is choosing thread decompositions that speed up the program. Current techniques for identifying decompositions have practical limitations in real systems: traditional parallelizing compilers do not work effectively on most integer programs, and software profiling slows down program execution too much for real-time analysis. Tracer for Extracting Speculative Threads (TEST) is hardware support that analyzes sequential program execution to estimate the performance of possible thread decompositions. This hardware is used in a dynamic parallelization system that automatically transforms unmodified, sequential Java programs to run on TLS processors. In this system, the best thread decompositions found by TEST are dynamically recompiled to run speculatively. The paper describes the analysis performed by TEST and presents simulation results demonstrating its effectiveness on real programs. Estimates are also provided showing that the tracer requires minimal hardware additions to our speculative chip-multiprocessor (< 1% of the total transistor count) and causes only minor slowdowns to programs during analysis (3-25%).
Citations: 54
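The core idea above — estimating how well a candidate thread decomposition would perform by analyzing a sequential execution — can be illustrated with a toy cost model. This is an invented sketch, not the TEST hardware or its analysis: it treats each loop iteration as a speculative thread, forces an iteration that reads a location written by an earlier iteration to restart after that producer commits, and enforces in-order commit. The `(cost, reads, writes)` trace format is an assumption made for the example.

```python
def estimate_tls_time(iterations):
    """Toy estimate of thread-level-speculative execution time.

    iterations: list of (cost, reads, writes) tuples in sequential
    order, where reads/writes are sets of memory locations touched
    by that iteration.
    Returns (sequential_time, speculative_time).
    """
    sequential_time = sum(cost for cost, _, _ in iterations)
    commits = []          # commit time of each iteration, in order
    for i, (cost, reads, _) in enumerate(iterations):
        start = 0
        for j in range(i):
            _, _, writes_j = iterations[j]
            if writes_j & reads:
                # Cross-iteration flow dependence: the violated reader
                # restarts after the producing iteration commits.
                start = max(start, commits[j])
        finish = start + cost
        # Speculative threads may overlap, but commit sequentially.
        commits.append(max(finish, commits[-1] if commits else 0))
    return sequential_time, commits[-1] if commits else 0
```

Under this model, a loop with no cross-iteration flow dependences finishes in one iteration's time, while a fully serial dependence chain gains nothing — the kind of distinction a decomposition-selection system needs to make cheaply, and which TEST computes in hardware from the sequential trace.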
Journal
International Symposium on Code Generation and Optimization, 2003. CGO 2003.