
Latest publications: International Symposium on Code Generation and Optimization (CGO'07)

Code Compaction of an Operating System Kernel
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.3
Haifeng He, J. Trimble, Somu Perianayagam, S. Debray, G. Andrews
General-purpose operating systems, such as Linux, are increasingly being used in embedded systems. Computational resources are usually limited, and embedded processors often have only a small amount of memory. This makes code size especially important. This paper describes techniques for automatically reducing the memory footprint of general-purpose operating systems on embedded platforms. The problem is complicated by the fact that kernel code tends to be quite different from ordinary application code, including the presence of a significant amount of hand-written assembly code, multiple entry points, implicit control flow paths involving interrupt handlers, and frequent indirect control flow via function pointers. We use a novel "approximate decompilation" technique to apply source-level program analysis to hand-written assembly code. A prototype implementation of our ideas on an Intel x86 platform, applied to a Linux kernel that has been configured to exclude unnecessary code, obtains a code size reduction of close to 24%.
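The kind of whole-kernel reachability analysis that code compaction depends on can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and call graph are invented, and a real kernel additionally needs the paper's approximate-decompilation step to make hand-written assembly analyzable, plus conservative treatment of address-taken functions as indirect-call targets.

```python
def reachable_functions(call_graph, entry_points, address_taken):
    """Return the set of functions that must be kept.

    call_graph:    dict mapping function name -> set of direct callees
    entry_points:  system-call handlers, interrupt handlers, init code
    address_taken: functions whose address escapes (possible indirect-call
                   targets), treated conservatively as extra roots
    """
    keep = set()
    worklist = list(entry_points) + list(address_taken)
    while worklist:
        fn = worklist.pop()
        if fn in keep:
            continue
        keep.add(fn)
        worklist.extend(call_graph.get(fn, ()))
    return keep

# Hypothetical kernel call graph with two entry points.
call_graph = {
    "sys_read": {"vfs_read"},
    "vfs_read": {"copy_to_user"},
    "irq_timer": {"tick"},
    "old_driver_fn": {"legacy_helper"},
    "tick": set(), "copy_to_user": set(), "legacy_helper": set(),
}
keep = reachable_functions(call_graph, {"sys_read", "irq_timer"}, set())
removable = set(call_graph) - keep
print(sorted(removable))  # ['legacy_helper', 'old_driver_fn']
```

Everything not reachable from any entry point (here, the unused driver path) is a candidate for removal; the difficulty the paper addresses is computing this safely in the presence of assembly code and function pointers.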
Citations: 20
GPU Computing: Programming a Massively Parallel Processor
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.13
I. Buck
Summary form only given. Many researchers have observed that general-purpose computing with programmable graphics hardware (GPUs) has shown promise for solving many of the world's compute-intensive problems, many orders of magnitude faster than conventional CPUs. The challenge has been working within the constraints of a graphics programming environment and limited language support to leverage this huge performance potential. GPU computing with CUDA is a new approach to computing in which hundreds of on-chip processor cores simultaneously communicate and cooperate to solve complex computing problems, transforming the GPU into a massively parallel processor. The NVIDIA C compiler for the GPU provides a complete development environment that gives developers the tools they need to solve new problems in computation-intensive applications such as product design, data analysis, technical computing, and game physics. In this talk, I will describe how CUDA can solve compute-intensive problems and highlight the challenges of compiling parallel programs for GPUs, including the differences between graphics shaders and CUDA applications.
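The CUDA execution model the talk describes can be illustrated with a plain-Python stand-in for a kernel launch. This is a sequential simulation only: the parameter names mirror the CUDA builtins (blockIdx, blockDim, threadIdx), but a real CUDA kernel runs each thread body in parallel on the GPU.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, n, a, x, y, out):
    # Each "thread" computes one element, identified by its global index.
    i = block_idx * block_dim + thread_idx
    if i < n:                      # standard bounds guard for the last block
        out[i] = a * x[i] + y[i]

def launch(grid_dim, block_dim, kernel, *args):
    # Sequential stand-in for a parallel grid launch: every (block, thread)
    # pair executes the same kernel body.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args)

n = 10
x = list(range(n))
y = [1.0] * n
out = [0.0] * n
block = 4
grid = (n + block - 1) // block    # ceil(n / block): enough blocks to cover n
launch(grid, block, saxpy_kernel, n, 2.0, x, y, out)
print(out)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]
```

The grid/block decomposition and the `i < n` guard are exactly the patterns a CUDA compiler must reason about; graphics shaders, by contrast, have no equivalent of this explicit thread indexing and cooperation.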
Citations: 63
Persistent Code Caching: Exploiting Code Reuse Across Executions and Applications
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.29
V. Reddi, D. Connors, R. Cohn, Michael D. Smith
Run-time compilation systems are challenged with the task of translating a program's instruction stream while maintaining low overhead. While software-managed code caches are utilized to amortize translation costs, they are ineffective for programs with short run times or large amounts of cold code. Such program characteristics are prevalent in real-life computing environments, ranging from graphical user interface (GUI) programs to large-scale applications such as database management systems. Persistent code caching addresses these issues. It is described and evaluated in an industry-strength dynamic binary instrumentation system, Pin. The proposed approach improves the intra-execution model of code reuse by storing and reusing translations across executions, thereby achieving inter-execution persistence. Dynamically linked programs leverage inter-application persistence by using persistent translations of library code generated by other programs. New translations discovered across executions are automatically accumulated into the persistent code caches, thereby improving performance over time. Inter-execution persistence improves the performance of GUI applications by nearly 90%, while inter-application persistence achieves a 59% improvement. In more specialized uses, the SPEC2K INT benchmark suite experiences a 26% improvement under dynamic binary instrumentation. Finally, a 400% speedup is achieved in translating the Oracle database in a regression testing environment.
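The core persistence idea can be sketched in a few lines: key each translation by a hash of the original code bytes and save the cache to disk, so a later execution (or a different program translating the same shared-library code) reuses it instead of retranslating. The translator, cache file name, and on-disk format below are all invented for illustration; Pin's actual persistent caches store generated machine code, not strings.

```python
import hashlib
import json
import os

CACHE_FILE = "pcache.json"   # hypothetical on-disk cache

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

translations = 0
def translate(code_bytes):
    global translations
    translations += 1            # stands in for the expensive translation work
    return code_bytes[::-1]      # dummy "translated" code

def get_translation(cache, code_bytes):
    # Content hash as the cache key: identical code regions across runs
    # and across programs map to the same persisted translation.
    key = hashlib.sha256(code_bytes).hexdigest()
    if key not in cache:
        cache[key] = translate(code_bytes).decode("latin1")
    return cache[key]

# First "execution": translates and persists.
cache = load_cache()
get_translation(cache, b"\x55\x89\xe5")
save_cache(cache)

# Second "execution": hits the persistent cache, no retranslation.
cache2 = load_cache()
get_translation(cache2, b"\x55\x89\xe5")
print(translations)  # 1 -- the second run reused the persisted translation
os.remove(CACHE_FILE)
```

Accumulating newly discovered translations into the same file on each run is what lets performance improve over time, as the abstract describes.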
Citations: 41
Structure Layout Optimization for Multithreaded Programs
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.36
Easwaran Raman, R. Hundt, Sandya Mannarswamy
Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that simultaneously optimizes for improved spatial locality and reduced false sharing. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool to the HP-UX kernel and demonstrate the effects of these transformations on a variety of already highly hand-tuned key structures with different sets of properties. We show that naive heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline.
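The two forces this paper balances can be seen at small scale with `ctypes`: field order changes padding (spatial locality), while in multithreaded code two hot fields written by different threads should not share a cache line (false sharing). The sizes in the comments assume typical x86-64 alignment and a 64-byte cache line; the structures themselves are invented examples, not the paper's.

```python
import ctypes

class Poor(ctypes.Structure):
    _fields_ = [("a", ctypes.c_char),      # 1 byte + 7 bytes padding
                ("b", ctypes.c_longlong),  # 8 bytes, 8-byte aligned
                ("c", ctypes.c_char)]      # 1 byte + 7 bytes trailing padding

class Better(ctypes.Structure):
    _fields_ = [("b", ctypes.c_longlong),  # widest field first
                ("a", ctypes.c_char),
                ("c", ctypes.c_char)]      # chars packed together

print(ctypes.sizeof(Poor), ctypes.sizeof(Better))  # 24 16

CACHE_LINE = 64

class PerThreadCounter(ctypes.Structure):
    # Pad each counter to a full cache line so two threads incrementing
    # adjacent counters in an array do not invalidate each other's line
    # (the false-sharing fix, at the cost of extra space).
    _fields_ = [("value", ctypes.c_longlong),
                ("pad", ctypes.c_char * (CACHE_LINE - 8))]

print(ctypes.sizeof(PerThreadCounter))  # 64
```

The tension is visible even here: packing fields tightly helps a single thread's locality, while padding them apart helps concurrent writers, which is why single-threaded layout heuristics can backfire on multiprocessor workloads.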
Citations: 31
Compilation Techniques for Real-Time Java Programs
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.5
M. Fulton, Mark G. Stoodley
In this paper, we introduce the IBM® WebSphere® Real Time product, which incorporates a virtual machine that is fully Java™ compliant as well as compliant with the Real-Time Specification for Java (RTSJ). We describe IBM's real-time Java enhancements, particularly in the areas of our Testarossa (TR) ahead-of-time (AOT) compiler, our TR just-in-time (JIT) compiler, and our Metronome (Bacon, et al., 2003) deterministic garbage collector (GC). The main focus of this paper is on the various techniques employed by the TR compilers to optimize and regulate the performance of code running in a real-time Java environment, illustrated through a simple Java source code example. Through the example, we highlight the additional checks required to provide a conformant RTSJ implementation, as well as the performance issues with ahead-of-time code generation and the overheads required to support Metronome. We show how these checks are implemented in a production JVM, and then report the cost of the real-time changes in practice for the example as well as for the SPECjvm98 benchmark suite, SPECjbb2000, and SPECjbb2005.
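One example of the "additional checks" a conformant RTSJ implementation must compile in is the memory-area assignment rule: an object in a longer-lived area may not hold a reference to an object in a shorter-lived scoped area, since the scope may be reclaimed while the reference is still live. The sketch below models areas as nesting depths (heap/immortal at depth 0); this is a simplified illustration of the rule, not IBM's implementation, though `IllegalAssignmentError` is the real RTSJ error name.

```python
class MemoryArea:
    def __init__(self, name, depth):
        self.name = name
        self.depth = depth  # deeper scopes are shorter-lived

def check_assignment(target_area, value_area):
    """Raise if storing a reference to an object in value_area inside an
    object in target_area would violate the RTSJ assignment rules."""
    if value_area.depth > target_area.depth:
        raise RuntimeError(
            f"IllegalAssignmentError: {target_area.name} object may not "
            f"reference {value_area.name} object")

heap = MemoryArea("heap", 0)
outer = MemoryArea("outer scope", 1)
inner = MemoryArea("inner scope", 2)

check_assignment(inner, heap)      # ok: inner may point to longer-lived heap
try:
    check_assignment(heap, inner)  # heap -> scoped: forbidden
except RuntimeError as e:
    print(e)
```

Because this check guards every reference store, compiling it cheaply (and eliding it where the compiler can prove it redundant) is exactly the kind of optimization problem the paper's example walks through.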
Citations: 20
Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.11
Amir Hormati, Nathan Clark, S. Mahlke
The demand for high performance has driven acyclic computation accelerators into extensive use in modern embedded and desktop architectures. Accelerators that are ideal from a software perspective are, however, difficult or impossible to integrate into many modern architectures due to area and timing requirements. This reality is coupled with the observation that many application domains under-utilize accelerator hardware because of the narrow data they operate on and the nature of their computation. In this work, we take advantage of these facts to design accelerators capable of executing in modern architectures by narrowing datapath width and reducing interconnect. Novel compiler techniques are developed in order to generate high-quality code for the reduced-cost accelerators and prevent performance loss to the extent possible. First, data width profiling is used to statistically determine how wide program data will be at run time. This information is used by the subgraph mapping algorithm to optimally select subgraphs for execution on targeted narrow accelerators. Overall, our data-centric compilation techniques achieve on average 6.5%, and up to 12%, speedup over previous subgraph mapping algorithms for 8-bit accelerators. We also show that, with appropriate compiler support, the increase in the total number of execution cycles in reduced-interconnect accelerators is less than 1% relative to the fully-connected accelerator.
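The data-width-profiling step can be sketched as follows: record the maximum bit width each operation's result actually needs at run time, then admit a subgraph to the narrow accelerator only when every operation in it fits. The trace format and operation names are invented; the paper's mapper additionally optimizes which subgraphs to select, not just whether they fit.

```python
def bit_width(value):
    # Bits needed to represent a non-negative result (at least 1).
    return max(1, value.bit_length())

def profile_widths(trace):
    """trace: list of (op_name, result_value) pairs observed at run time.
    Returns op_name -> maximum observed result width in bits."""
    widths = {}
    for op, value in trace:
        widths[op] = max(widths.get(op, 0), bit_width(value))
    return widths

def subgraph_fits(subgraph_ops, widths, accel_width=8):
    # An unprofiled op is assumed too wide (conservative default).
    return all(widths.get(op, accel_width + 1) <= accel_width
               for op in subgraph_ops)

trace = [("add1", 200), ("add1", 17), ("mul1", 70000), ("and1", 255)]
widths = profile_widths(trace)
print(widths)                                   # {'add1': 8, 'mul1': 17, 'and1': 8}
print(subgraph_fits(["add1", "and1"], widths))  # True: maps to 8-bit accelerator
print(subgraph_fits(["add1", "mul1"], widths))  # False: mul1 needs 17 bits
```

Since the widths are profiled rather than proven, a real system also needs a run-time check or recovery path for the rare inputs that exceed the profiled width.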
Citations: 14
On the Complexity of Register Coalescing
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.26
Florent Bouchez, A. Darte, F. Rastello
Memory transfers are becoming more important to optimize, for both performance and power consumption. With this goal in mind, new register allocation schemes are being developed, which revisit not only the spilling problem but also the coalescing problem. Indeed, a more aggressive strategy to avoid load/store instructions may increase the need to suppress (coalesce) move instructions. This paper is devoted to the complexity of the coalescing phase, in particular in light of recent developments on the SSA form. We distinguish several optimizations that occur in coalescing heuristics: a) aggressive coalescing removes as many moves as possible, regardless of the colorability of the resulting interference graph; b) conservative coalescing removes as many moves as possible while keeping the colorability of the graph; c) incremental conservative coalescing removes one particular move while keeping the colorability of the graph; d) optimistic coalescing coalesces moves aggressively, then gives up as few moves as possible so that the graph becomes colorable again. We almost completely classify the NP-completeness of these problems, also discussing the structure of the interference graph: arbitrary, chordal, or k-colorable in a greedy fashion. We believe that such a study is a necessary step for designing new coalescing strategies.
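A concrete instance of conservative coalescing (variant b above) is the classical Briggs test: merging two move-related nodes is allowed only if the merged node would have fewer than k neighbors of degree ≥ k, which guarantees the graph stays greedily k-colorable. This sketch illustrates that standard criterion, not the paper's complexity results; the graphs are invented two-register (k = 2) examples.

```python
def merged_degree(graph, n, a, b):
    # Degree of neighbor n once a and b are merged into a single node.
    nbrs = set(graph[n])
    if a in nbrs or b in nbrs:
        nbrs -= {a, b}
        nbrs.add(("merged", a, b))
    return len(nbrs)

def briggs_coalesce_ok(graph, a, b, k):
    # Conservative (Briggs) test: the merged node must have fewer than k
    # neighbors of "significant" degree (>= k) in the merged graph.
    merged_nbrs = (graph[a] | graph[b]) - {a, b}
    significant = sum(1 for n in merged_nbrs
                      if merged_degree(graph, n, a, b) >= k)
    return significant < k

# a and b are move-related; edges are interferences.
g1 = {"a": {"x"}, "b": {"y"}, "x": {"a"}, "y": {"b"}}
print(briggs_coalesce_ok(g1, "a", "b", k=2))  # True: safe to coalesce

g2 = {"a": {"x"}, "b": {"y"}, "x": {"a", "y"}, "y": {"b", "x"}}
print(briggs_coalesce_ok(g2, "a", "b", k=2))  # False: merge could break 2-colorability
```

The test is sound but incomplete: it can reject merges that are in fact colorable, which is exactly the gap between heuristics like this and the optimal problems whose complexity the paper classifies.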
Citations: 50
Shadow Profiling: Hiding Instrumentation Costs with Parallelism
Pub Date : 2007-03-11 DOI: 10.1109/CGO.2007.35
Tipp Moseley, Alex Shye, V. Reddi, D. Grunwald, R. Peri
In profiling, a tradeoff exists between information and overhead. For example, hardware-sampling profilers incur negligible overhead, but the information they collect is consequently very coarse. Other profilers use instrumentation tools to gather temporal traces such as path profiles and hot memory streams, but they have high overhead. Runtime and feedback-directed compilation systems need detailed information to optimize aggressively, but the cost of gathering profiles can outweigh the benefits. Shadow profiling is a novel method for sampling long traces of instrumented code in parallel with normal execution, taking advantage of the trend toward increasing numbers of cores. Each instrumented sample can be many millions of instructions in length. The primary goal is to incur negligible overhead, yet attain profile information that is nearly as accurate as a perfect profile. The profiler requires no modifications to the operating system or hardware, and is tunable to allow for greater coverage or lower overhead. We evaluate the performance and accuracy of this new profiling technique for two common types of instrumentation-based profiles: interprocedural path profiling and value profiling. Overall, profiles collected using the shadow profiling framework are 94% accurate versus perfect value profiles, while incurring less than 1% overhead. Consequently, this technique increases the viability of dynamic and continuous optimization systems by hiding the high overhead of instrumentation and enabling the online collection of many types of profiles that were previously too costly.
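The central idea can be sketched as follows: the main execution runs uninstrumented at full speed, while a "shadow" copy runs instrumented on a spare core and collects the profile off the critical path. Shadow profiling actually forks the process to get a consistent shadow copy; this sketch uses a thread and duplicated inputs purely for brevity, and the value-profiling instrumentation here is a stand-in.

```python
import threading
from collections import Counter

def workload(data):
    # Uninstrumented version: what the user actually runs.
    return sum(x * x for x in data)

def instrumented_workload(data, profile):
    # Shadow version: same computation, plus value-profiling instrumentation
    # that counts how often each operand value is observed.
    total = 0
    for x in data:
        profile[x] += 1
        total += x * x
    return total

data = [1, 2, 2, 3] * 1000
profile = Counter()
shadow = threading.Thread(target=instrumented_workload, args=(data, profile))
shadow.start()                 # shadow profiles in parallel...
result = workload(data)        # ...while the main path runs uninstrumented
shadow.join()

print(result)                  # 18000
print(profile[2])              # 2000 -- gathered without slowing the main path
```

Because the shadow only samples (it need not keep up with the main execution), the profile is approximate, which is the source of the 94%-accuracy-versus-1%-overhead tradeoff the abstract reports.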
Citations: 96