Haifeng He, J. Trimble, Somu Perianayagam, S. Debray, G. Andrews
General-purpose operating systems, such as Linux, are increasingly being used in embedded systems. Computational resources on such systems are usually limited, and embedded processors often have a limited amount of memory, which makes code size especially important. This paper describes techniques for automatically reducing the memory footprint of general-purpose operating systems on embedded platforms. The problem is complicated by the fact that kernel code tends to be quite different from ordinary application code: it contains a significant amount of hand-written assembly code, has multiple entry points and implicit control-flow paths involving interrupt handlers, and makes frequent indirect control transfers via function pointers. We use a novel "approximate decompilation" technique to apply source-level program analysis to hand-written assembly code. A prototype implementation of our ideas on an Intel x86 platform, applied to a Linux kernel configured to exclude unnecessary code, obtains a code size reduction of close to 24%.
{"title":"Code Compaction of an Operating System Kernel","authors":"Haifeng He, J. Trimble, Somu Perianayagam, S. Debray, G. Andrews","doi":"10.1109/CGO.2007.3","DOIUrl":"https://doi.org/10.1109/CGO.2007.3","url":null,"abstract":"General-purpose operating systems, such as Linux, are increasingly being used in embedded systems. Computational resources are usually limited, and embedded processors often have a limited amount of memory. This makes code size especially important. This paper describes techniques for automatically reducing the memory footprint of general-purpose operating systems on embedded platforms. The problem is complicated by the fact that kernel code tends to be quite different from ordinary application code, including the presence of a significant amount of hand-written assembly code, multiple entry points, implicit control flow paths involving interrupt handlers, and frequent indirect control flow via function pointers. We use a novel \"approximate decompilation\" technique to apply source-level program analysis to hand-written assembly code. A prototype implementation of our ideas on an Intel x86 platform, applied to a Linux kernel that has been configured to exclude unnecessary code, obtains a code size reduction of close to 24%","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116432385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary form only given. Many researchers have observed that general-purpose computing with programmable graphics hardware (GPUs) shows promise for solving many of the world's compute-intensive problems, many orders of magnitude faster than conventional CPUs. The challenge has been working within the constraints of a graphics programming environment, with limited language support, to leverage this huge performance potential. GPU computing with CUDA is a new approach to computing in which hundreds of on-chip processor cores simultaneously communicate and cooperate to solve complex computing problems, transforming the GPU into a massively parallel processor. The NVIDIA C compiler for the GPU provides a complete development environment that gives developers the tools they need to solve new problems in computation-intensive applications such as product design, data analysis, technical computing, and game physics. In this talk, I describe how CUDA can solve compute-intensive problems and highlight the challenges of compiling parallel programs for GPUs, including the differences between graphics shaders and CUDA applications.
{"title":"GPU Computing: Programming a Massively Parallel Processor","authors":"I. Buck","doi":"10.1109/CGO.2007.13","DOIUrl":"https://doi.org/10.1109/CGO.2007.13","url":null,"abstract":"Summary form only given. Many researchers have observed that general purpose computing with programmable graphics hardware (GPUs) has shown promise to solve many of the world's compute intensive problems, many orders of magnitude faster the conventional CPUs. The challenge has been working within the constraints of a graphics programming environment and limited language support to leverage this huge performance potential. GPU computing with CUDA is a new approach to computing where hundreds of on-chip processor cores simultaneously communicate and cooperate to solve complex computing problems, transforming the GPU into a massively parallel processor. The NVIDIA C-compiler for the GPU provides a complete development environment that gives developers the tools they need to solve new problems in computation-intensive applications such as product design, data analysis, technical computing, and game physics. In this talk, I will provide a description of how CUDA can solve compute intensive problems and highlight the challenges when compiling parallel programs for GPUs including the differences between graphics shaders vs. CUDA applications","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121538386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Run-time compilation systems are challenged with the task of translating a program's instruction stream while maintaining low overhead. While software-managed code caches are used to amortize translation costs, they are ineffective for programs with short run times or large amounts of cold code. Such program characteristics are prevalent in real-life computing environments, ranging from graphical user interface (GUI) programs to large-scale applications such as database management systems. Persistent code caching addresses these issues. It is described and evaluated in an industry-strength dynamic binary instrumentation system, Pin. The proposed approach improves on the intra-execution model of code reuse by storing and reusing translations across executions, thereby achieving inter-execution persistence. Dynamically linked programs leverage inter-application persistence by using persistent translations of library code generated by other programs. New translations discovered across executions are automatically accumulated into the persistent code caches, improving performance over time. Inter-execution persistence improves the performance of GUI applications by nearly 90%, while inter-application persistence achieves a 59% improvement. In more specialized uses, the SPEC2K INT benchmark suite experiences a 26% improvement under dynamic binary instrumentation. Finally, a 400% speedup is achieved in translating the Oracle database in a regression-testing environment.
{"title":"Persistent Code Caching: Exploiting Code Reuse Across Executions and Applications","authors":"V. Reddi, D. Connors, R. Cohn, Michael D. Smith","doi":"10.1109/CGO.2007.29","DOIUrl":"https://doi.org/10.1109/CGO.2007.29","url":null,"abstract":"Run-time compilation systems are challenged with the task of translating a program's instruction stream while maintaining low overhead. While software managed code caches are utilized to amortize translation costs, they are ineffective for programs with short run times or large amounts of cold code. Such program characteristics are prevalent in real-life computing environments, ranging from graphical user interface (GUI) programs to large-scale applications such as database management systems. Persistent code caching addresses these issues. It is described and evaluated in an industry-strength dynamic binary instrumentation system - Pin. The proposed approach improves the intra-execution model of code reuse by storing and reusing translations across executions, thereby achieving inter-execution persistence. Dynamically linked programs leverage inter-application persistence by using persistent translations of library code generated by other programs. New translations discovered across executions are automatically accumulated into the persistent code caches, thereby improving performance over time. Inter-execution persistence improves the performance of GUI applications by nearly 90%, while inter-application persistence achieves a 59% improvement. In more specialized uses, the SPEC2K INT benchmark suite experiences a 26% improvement under dynamic binary instrumentation. Finally, a 400% speedup is achieved in translating the Oracle database in a regression testing environment","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126053814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure-layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing must be taken into account. In this paper we propose a technique for structure-layout transformations for multithreaded applications that optimizes simultaneously for improved spatial locality and reduced false sharing. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and reports the key factors contributing to its layout decisions. We apply this tool to the HP-UX kernel and demonstrate the effects of these transformations on a variety of already highly hand-tuned key structures with different sets of properties. We show that naive heuristics can cause massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline.
{"title":"Structure Layout Optimization for Multithreaded Programs","authors":"Easwaran Raman, R. Hundt, Sandya Mannarswamy","doi":"10.1109/CGO.2007.36","DOIUrl":"https://doi.org/10.1109/CGO.2007.36","url":null,"abstract":"Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that optimizes both for improved spatial locality and reduced false sharing, simultaneously. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool on the HP-UX kernel and demonstrate the effects of these transformations for a variety of already highly hand-tuned key structures with different set of properties. We show that naive heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122359722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we introduce the IBM® WebSphere® Real Time product, which incorporates a virtual machine that is fully Java™-compliant as well as compliant with the Real-Time Specification for Java (RTSJ). We describe IBM's real-time Java enhancements, particularly our Testarossa (TR) ahead-of-time (AOT) compiler, our TR just-in-time (JIT) compiler, and our Metronome (Bacon et al., 2003) deterministic garbage collector (GC). The main focus of this paper is the various techniques employed by the TR compilers to optimize and regulate the performance of code running in a real-time Java environment, presented through a simple Java source-code example. Through the example, we highlight the additional checks required to provide a conformant RTSJ implementation, the performance issues with ahead-of-time code generation, and the overheads required to support Metronome. We show how these checks are implemented in a production JVM, and then report the cost of the real-time changes in practice for the example as well as for the SPECjvm98 benchmark suite, SPECjbb2000, and SPECjbb2005.
{"title":"Compilation Techniques for Real-Time Java Programs","authors":"M. Fulton, Mark G. Stoodley","doi":"10.1109/CGO.2007.5","DOIUrl":"https://doi.org/10.1109/CGO.2007.5","url":null,"abstract":"In this paper, we introduce the IBMreg WebSpherereg real time product, which incorporates a virtual machine that is fully Javatrade compliant as well as compliant with the Real-Time Specification for Java (RTSJ). We describe IBM's real-time Java enhancements, particularly in the area of our Testarossa (TR) ahead-of-time (AOT) compiler, our TR just-in-time (JIT) compiler, and our Metronome (Bacon, et al., 2003) deterministic garbage collector (GC). The main focus of this paper is on the various techniques employed by the TR compilers to optimize and regulate the performance of code running in a real-time Java environment through a simple Java source code example. Through the example, we highlight the additional checks required to provide a conformant RTSJ implementation as well as the performance issues with ahead-of-time code generation and the overheads required to support Metronome. We show how these checks are implemented in a production JVM, and then report the cost of the real-time changes in practice for the example as well as the SPECjvm98 benchmark suite, SPECjbb2000, and SPECjbb2005","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"33 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122965666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The demand for high performance has driven acyclic computation accelerators into extensive use in modern embedded and desktop architectures. Accelerators that are ideal from a software perspective, however, are difficult or impossible to integrate into many modern architectures due to area and timing requirements. This reality is coupled with the observation that many application domains under-utilize accelerator hardware because of the narrow data they operate on and the nature of their computation. In this work, we take advantage of these facts to design accelerators capable of executing in modern architectures by narrowing datapath width and reducing interconnect. Novel compiler techniques are developed to generate high-quality code for the reduced-cost accelerators and prevent performance loss to the extent possible. First, data-width profiling is used to statistically determine how wide program data will be at run time. This information is then used by the subgraph-mapping algorithm to optimally select subgraphs for execution on the targeted narrow accelerators. Overall, our data-centric compilation techniques achieve on average a 6.5%, and up to 12%, speedup over previous subgraph-mapping algorithms for 8-bit accelerators. We also show that, with appropriate compiler support, the increase in the total number of execution cycles on reduced-interconnect accelerators is less than 1% relative to a fully connected accelerator.
{"title":"Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping","authors":"Amir Hormati, Nathan Clark, S. Mahlke","doi":"10.1109/CGO.2007.11","DOIUrl":"https://doi.org/10.1109/CGO.2007.11","url":null,"abstract":"The demand for high performance has driven acyclic computation accelerators into extensive use in modern embedded and desktop architectures. Accelerators that are ideal from a software perspective, are difficult or impossible to integrate in many modern architectures, though, due to area and timing requirements. This reality is coupled with the observation that many application domains under-utilize accelerator hardware, because of the narrow data they operate on and the nature of their computation. In this work, we take advantage of these facts to design accelerators capable of executing in modern architectures by narrowing datapath width and reducing interconnect. Novel compiler techniques are developed in order to generate high-quality code for the reduced-cost accelerators and prevent performance loss to the extent possible. First, data width profiling is used to statistically determine how wide program data will be at run time. This information is used by the subgraph mapping algorithm to optimally select subgraphs for execution on targeted narrow accelerators. Overall, our data-centric compilation techniques achieve on average 6.5%, and up to 12%, speed up over previous subgraph mapping algorithms for 8-bit accelerators. We also show that, with appropriate compiler support, the increase in the total number of execution cycles in reduced-interconnect accelerators is less than 1% of the fully-connected accelerator","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127595117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory transfers are becoming more important to optimize, for both performance and power consumption. With this goal in mind, new register allocation schemes have been developed that revisit not only the spilling problem but also the coalescing problem. Indeed, a more aggressive strategy for avoiding load/store instructions may increase the constraints on suppressing (coalescing) move instructions. This paper is devoted to the complexity of the coalescing phase, particularly in light of recent developments on the SSA form. We distinguish several optimizations that occur in coalescing heuristics: a) aggressive coalescing removes as many moves as possible, regardless of the colorability of the resulting interference graph; b) conservative coalescing removes as many moves as possible while preserving the colorability of the graph; c) incremental conservative coalescing removes one particular move while preserving the colorability of the graph; d) optimistic coalescing coalesces moves aggressively, then gives up as few moves as possible so that the graph becomes colorable again. We almost completely classify the NP-completeness of these problems, also discussing the structure of the interference graph: arbitrary, chordal, or greedily k-colorable. We believe such a study is a necessary step for designing new coalescing strategies.
{"title":"On the Complexity of Register Coalescing","authors":"Florent Bouchez, A. Darte, F. Rastello","doi":"10.1109/CGO.2007.26","DOIUrl":"https://doi.org/10.1109/CGO.2007.26","url":null,"abstract":"Memory transfers are becoming more important to optimize, for both performance and power consumption. With this goal in mind, new register allocation schemes are developed, which revisit not only the spilling problem but also the coalescing problem. Indeed, a more aggressive strategy to avoid load/store instructions may increase the constraints to suppress (coalesce) move instructions. This paper is devoted to the complexity of the coalescing phase, in particular in the light of recent developments on the SSA form. We distinguish several optimizations that occur in coalescing heuristics: a) aggressive coalescing removes as many moves as possible, regardless of the colorability of the resulting interference graph; b) conservative coalescing removes as many moves as possible while keeping the colorability of the graph; c) incremental conservative coalescing removes one particular move while keeping the colorability of the graph; d) optimistic coalescing coalesces moves aggressively, then gives up about as few moves as possible so that the graph becomes colorable again. We almost completely classify the NP-completeness of these problems, discussing also on the structure of the interference graph: arbitrary, chordal, or k-colorable in a greedy fashion. We believe that such a study is a necessary step for designing new coalescing strategies","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125312309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tipp Moseley, Alex Shye, V. Reddi, D. Grunwald, R. Peri
In profiling, a tradeoff exists between information and overhead. For example, hardware-sampling profilers incur negligible overhead, but the information they collect is consequently very coarse. Other profilers use instrumentation tools to gather temporal traces such as path profiles and hot memory streams, but they have high overhead. Runtime and feedback-directed compilation systems need detailed information to optimize aggressively, but the cost of gathering profiles can outweigh the benefits. Shadow profiling is a novel method for sampling long traces of instrumented code in parallel with normal execution, taking advantage of the trend toward increasing numbers of cores. Each instrumented sample can be many millions of instructions in length. The primary goal is to incur negligible overhead yet attain profile information that is nearly as accurate as a perfect profile. The profiler requires no modifications to the operating system or hardware, and is tunable to allow for greater coverage or lower overhead. We evaluate the performance and accuracy of this new profiling technique for two common types of instrumentation-based profiles: interprocedural path profiling and value profiling. Overall, profiles collected using the shadow profiling framework are 94% accurate relative to perfect value profiles, while incurring less than 1% overhead. Consequently, this technique increases the viability of dynamic and continuous optimization systems by hiding the high overhead of instrumentation and enabling the online collection of many types of profiles that were previously too costly.
{"title":"Shadow Profiling: Hiding Instrumentation Costs with Parallelism","authors":"Tipp Moseley, Alex Shye, V. Reddi, D. Grunwald, R. Peri","doi":"10.1109/CGO.2007.35","DOIUrl":"https://doi.org/10.1109/CGO.2007.35","url":null,"abstract":"In profiling, a tradeoff exists between information and overhead. For example, hardware-sampling profilers incur negligible overhead, but the information they collect is consequently very coarse. Other profilers use instrumentation tools to gather temporal traces such as path profiles and hot memory streams, but they have high overhead. Runtime and feedback-directed compilation systems need detailed information to aggressively optimize, but the cost of gathering profiles can outweigh the benefits. Shadow profiling is a novel method for sampling long traces of instrumented code in parallel with normal execution, taking advantage of the trend of increasing numbers of cores. Each instrumented sample can be many millions of instructions in length. The primary goal is to incur negligible overhead, yet attain profile information that is nearly as accurate as a perfect profile. The profiler requires no modifications to the operating system or hardware, and is tunable to allow for greater coverage or lower overhead. We evaluate the performance and accuracy of this new profiling technique for two common types of instrumentation-based profiles: interprocedural path profiling and value profiling. Overall, profiles collected using the shadow profiling framework are 94% accurate versus perfect value profiles, while incurring less than 1% overhead. Consequently, this technique increases the viability of dynamic and continuous optimization systems by hiding the high overhead of instrumentation and enabling the online collection of many types of profiles that were previously too costly","PeriodicalId":244171,"journal":{"name":"International Symposium on Code Generation and Optimization (CGO'07)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127242937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}