Optimizing direct threaded code by selective inlining
Ian Piumarta and F. Riccardi. PLDI '98. DOI: 10.1145/277650.277743

Achieving good performance in bytecoded language interpreters is difficult without sacrificing both simplicity and portability. This is due to the complexity of dynamic translation ("just-in-time compilation") of bytecodes into native code, the mechanism employed universally by high-performance interpreters.

We demonstrate that a few simple techniques make it possible to create highly portable dynamic translators that can attain as much as 70% of the performance of optimized C for certain numerical computations. Translators based on such techniques can offer respectable performance without sacrificing either the simplicity or portability of much slower "pure" bytecode interpreters.
Partial online cycle elimination in inclusion constraint graphs
Manuel Fähndrich, J. Foster, Z. Su, and A. Aiken. PLDI '98. DOI: 10.1145/277650.277667

Many program analyses are naturally formulated and implemented using inclusion constraints. We present new results on the scalable implementation of such analyses, based on two insights: first, that online elimination of cyclic constraints yields orders-of-magnitude improvements in analysis time for large problems; second, that the choice of constraint representation affects the quality and efficiency of online cycle elimination. We present an analytical model that explains our design choices and show that the model's predictions match well with results from a substantial experiment.
A study of dead data members in C++ applications
P. Sweeney and F. Tip. PLDI '98. DOI: 10.1145/277650.277750

Object-oriented applications may contain data members that can be removed from the application without affecting program behavior. Such "dead" data members may occur due to unused functionality in class libraries, or due to the programmer losing track of member usage as the application changes over time. We present a simple and efficient algorithm for detecting dead data members in C++ applications. This algorithm has been implemented using a prototype version of the IBM VisualAge C++ compiler, and applied to a number of realistic benchmark programs ranging from 600 to 58,000 lines of code. For the non-trivial benchmarks, we found that up to 27.3% of the data members in the benchmarks are dead (average 12.5%), and that up to 11.6% of the object space of these applications may be occupied by dead data members at run-time (average 4.4%).
Scalable cross-module optimization
A. Ayers, Stuart de Jong, John Peyton, and R. Schooler. PLDI '98. DOI: 10.1145/277650.277745

Large applications are typically partitioned into separately compiled modules. Large performance gains in these applications are available by optimizing across module boundaries. One barrier to applying cross-module optimization (CMO) to large applications is the potentially enormous amount of time and space consumed by the optimization process.

We describe a framework for scalable CMO that provides large gains in performance on applications that contain millions of lines of code. Two major techniques are described. First, careful management of in-memory data structures keeps memory occupancy sub-linear in the number of lines of code being optimized. Second, profile data is used to focus optimization effort on the performance-critical portions of applications. We also present practical issues that arise in deploying this framework in a production environment, including debuggability and compatibility with existing development tools such as make. Our framework is deployed in Hewlett-Packard's (HP) UNIX compiler products and speeds up shipped independent software vendors' applications by as much as 71%.
Register promotion by sparse partial redundancy elimination of loads and stores
R. Lo, Fred C. Chow, Robert Kennedy, Shin-Ming Liu, and P. Tu. PLDI '98. DOI: 10.1145/277650.277659

An algorithm for register promotion is presented, based on the observation that the circumstances for promoting a memory location's value to a register coincide with situations where the program exhibits partial redundancy between accesses to the memory location. The recent SSAPRE algorithm for eliminating partial redundancy using a sparse SSA representation forms the foundation for the present algorithm to eliminate redundancy among memory accesses, enabling us to achieve both computational and live-range optimality in our register promotion results. We discuss how to effect speculative code motion in the SSAPRE framework. We present two different algorithms for performing speculative code motion: the conservative speculation algorithm used in the absence of profile data, and the profile-driven speculation algorithm used when profile data are available. We define the static single use (SSU) form and develop the dual of the SSAPRE algorithm, called SSUPRE, to perform the partial redundancy elimination of stores. We provide measurement data on the SPECint95 benchmark suite to demonstrate the effectiveness of our register promotion approach in removing loads and stores. We also study the relative performance of the different speculative code motion strategies when applied to scalar loads and stores.
Generational stack collection and profile-driven pretenuring
P. Cheng, R. Harper, and Peter Lee. PLDI '98. DOI: 10.1145/277650.277718

This paper presents two techniques for improving garbage collection performance: generational stack collection and profile-driven pretenuring. The first is applicable to stack-based implementations of functional languages, while the second is useful for any generational collector. We have implemented both techniques in a generational collector used by the TIL compiler (Tarditi, Morrisett, Cheng, Stone, Harper, and Lee 1996), and have observed decreases in garbage collection times of as much as 70% and 30%, respectively.

Functional languages encourage the use of recursion, which can lead to a long chain of activation records. When a collection occurs, these activation records must be scanned for roots. We show that scanning many activation records can take so long as to become the dominant cost of garbage collection. However, most deep stacks unwind very infrequently, so most of the root information obtained from the stack remains unchanged across successive garbage collections. Generational stack collection greatly reduces the stack scan cost by reusing information from previous scans.

Generational techniques have been successful in reducing the cost of garbage collection (Ungar 1984). Various complex heap arrangements and tenuring policies have been proposed to increase the effectiveness of generational techniques by reducing the cost and frequency of scanning and copying. In contrast, we show that by using profile information to make lifetime predictions, pretenuring can avoid copying data altogether. In essence, this technique uses a refinement of the generational hypothesis (most data die young) together with a locality principle concerning the age of data: most allocation sites produce data that dies immediately, while a few allocation sites consistently produce data that survives many collections.
Improving performance by branch reordering
Minghui Yang, Gang-Ryung Uh, and D. Whalley. PLDI '98. DOI: 10.1145/277650.277711

The conditional branch has long been considered an expensive operation, and its relative cost has increased as recently designed machines rely on deeper pipelines and wider multiple issue. Reducing the number of conditional branches executed can often yield a substantial performance benefit. This paper describes a code-improving transformation to reorder sequences of conditional branches. First, sequences of branches that can be reordered are detected in the control flow. Second, profiling information is collected to predict the probability that each branch will transfer control out of the sequence. Third, the cost of performing each conditional branch is estimated. Fourth, the most beneficial ordering of the branches, based on the estimated probabilities and costs, is selected; the most beneficial ordering often includes the insertion of additional conditional branches that did not previously exist in the sequence. Finally, the control flow is restructured to reflect the new ordering. Applying the transformation produced significant reductions in the dynamic number of instructions and branches, as well as decreases in execution time.
A new algorithm for scalar register promotion based on SSA form
A. V. S. Sastry and R. Ju. PLDI '98. DOI: 10.1145/277650.277656

We present a new register promotion algorithm based on Static Single Assignment (SSA) form. Register promotion aims to promote program names from memory locations to registers. Our algorithm is profile-driven and is based on the scope of intervals. In cases where a complete promotion is not possible because of the presence of function calls or pointer references, the proposed algorithm is capable of eliminating loads and stores on frequently executed paths by placing loads and stores on less frequently executed paths. We also describe an efficient method to incrementally update SSA form when new definitions are cloned from an existing name during register promotion. On SPECint95 benchmarks, our algorithm removes about 12% of memory operations that access scalar variables.
The design and implementation of a certifying compiler
G. Necula and Peter Lee. PLDI '98. DOI: 10.1145/277650.277752

This paper presents the design and implementation of a compiler that translates programs written in a type-safe subset of the C programming language into highly optimized DEC Alpha assembly language programs, and a certifier that automatically checks the type safety and memory safety of any assembly language program produced by the compiler. The result of the certifier is either a formal proof of type safety or a counterexample pointing to a potential violation of the type system by the target program. The ensemble of the compiler and the certifier is called a certifying compiler.

Several advantages of certifying compilation over previous approaches can be claimed. The notion of a certifying compiler is significantly easier to employ than formal compiler verification, in part because it is generally easier to verify the correctness of the result of a computation than to prove the correctness of the computation itself. Also, the approach can be applied even to highly optimizing compilers, as demonstrated by the fact that our compiler generates target code, for a range of realistic C programs, which is competitive with both the cc and gcc compilers with all optimizations enabled. The certifier also drastically improves the effectiveness of compiler testing because, for each test case, it statically signals compilation errors that might otherwise require many executions to detect. Finally, this approach is a practical way to produce the safety proofs for a Proof-Carrying Code system, and thus may be useful in a system for safe mobile code.
Eliminating array bound checking through dependent types
H. Xi and F. Pfenning. PLDI '98. DOI: 10.1145/277650.277732

We present a type-based approach to eliminating array bound checking and list tag checking by conservatively extending Standard ML with a restricted form of dependent types. This enables the programmer to capture more invariants through types, while type-checking remains decidable in theory and can still be performed efficiently in practice. We illustrate our approach through concrete examples and present the results of our preliminary experiments, which support the feasibility and effectiveness of our approach.