Jason Hiser, Daniel W. Williams, Wei Hu, J. Davidson, Jason Mars, B. Childers
Software dynamic translation (SDT) systems are used for program instrumentation, dynamic optimization, security, intrusion detection, and many other purposes. As many researchers have noted, a major source of SDT overhead is the execution of the code needed to translate an indirect branch's target address into the address of the translated destination block. This paper discusses the sources of indirect branch (IB) overhead in SDT systems and evaluates several techniques for reducing it. Measurements using SPEC CPU2000 show that the appropriate choice and configuration of IB translation mechanisms can significantly reduce IB handling overhead. In addition, a cross-architecture evaluation of IB handling mechanisms reveals that the most efficient implementation and configuration can be highly dependent on the implementation of the underlying architecture.
"Evaluating Indirect Branch Handling Mechanisms in Software Dynamic Translation Systems," International Symposium on Code Generation and Optimization (CGO'07), published 2011-07-01. DOI: 10.1145/1970386.1970390.
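One common IB handling mechanism in SDT systems is a translation cache: a table mapping application target addresses to the addresses of their translated fragments, consulted on every indirect branch. The following is a minimal sketch of that idea; the class, addresses, and fixed fragment offset are all hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of an indirect-branch translation cache: a hash
# table from application target address to translated-fragment address,
# with a fall-back to the (expensive) translator on a miss.

class SDTSketch:
    def __init__(self):
        self.ibtc = {}      # target address -> translated fragment address
        self.misses = 0     # how often the slow translation path ran

    def translate_block(self, addr):
        # Stand-in for the full translation path the paper calls costly.
        self.misses += 1
        return addr + 0x1000    # pretend fragments live at a fixed offset

    def handle_indirect_branch(self, target):
        # Fast path: a cache hit avoids re-entering the translator.
        frag = self.ibtc.get(target)
        if frag is None:
            frag = self.translate_block(target)
            self.ibtc[target] = frag
        return frag

sdt = SDTSketch()
trace = [0x400100, 0x400200, 0x400100, 0x400100, 0x400200]
frags = [sdt.handle_indirect_branch(t) for t in trace]
```

With two distinct targets in a five-branch trace, only two slow translations occur; the paper's point is that the residual per-branch lookup cost still dominates SDT overhead and varies with how this lookup is implemented on each architecture.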
A. Dreweke, M. Wörlein, I. Fischer, Dominic Schell, T. Meinl, M. Philippsen
Procedural abstraction (PA) extracts duplicate code segments into a newly created method and hence reduces code size. For embedded microcomputers the amount of memory is still limited, so code size reduction is an important issue. This paper presents a novel approach to PA that is especially targeted at embedded systems. Earlier approaches to PA are blind with respect to code reordering, i.e., two code segments with the same semantic effect but different instruction orders were not detected as candidates for PA. Instead of instruction sequences, our approach considers the data-flow graphs of basic blocks. Compared to known PA techniques, more than twice the number of instructions can be saved on a set of binaries by detecting frequently appearing graph fragments with a graph-mining tool based on the well-known gSpan algorithm. The detection and extraction of graph fragments is not as straightforward as extracting sequential code fragments: NP-complete graph operations and special rules for deciding which parts can be abstracted are needed. However, this effort pays off, as smaller sizes significantly reduce costs on mass-produced embedded systems.
"Graph-Based Procedural Abstraction," International Symposium on Code Generation and Optimization (CGO'07), published 2008-03-25. DOI: 10.1109/CGO.2007.14.
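The reordering blindness described above can be seen in a toy form: two basic blocks containing the same independent instructions in different orders defeat sequence matching but collapse to one signature once the representation is order-insensitive. This sketch only shows that canonicalization idea; the paper itself mines data-flow graphs with gSpan, which this deliberately does not reproduce.

```python
# Two blocks with the same independent instructions in different orders.
# Sequence-based PA sees two different candidates; an order-insensitive
# (graph-like) view sees one duplicate that can be abstracted.

def canonical(block):
    # Sort independent instructions -> order-insensitive signature.
    return tuple(sorted(block))

b1 = [("add", "r1", "r2"), ("mul", "r3", "r4")]
b2 = [("mul", "r3", "r4"), ("add", "r1", "r2")]

same_as_sequence = (b1 == b2)                      # order differs
same_as_graph = (canonical(b1) == canonical(b2))   # duplicates detected
```

Real basic blocks have dependences between instructions, which is why the paper needs graph mining (and NP-complete operations) rather than a simple sort.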
Summary form only given. Moore's Law will continue to increase the number of transistors on die for a couple of decades as silicon technology moves from 65 nm today to 45 nm, 32 nm, and 22 nm in the future. Since power and thermal constraints increase with frequency, multi-core or many-core will be the way of the future microprocessor. In the near future, HW platforms will have many cores (>16 cores) on die to achieve >1 TIPS of computation power, communicating with each other through an on-die interconnect fabric with >1 TB/s on-die bandwidth and <30 cycles of latency. Off-die D-cache will employ 3D stacked memory technology to tremendously increase off-die cache/memory bandwidth and reduce latency. Fast copper flex cables will link CPU and DRAM on socket, and optical silicon photonics will provide up to 1 Tb/s of I/O bandwidth between boxes. An HW system with TIPS of compute power operating on terabytes of data makes this a "Tera-scale" platform. What are the SW implications of the HW change from uniprocessor to a Tera-scale platform with many cores as "the way of the future"? It will be a great challenge for programming environments to help programmers develop concurrent code for most client software. A good concurrent programming environment should extend the existing programming languages that typical programmers are familiar with and bring benefits for concurrent programming. There are many open research topics. Examples include flexible parallel programming models based on the needs of applications, better synchronization mechanisms such as transactional memory to replace the simple "thread + lock" structure, nested data-parallel language primitives with new protocols, fine-grained synchronization mechanisms with HW support, possibly fine-grained message passing, advanced compiler optimizations for threaded code, and SW tools in the concurrent programming environment. A more interesting problem is how to use such a many-core system to improve single-threaded performance.
J. Fang, "Parallel Programming Environment: A Key to Translating Tera-Scale Platforms into a Big Success," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-14. DOI: 10.1145/1229428.1229430.
Matlab is a matrix-processing language that offers very efficient built-in operations for data organized in arrays. However, Matlab operation is slow when the program accesses data through interpreted loops. Often, during the development of a Matlab application, writing loop-based code is more intuitive than crafting the data organization into arrays. Furthermore, many Matlab users do not command the linear-algebra expertise necessary to write efficient code. Thus loop-based Matlab coding is a fairly common practice. This paper presents a tool that automatically converts loop-based Matlab code into equivalent array-based form and built-in Matlab constructs. Array-based code is produced by checking the input and output dimensions of equations within loops and by transposing terms when necessary to generate correct code. The paper also describes an extensible loop-pattern database that allows user-defined patterns to be discovered and replaced by more efficient Matlab routines that perform the same computation. The safe conversion of loop-based code into more efficient array-based code is made possible by the introduction of a new abstract representation for dimensions.
N. Birkbeck, J. Levesque, J. N. Amaral, "A Dimension Abstraction Approach to Vectorization in Matlab," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.1.
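The core transformation can be illustrated in Python rather than Matlab: a dimension check establishes that the terms conform, which is what licenses replacing an interpreted elementwise loop with a single whole-array expression. The helper name and the data are invented for illustration; the paper's dimension abstraction is far richer than a length comparison.

```python
# Loop-to-array conversion in miniature: verify dimensions conform,
# then replace the elementwise loop with one array-level expression.

def dims_conform(a, b):
    # Hypothetical stand-in for the paper's dimension abstraction.
    return len(a) == len(b)

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Loop-based form (what a user naturally writes):
c_loop = []
for i in range(len(a)):
    c_loop.append(a[i] * b[i])

# Array-based form (what the tool would emit), legal only because
# the dimension check passed:
assert dims_conform(a, b)
c_vec = [x * y for x, y in zip(a, b)]
```

In Matlab the emitted form would be the built-in `a .* b`, executed by optimized array machinery instead of the interpreter's loop dispatch.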
K. Vaswani, M. J. Thazhuthaveetil, Y. Srikant, P. Joseph
This paper proposes the use of empirical modeling techniques for building microarchitecture-sensitive models of compiler optimizations. The models we build relate program performance to the settings of compiler optimization flags, their associated heuristics, and key microarchitectural parameters. Unlike traditional analytical modeling methods, this relationship is learned entirely from data obtained by measuring performance at a small number of carefully selected compiler/microarchitecture configurations. We evaluate three different learning techniques in this context: linear regression, adaptive regression splines, and radial basis function networks. We use the generated models to (a) predict program performance at arbitrary compiler/microarchitecture configurations, (b) quantify the significance of complex interactions between optimizations and the microarchitecture, and (c) efficiently search for 'optimal' settings of optimization flags and heuristics for any given microarchitectural configuration. Our evaluation using benchmarks from the SPEC CPU2000 suites suggests that accurate models (<5% average prediction error) can be generated using a reasonable number of simulations. We also find that using compiler settings prescribed by a model-based search can improve program performance by as much as 19% (9.5% on average) over highly optimized binaries.
"Microarchitecture Sensitive Empirical Models for Compiler Optimizations," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.25.
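The simplest of the three learners, linear regression, can be sketched end to end on synthetic data: measure a few chosen flag configurations, fit weights, then predict an unmeasured configuration and use the model to pick flag settings. Everything here (the "benchmark", the flags, the weights) is invented; it only mirrors the measure-fit-predict-search workflow, not the paper's models.

```python
def measure(flag_a, flag_b):
    # Stand-in for running a benchmark at one flag configuration;
    # the linear ground truth is hidden from the "modeler".
    return 10.0 - 2.0 * flag_a - 1.5 * flag_b

# Fit runtime ~ w0 + w1*flag_a + w2*flag_b from three carefully
# selected configurations (enough to identify a linear model exactly).
w0 = measure(0, 0)
w1 = measure(1, 0) - w0
w2 = measure(0, 1) - w0

def predict(flag_a, flag_b):
    return w0 + w1 * flag_a + w2 * flag_b

# Model-based search: evaluate the cheap model, not the benchmark,
# over all configurations and keep the predicted-fastest one.
best = min([(0, 0), (0, 1), (1, 0), (1, 1)], key=lambda f: predict(*f))
```

The point of the paper is that the same workflow, with more expressive learners, stays accurate across interacting flags and microarchitectural parameters where a hand-built analytical model would not.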
Program-specific or function-specific optimization phase sequences are universally accepted to achieve better overall performance than any fixed optimization phase ordering. A number of heuristic phase-order space search algorithms have been devised to find customized phase orderings that achieve high performance for each function. However, to make this approach of iterative compilation more widely accepted and deployed in mainstream compilers, it is essential to modify existing algorithms, or develop new ones, that find near-optimal solutions quickly. As a step in this direction, this paper attempts to identify and understand the important properties of some commonly employed heuristic search methods by using information collected during an exhaustive exploration of the phase-order search space. We compare the performance obtained by each algorithm with all the others, as well as with the optimal phase-ordering performance. Finally, we show how features of the phase-order space can be used to improve existing algorithms as well as to devise new, better-performing search algorithms.
P. Kulkarni, D. Whalley, G. Tyson, "Evaluating Heuristic Optimization Phase Order Search Algorithms," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.9.
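The gap between heuristic and exhaustive search that the paper studies shows up even in a three-phase toy: with a synthetic cost function encoding phase interactions, exhaustive enumeration finds the optimum while a greedy adjacent-swap hill climb can get trapped in a local minimum. The phase names and cost function below are invented purely for illustration.

```python
import itertools

PHASES = ["inline", "cse", "sched"]

def cost(order):
    # Synthetic interactions: cse is more effective after inline,
    # and scheduling works best as the final phase. Lower is better.
    c = 10
    if order.index("cse") > order.index("inline"):
        c -= 3
    if order[-1] == "sched":
        c -= 2
    return c

# Exhaustive exploration of the (tiny) phase-order space.
exhaustive_best = min(itertools.permutations(PHASES), key=cost)

def hill_climb(order):
    # Greedy heuristic: keep swapping adjacent phases while it helps.
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            cand = order[:]
            cand[i], cand[i + 1] = cand[i + 1], cand[i]
            if cost(cand) < cost(order):
                order, improved = cand, True
    return tuple(order)

local = hill_climb(("sched", "cse", "inline"))  # stuck at cost 7, not 5
```

Exhaustive data like this, scaled up, is exactly what lets the paper characterize when and why the heuristics stop short of optimal.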
To keep up with explosive Internet packet-processing demands, modern network processors (NPs) employ a highly parallel, multi-threaded, and multi-core architecture. In such a parallel paradigm, accesses to shared variables in external memory (and the associated memory latency) are contained in critical sections so that they can be executed atomically and sequentially by the different threads in the network processor. In this paper, we present a novel program transformation, used in the Intel® Auto-partitioning C Compiler for IXP, that exploits the inherent finer-grained parallelism of those critical sections using the software-controlled caching mechanism available in NPs. Consequently, those critical sections can be executed in a pipelined fashion by different threads, thereby effectively hiding the memory latency and improving the performance of network applications. Experimental results show that the proposed transformation provides impressive speedup (up to 9.9×) and scalability (up to 80 threads) for a real-world network application (a 10 Gbps Ethernet Core/Metro Router).
J. Dai, Long Li, Bo Huang, "Pipelined Execution of Critical Sections Using Software-Controlled Caching in Network Processors," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.30.
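Why pipelining a critical section helps can be captured in a back-of-the-envelope model: if a critical section decomposes into S stages and N threads must pass through it, fully serialized execution costs N·S stage-times, while a pipelined version (each stage occupied by a different thread, shared state held in the software-controlled cache) costs roughly S + (N − 1). This is an illustrative latency model only, not the paper's transformation or its measured numbers.

```python
# Idealized stage-time cost of N threads executing an S-stage
# critical section, serialized vs. pipelined.

def serialized(n_threads, stages):
    # Each thread holds the whole critical section exclusively.
    return n_threads * stages

def pipelined(n_threads, stages):
    # Classic pipeline fill + drain: one result per stage-time
    # once the pipeline is full.
    return stages + (n_threads - 1)

speedup = serialized(8, 4) / pipelined(8, 4)   # 32 vs. 11 stage-times
```

The model also hints at the scalability result: the pipelined cost grows by one stage-time per extra thread rather than by a whole critical-section length.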
There has been a flurry of recent work on the design of high-performance software and hybrid hardware/software transactional memories (STMs and HyTMs). This paper re-examines the design decisions behind several of these state-of-the-art algorithms, adopting some ideas and rejecting others, all in an attempt to make STMs faster. We created the transactional locking (TL) framework of STM algorithms and used it to conduct a range of comparisons of the performance of non-blocking, lock-based, and hybrid STM algorithms against fine-grained hand-crafted ones. We were able to make several illuminating observations regarding lock-acquisition order, the interaction of STMs with memory management schemes, and the role of overheads and abort rates in STM performance.
D. Dice, N. Shavit, "Understanding Tradeoffs in Software Transactional Memory," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.38.
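Two of the design dimensions mentioned above, lock-acquisition order and abort rates, can be sketched with a minimal lock-based "transaction": writes are buffered, per-location locks are taken in a fixed global order (one simple deadlock-avoidance policy), the read set is validated against version numbers, and a stale transaction aborts. This is a deliberately simplified sketch in the spirit of lock-based STMs; the actual TL algorithms use versioned write-locks and considerably more machinery.

```python
import threading

memory   = {"x": 0, "y": 0}
locks    = {k: threading.Lock() for k in memory}
versions = {k: 0 for k in memory}   # bumped on every committed write

def run_txn(updates, read_snapshot):
    # Acquire per-location locks in sorted key order: a single global
    # order means two transactions can never deadlock on each other.
    keys = sorted(updates)
    for k in keys:
        locks[k].acquire()
    try:
        # Validate: abort if any location read has changed since the
        # snapshot was taken (a would-be inconsistent commit).
        for k, ver in read_snapshot.items():
            if versions[k] != ver:
                return False
        for k, v in updates.items():
            memory[k] = v
            versions[k] += 1
        return True
    finally:
        for k in reversed(keys):
            locks[k].release()

snap = dict(versions)
ok1 = run_txn({"x": 1, "y": 2}, snap)   # commits
ok2 = run_txn({"x": 5}, snap)           # aborts: snapshot is stale
```

Even this toy makes the paper's trade-offs visible: a stricter validation policy raises the abort rate, and the choice of acquisition order is what keeps the lock-based design safe.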
Denis Barthou, S. Donadio, Patrick Carribault, Alexandre Duchateau, W. Jalby
The increasing complexity of hardware features in recent processors makes high-performance code generation very challenging. In particular, several optimization targets have to be pursued simultaneously (minimizing L1/L2/L3/TLB misses and maximizing instruction-level parallelism). Very often, these optimization goals impose different and contradictory constraints on the transformations to be applied. We propose a new hierarchical compilation approach for the generation of high-performance code that relies on the use of state-of-the-art compilers. The approach is not application-dependent and does not require any assembly hand-coding. It relies on the decomposition of the original loop nest into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to optimize dense matrix-multiply primitives (not only for the square case but also for the more general rectangular cases) and convolution. The optimized codes outperform ATLAS on the Itanium 2 and Pentium 4 architectures and, in most cases, match hand-tuned vendor libraries (e.g., MKL).
"Loop Optimization using Hierarchical Compilation and Kernel Decomposition," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.22.
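The decomposition idea can be shown structurally: instead of optimizing a full three-deep matrix-multiply nest at once, the nest is factored so that a simple inner kernel (here a 1D dot product) is the unit that gets tuned in isolation. This Python sketch only mirrors that structure; the paper generates and tunes such kernels at the machine-code level with production compilers.

```python
# Matrix multiply factored around a simple 1D kernel: the outer nest
# stays trivial, and all the tuning effort targets dot_kernel alone.

def dot_kernel(row, col):
    # The "simple kernel" a hierarchical approach would optimize
    # separately (unrolling, software pipelining, vectorization, ...).
    return sum(a * b for a, b in zip(row, col))

def matmul(A, B):
    cols = list(zip(*B))    # columns of B, so the kernel sees 1D data
    return [[dot_kernel(r, c) for c in cols] for r in A]

C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Because conflicting goals (cache misses vs. ILP) now apply to a 1D loop instead of a 3D nest, the search space each compiler invocation faces is drastically smaller.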
As transistors become increasingly smaller and faster with tighter noise margins, modern processors are becoming increasingly susceptible to transient hardware faults. Existing hardware-based redundant multi-threading (HRMT) approaches rely mostly on special-purpose hardware to replicate the program into redundant execution threads and compare their computation results. In this paper, we present a software-based redundant multi-threading (SRMT) approach for transient fault detection. Our SRMT technique uses the compiler to automatically generate redundant threads so they can run on general-purpose chip multi-processors (CMPs). We exploit high-level program information available at compile time to optimize data communication between the redundant threads. Furthermore, our software-based technique provides a flexible program-execution environment in which legacy binary code and reliability-enhanced code can coexist in a mix-and-match fashion, depending on the desired level of reliability and software compatibility. Our experimental results show that compiler analysis and optimization techniques can reduce the data-communication requirement by up to 88% compared to HRMT. With the general-purpose intra-chip communication mechanisms of a CMP machine, SRMT overhead can be as low as 19%. Moreover, the SRMT technique achieves error-coverage rates of 99.98% and 99.6% for the SPEC CPU2000 integer and floating-point benchmarks, respectively. These results demonstrate the competitiveness of SRMT relative to HRMT approaches.
Cheng Wang, Ho-Seop Kim, Youfeng Wu, V. Ying, "Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection," International Symposium on Code Generation and Optimization (CGO'07), published 2007-03-11. DOI: 10.1109/CGO.2007.7.
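The detection principle behind redundant multi-threading is simple to sketch: execute the same computation twice and compare; a transient fault corrupts at most one copy, so a mismatch flags it. In this toy the "fault" is injected deliberately and both executions run in one thread; in SRMT proper the compiler generates the redundant thread, it runs on another core, and the compared values flow through optimized inter-thread communication.

```python
# Duplicate-and-compare fault detection in miniature.

def compute(x, fault=False):
    r = x * x + 1
    if fault:
        r ^= 4          # simulate a transient bit flip in the result
    return r

def detect_fault(x, fault_in_leading=False):
    leading  = compute(x, fault=fault_in_leading)  # "leading" execution
    trailing = compute(x)                          # redundant execution
    return leading != trailing    # mismatch => transient fault detected

ok_run  = detect_fault(3)                          # copies agree
bad_run = detect_fault(3, fault_in_leading=True)   # fault caught
```

The cost the paper attacks is exactly the comparison traffic between the two copies: every value that must be checked is a message, which is why compile-time analysis that prunes that communication matters so much.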