A comparative study of arbitration algorithms for the Alpha 21364 pipelined router
Shubhendu S. Mukherjee, F. Silla, P. Bannon, J. Emer, S. Lang, David Webb
DOI: 10.1145/605397.605421
Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed by conflicting demand for resources such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximizing network performance. This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier, well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (the Wave-Front Arbiter, implemented in the SGI Spider switch). SPAA outperforms PIM and WFA because it exhibits matching capabilities similar to theirs under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the severe performance degradation networks suffer at saturation under high loads by prioritizing packets already in the network over new packets generated by the caches or memory.
{"title":"A comparative study of arbitration algorithms for the Alpha 21364 pipelined router","authors":"Shubhendu S. Mukherjee, F. Silla, P. Bannon, J. Emer, S. Lang, David Webb","doi":"10.1145/605397.605421","DOIUrl":"https://doi.org/10.1145/605397.605421","url":null,"abstract":"Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximize network performance.This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (Wave-Front Arbiter) implemented in the SGI Spider switch. SPAA outperforms PIM and WFA because SPAA exhibits matching capabilities similar to PIM and WFA under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the network's adverse performance degradation from saturation at high network loads by prioritizing packets already in the network over new packets generated by caches or memory.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124592520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cool-Mem: combining statically speculative memory accessing with selective address translation for energy efficiency
R. Ashok, Saurabh Chheda, C. A. Moritz
DOI: 10.1145/605397.605412
This paper presents Cool-Mem, a family of memory system architectures that integrates conventional memory system mechanisms, energy-aware address translation, and compiler-enabled cache disambiguation techniques to reduce energy consumption in general-purpose architectures. It combines statically speculative cache access modes, a dynamic CAM-based Tag-Cache used as a backup for statically mispredicted accesses, various conventional multi-level associative cache organizations, embedded protection checking along all cache access mechanisms, and architectural organizations that reduce the power consumed by address translation in virtual memory. Because it is based on speculative static information, the approach removes the burden of provable correctness from the compiler analysis passes that extract static information. This makes Cool-Mem applicable to large and complex applications, without limitations due to compiler-pass complexity or the presence of precompiled static libraries. In extensive evaluation of both SPEC2000 and Mediabench applications, we obtain 12% to 20% total processor energy savings, with performance ranging from 1.2% degradation to 8% improvement.
{"title":"Cool-Mem: combining statically speculative memory accessing with selective address translation for energy efficiency","authors":"R. Ashok, Saurabh Chheda, C. A. Moritz","doi":"10.1145/605397.605412","DOIUrl":"https://doi.org/10.1145/605397.605412","url":null,"abstract":"This paper presents Cool-Mem, a family of memory system architectures that integrate conventional memory system mechanisms, energy-aware address translation, and compiler-enabled cache disambiguation techniques, to reduce energy consumption in general purpose architectures. It combines statically speculative cache access modes, a dynamic CAM based Tag-Cache used as backup for statically mispredicted accesses, various conventional multi-level associative cache organizations, embedded protection checking along all cache access mechanisms, as well as architectural organizations to reduce the power consumed by address translation in virtual memory. Because it is based on speculative static information, the approach removes the burden of provable correctness in compiler analysis passes that extract static information. This makes Cool-Mem applicable for large and complex applications, without having any limitations due to complexity issues in the compiler passes or the presence of precompiled static libraries. Based on extensive evaluation, for both SPEC2000 and Mediabench applications, 12% to 20% total energy savings are obtained in the processor, with performance ranging from 1.2% degradation to 8% improvement, for the applications studied.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130902409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with ZebraNet
Philo Juang, Hidekazu Oki, Yong Wang, M. Martonosi, L. Peh, D. Rubenstein
DOI: 10.1145/605397.605408
Over the past decade, mobile computing and wireless communication have become increasingly important drivers of many new computing applications. The field of wireless sensor networks particularly focuses on applications involving autonomous use of compute, sensing, and wireless communication devices for both scientific and commercial purposes. This paper examines the research decisions and design tradeoffs that arise when applying wireless peer-to-peer networking techniques in a mobile sensor network designed to support wildlife tracking for biology research. The ZebraNet system includes custom tracking collars (nodes) carried by animals under study across a large, wild area; the collars operate as a peer-to-peer network to deliver logged data back to researchers. The collars include a global positioning system (GPS), flash memory, wireless transceivers, and a small CPU; essentially, each node is a small, wireless computing device. Since there is no cellular service or broadcast communication covering the region where the animals are studied, ad hoc peer-to-peer routing is needed. Although numerous ad hoc protocols exist, additional challenges arise because the researchers themselves are mobile, so there is no fixed base station towards which to aim data. Overall, our goal is to use the least energy, storage, and other resources necessary to maintain a reliable system with a very high 'data homing' success rate. We plan to deploy a 30-node ZebraNet system at the Mpala Research Centre in central Kenya. More broadly, we believe that the domain-centric protocols and energy tradeoffs presented here for ZebraNet will have general applicability in other wireless and sensor applications.
{"title":"Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with ZebraNet","authors":"Philo Juang, Hidekazu Oki, Yong Wang, M. Martonosi, L. Peh, D. Rubenstein","doi":"10.1145/605397.605408","DOIUrl":"https://doi.org/10.1145/605397.605408","url":null,"abstract":"Over the past decade, mobile computing and wireless communication have become increasingly important drivers of many new computing applications. The field of wireless sensor networks particularly focuses on applications involving autonomous use of compute, sensing, and wireless communication devices for both scientific and commercial purposes. This paper examines the research decisions and design tradeoffs that arise when applying wireless peer-to-peer networking techniques in a mobile sensor network designed to support wildlife tracking for biology research.The ZebraNet system includes custom tracking collars (nodes) carried by animals under study across a large, wild area; the collars operate as a peer-to-peer network to deliver logged data back to researchers. The collars include global positioning system (GPS), Flash memory, wireless transceivers, and a small CPU; essentially each node is a small, wireless computing device. Since there is no cellular service or broadcast communication covering the region where animals are studied, ad hoc, peer-to-peer routing is needed. Although numerous ad hoc protocols exist, additional challenges arise because the researchers themselves are mobile and thus there is no fixed base station towards which to aim data. Overall, our goal is to use the least energy, storage, and other resources necessary to maintain a reliable system with a very high `data homing' success rate. We plan to deploy a 30-node ZebraNet system at the Mpala Research Centre in central Kenya. More broadly, we believe that the domain-centric protocols and energy tradeoffs presented here for ZebraNet will have general applicability in other wireless and sensor applications.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"09 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124445083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bytecode fetch optimization for a Java interpreter
Kazunori Ogata, H. Komatsu, T. Nakatani
DOI: 10.1145/605397.605404
Interpreters play an important role in many languages, and their performance is particularly critical for the popular Java language. Interpreter performance matters even in high-performance virtual machines that employ just-in-time compiler technology, because there are advantages in delaying the start of compilation and in reducing the number of target methods to be compiled. Many techniques have been proposed to improve the performance of various interpreters, but none of them has fully addressed the issues of minimizing redundant memory accesses and the overhead of the indirect branches inherent to interpreters running on superscalar processors. These issues are especially serious for Java because each bytecode is typically only one or a few bytes long, and the execution routine for each bytecode is also short due to the low-level, stack-based semantics of Java bytecode. In this paper, we describe three novel techniques of our Java bytecode interpreter for PowerPC processors that ameliorate these problems: write-through top-of-stack caching (WT), position-based handler customization (PHC), and position-based speculative decoding (PSD). We show how each technique contributes to improving the overall performance of the interpreter for major Java benchmark programs on an IBM POWER3 processor. Of the three, PHC is the most effective. We also show that the main source of memory accesses is bytecode fetches, and that PHC successfully eliminates the majority of them while keeping instruction cache miss ratios small.
{"title":"Bytecode fetch optimization for a Java interpreter","authors":"Kazunori Ogata, H. Komatsu, T. Nakatani","doi":"10.1145/605397.605404","DOIUrl":"https://doi.org/10.1145/605397.605404","url":null,"abstract":"Interpreters play an important role in many languages, and their performance is critical particularly for the popular language Java. The performance of the interpreter is important even for high-performance virtual machines that employ just-in-time compiler technology, because there are advantages in delaying the start of compilation and in reducing the number of the target methods to be compiled. Many techniques have been proposed to improve the performance of various interpreters, but none of them has fully addressed the issues of minimizing redundant memory accesses and the overhead of indirect branches inherent to interpreters running on superscalar processors. These issues are especially serious for Java because each bytecode is typically one or a few bytes long and the execution routine for each bytecode is also short due to the low-level, stack-based semantics of Java bytecode. In this paper, we describe three novel techniques of our Java bytecode interpreter, write-through top-of-stack caching (WT), position-based handler customization (PHC), and position-based speculative decoding (PSD), which ameliorate these problems for the PowerPC processors. We show how each technique contributes to improving the overall performance of the interpreter for major Java benchmark programs on an IBM POWER3 processor. Among three, PHC is the most effective one. We also show that the main source of memory accesses is due to bytecode fetches and that PHC successfully eliminates the majority of them, while it keeps the instruction cache miss ratios small.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"1991 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125511970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing software reliability with speculative threads
Jeffrey T. Oplinger, M. Lam
DOI: 10.1145/605397.605417
This paper advocates the use of a monitor-and-recover programming paradigm to enhance the reliability of software, and proposes an architectural design that allows software and hardware to cooperate in making this paradigm more efficient and easier to program. We propose that programmers write monitoring functions assuming simple sequential execution semantics. Our architecture speeds up the computation by executing the monitoring functions speculatively, in parallel with the main computation. For recovery, programmers can define fine-grain transactions whose side effects, including all register modifications and memory writes, can be either committed or aborted under program control. Transactions are implemented efficiently by treating them as speculative threads. Our experimental results suggest that monitored execution is more amenable to parallelization than regular program execution. Code monitoring is sped up by a factor of 1.5 by exploiting single-thread instruction-level parallelism, and by an additional factor of 1.6 using thread-level speculation, for an overall improvement of 2.5 times and a sustained performance of 5.4 instructions per cycle. A monitored execution that used to run 2.5 times slower suffers a degradation of only 12% compared to performance on the baseline machine. We also show, through a number of real-life examples, that the concept of fine-grain transactional programming is useful in catching buffer overrun errors.
{"title":"Enhancing software reliability with speculative threads","authors":"Jeffrey T. Oplinger, M. Lam","doi":"10.1145/605397.605417","DOIUrl":"https://doi.org/10.1145/605397.605417","url":null,"abstract":"This paper advocates the use of a monitor-and-recover programming paradigm to enhance the reliability of software, and proposes an architectural design that allows software and hardware to cooperate in making this paradigm more efficient and easier to program.We propose that programmers write monitoring functions assuming simple sequential execution semantics. Our architecture speeds up the computation by executing the monitoring functions speculatively in parallel with the main computation. For recovery, programmers can define fine-grain transactions whose side effects, including all register modifications and memory writes, can either be committed or aborted under program control. Transactions are implemented efficiently by treating them as speculative threads.Our experimental results suggest that monitored execution is more amenable to parallelization than regular program execution. Code monitoring is sped up by a factor of 1.5 by exploiting single-thread instruction-level parallelism, and by an additional factor of 1.6 using thread-level speculation. This results in an overall improvement of 2.5 times and a sustained 5.4 instructions-per-cycle performance. A monitored execution that used to be 2.5 times slower executes with a degradation of only 12% when compared to the performance on the baseline machine. We also show that the concept of fine-grain transactional programming is useful in catching buffer overrun errors through a number of real-life examples.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124137878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}