Multi-threaded vectorization
T. Chiueh
Pub Date : 1991-04-01  DOI: 10.1145/115952.115987
A new architectural concept called multi-threaded vectorization is introduced to broaden the range of "vectorizable" code while keeping the same pipeline efficiency and simplicity as in conventional vector machines. This architecture can be viewed as a compromise between vector and VLIW machines. A compiler algorithm based on the software pipelining technique is proposed to map loops onto the multi-threaded architecture. For several kernels conventionally considered non-vectorizable, we show this architecture can deliver as much as a 60 percent performance gain over conventional vector machines.

GT-EP: a novel high-performance real-time architecture
W. Tan, H. Russ, C. Alford
Pub Date : 1991-04-01  DOI: 10.1109/ISCA.1991.1021595
This paper presents the design and development of a novel processor architecture targeted at high-performance real-time applications. The processor consists of four primary components: input devices, output devices, a dataflow processing unit (DPU), and a dataflow control unit (DCU). The central element of the processor is the DPU, which consumes data from the input devices and produces data for the output devices. The flow of data for the DPU is orchestrated by the DCU. Implemented in three silicon-compiled VLSI chips (one for the DPU and two for the DCU), the design applies modern, advanced computer design concepts and principles to formulate a novel architecture crafted for the target applications. This processor is designated the "Executive Processor," or GT-EP. Index terms: real-time processing, computer architecture, performance constraints, VLSI design, silicon compiler design, dataflow architecture.

Modeling and measurement of the impact of input/output on system performance
J. Akella, D. Siewiorek
Pub Date : 1991-04-01  DOI: 10.1145/115952.115991
The input/output (I/O) subsystem is often the bottleneck in high-performance computer systems where CPU/memory technology has been pushed to the limit. But recent microprocessor and workstation speeds are beginning to shift the system balance to the point that I/O is becoming the bottleneck even in mid-range and low-end systems. In this work the I/O subsystem's impact on system performance is shown by modeling the relative performance of VAX uniprocessors with and without enhancement in the I/O subsystem. Traditional system performance models were enhanced to include the effect of the I/O subsystem. The parameters modeling the I/O subsystem's effect were identified as D_I/O (the number of I/O bytes transferred per instruction executed by the CPU), t_I/O (the transfer time per I/O byte), and W_q (the waiting time in the I/O subsystem). These parameters were measured on a VAX 11/780 system using special-purpose hardware and were used to calibrate the enhanced system performance model. It is interesting to note that these measurements indicate that contemporary systems require a factor-of-eight increase over the I/O bandwidth requirement stated by the Amdahl-Case rule.

Detecting data races on weak memory systems
S. Adve, M. Hill, B. Miller, Robert H. B. Netzer
Pub Date : 1991-04-01  DOI: 10.1145/115953.115976
For shared-memory systems, the most commonly assumed programmer's model of memory is sequential consistency. The weaker models of weak ordering, release consistency with sequentially consistent synchronization operations, data-race-free-0, and data-race-free-1 provide higher performance by guaranteeing sequential consistency to only a restricted class of programs, mainly programs that do not exhibit data races. To allow programmers to use the intuition and algorithms already developed for sequentially consistent systems, it is important to determine when a program written for a weak system exhibits no data races. In this paper, we investigate the extension of dynamic data race detection techniques developed for sequentially consistent systems to weak systems. A potential problem is that in the presence of a data race, weak systems fail to guarantee sequential consistency and therefore dynamic techniques may not give meaningful results. However, we reason that in practice a weak system will preserve sequential consistency at least until the "first" data races, since it cannot predict if a data race will occur. We formalize this condition and show that it allows data races to be dynamically detected. Further, since this condition is already obeyed by all proposed implementations of weak systems, the full performance of weak systems can be exploited.

Adaptive storage management for very large virtual/real storage systems
Toyohiko Kagimasa, Kikuo Takahashi, Toshiaki Mori, S. Yoshizumi
Pub Date : 1991-04-01  DOI: 10.1145/115952.115989
This paper describes the storage management methodology of a very large virtual/real storage system called the Super Terabyte System (STS). Advances in semiconductor technology make a vast amount of virtual/real storage possible in computer systems. One of the most serious problems in supporting large virtual/real storage is the increase in storage management overhead. Adaptive storage management methods, elastic page allocation in a multi-size paging architecture, partial analysis controls, partial swapping, and adaptive prepaging are STS's approaches to the problem. We have developed an experimental STS, which realizes virtual storage of 256 terabytes and real storage of 1.5 gigabytes. Evaluation of the system shows that STS prevents storage management overhead from increasing in most workload environments, and that it can support real storage on the order of 10 gigabytes and virtual storage of more than that order.

Comparative evaluation of latency reducing and tolerating techniques
Anoop Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, W. Weber
Pub Date : 1991-04-01  DOI: 10.1145/115953.115978
Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware-coherent caches, (ii) relaxed memory consistency, (iii) software-controlled prefetching, and (iv) multiple-context support. While some studies of the benefits of the individual techniques have been done, no study evaluates all of the techniques within a consistent framework. This paper attempts to remedy this by providing a comprehensive evaluation of the benefits of the four techniques, both individually and in combination, using a consistent set of architectural assumptions. The results in this paper have been obtained using detailed simulations of a large-scale shared-memory multiprocessor. Our results show that caches and relaxed consistency uniformly improve performance. The improvements due to prefetching and multiple contexts are sizable, but are much more application-dependent. Combinations of the various techniques generally attain better performance than each one on its own. Overall, we show that using suitable combinations of the techniques, performance can be improved by 4 to 7 times.

Dynamic base register caching: a technique for reducing address bus width
M. Farrens, A. Park
Pub Date : 1991-04-01  DOI: 10.1145/115952.115966
When address reference streams exhibit high degrees of spatial and temporal locality, many of the higher-order address lines carry redundant information. By caching the higher-order portions of address references in a set of dynamically allocated base registers, it becomes possible to transmit small register indices between the processor and memory instead of the high-order address bits themselves. Trace-driven simulations indicate that this technique can significantly reduce processor-to-memory address bus width without an appreciable loss in performance, thereby increasing available processor bandwidth. Our results imply that as much as 25% of the available I/O bandwidth of a processor is used less than 1% of the time.

IMPACT: an architectural framework for multiple-instruction-issue processors
P. Chang, S. Mahlke, William Y. Chen, N. Warter, Wen-mei W. Hwu
DOI: 10.1145/285930.286000
The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler, to exploit instruction-level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors achieved solid speedup over a high-performance single-instruction-issue processor. To address architecture design issues, we ran experiments to characterize engineering tradeoffs such as the code scheduling model, the instruction issue rate, the memory load latency, and the functional unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compiler in generating efficient code for multiple-instruction-issue processors. By supporting these architectural features, multiple-instruction-issue implementations of existing and new architectures receive immediate compilation support from the IMPACT-I C compiler.
