
Latest articles: [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture

Multi-threaded vectorization
T. Chiueh
A new architectural concept called multi-threaded vectorization is introduced to broaden the range of "vectorizable" code while keeping the same pipeline efficiency and simplicity as in conventional vector machines. This architecture can be viewed as a compromise between vector and VLIW machines. A compiler algorithm based on the software-pipelining technique is proposed to map loops onto the multi-threaded architecture. For several kernels conventionally considered nonvectorizable, we show this architecture can deliver as much as a 60 percent performance gain over conventional vector machines.
DOI: 10.1145/115952.115987 · Published: 1991-04-01
Citations: 18
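The abstract above maps loops onto a multi-threaded pipeline via software pipelining. As a rough illustration (not the paper's algorithm), the sketch below overlaps the stages of successive loop iterations — the core idea that lets a pipelined machine stay busy on loops that plain vectorization cannot handle:

```python
# Illustrative sketch of software pipelining: for a 3-stage loop body
# (load, compute, store) with an initiation interval of 1, steady-state
# cycle t issues load(t), compute(t-1), and store(t-2) simultaneously.

def software_pipeline(n_iters, stages=("load", "compute", "store")):
    """Return, per cycle, the (stage, iteration) slots issued that cycle."""
    depth = len(stages)
    schedule = []
    for cycle in range(n_iters + depth - 1):  # prologue + steady state + epilogue
        slots = []
        for s, name in enumerate(stages):
            it = cycle - s  # stage s of iteration (cycle - s)
            if 0 <= it < n_iters:
                slots.append((name, it))
        schedule.append(slots)
    return schedule

sched = software_pipeline(4)
# Steady-state cycle 2 issues three stages from three different iterations.
print(sched[2])  # [('load', 2), ('compute', 1), ('store', 0)]
```

In the multi-threaded setting the paper describes, the overlapped stages would be distributed across hardware threads rather than functional-unit slots, but the scheduling structure is the same.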
GT-EP: a novel high-performance real-time architecture
W. Tan, H. Russ, C. Alford
This paper presents the design and development of a novel processor architecture targeted for high-performance real-time applications. The processor consists of four primary components: input devices, output devices, a dataflow processing unit (DPU), and a dataflow control unit (DCU). The central element of the processor is the DPU, which consumes data from input devices and produces data to output devices. The flow of data for the DPU is orchestrated by the DCU. Implemented in three silicon-compiled VLSI chips (one for the DPU and two for the DCU), the design utilizes modern, advanced computer design concepts and principles to formulate a novel architecture crafted for the target applications. This processor is designated as the "Executive Processor" or GT-EP. Index terms: real-time processing, computer architecture, performance constraints, VLSI design, silicon compiler design, dataflow architecture.
DOI: 10.1109/ISCA.1991.1021595 · Published: 1991-04-01
Citations: 2
Modeling and measurement of the impact of input/output on system performance
J. Akella, D. Siewiorek
The input/output (I/O) subsystem is often the bottleneck in high-performance computer systems where the CPU/memory technology has been pushed to the limit. But recent microprocessor and workstation speeds are beginning to shift the system balance to the point that I/O is becoming the bottleneck even in mid-range and low-end systems. In this work the I/O subsystem's impact on system performance is shown by modeling the relative performance of VAX uniprocessors with and without enhancement in the I/O subsystem. Traditional system performance models were enhanced to include the effect of the I/O subsystem. The parameters modeling the I/O subsystem's effect were identified as D_I/O (the number of I/O bytes transferred per instruction executed by the CPU), t_I/O (the transfer time per I/O byte), and W_q (the waiting time in the I/O subsystem). These parameters were measured on a VAX-11/780 system by using special-purpose hardware and were used to calibrate the enhanced system performance model. It is interesting to note that these measurements indicate that contemporary systems require a factor of eight increase over the I/O bandwidth requirement stated by the Amdahl-Case rule.
DOI: 10.1145/115952.115991 · Published: 1991-04-01
Citations: 16
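The abstract identifies three parameters: D_I/O, t_I/O, and W_q. One plausible way to combine them — the paper's exact model is not reproduced here, and the numbers below are hypothetical — is to charge each instruction its share of transfer time plus queueing delay:

```python
# Hypothetical combination of the paper's three I/O parameters into a
# per-instruction I/O time. The bytes_per_io transfer size is an assumption.

def io_time_per_instruction(d_io, t_io, w_q, bytes_per_io):
    """I/O service time plus queueing delay attributable to one instruction.

    d_io: I/O bytes transferred per instruction executed (D_I/O)
    t_io: transfer time per I/O byte, in seconds (t_I/O)
    w_q:  waiting time per I/O operation in the subsystem, in seconds (W_q)
    bytes_per_io: assumed average size of one I/O transfer, in bytes
    """
    ops_per_instr = d_io / bytes_per_io  # I/O operations per instruction
    return d_io * t_io + ops_per_instr * w_q

# Illustrative numbers: 0.5 bytes/instruction, 1 us/byte, 5 ms wait, 512 B I/Os.
print(io_time_per_instruction(0.5, 1e-6, 5e-3, 512))
```

Even with toy numbers, the structure makes the paper's point visible: as CPUs get faster, the W_q term dominates unless I/O bandwidth scales with instruction rate.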
Detecting data races on weak memory systems
S. Adve, M. Hill, B. Miller, Robert H. B. Netzer
For shared-memory systems, the most commonly assumed programmer's model of memory is sequential consistency. The weaker models of weak ordering, release consistency with sequentially consistent synchronization operations, data-race-free-0, and data-race-free-1 provide higher performance by guaranteeing sequential consistency to only a restricted class of programs - mainly programs that do not exhibit data races. To allow programmers to use the intuition and algorithms already developed for sequentially consistent systems, it is important to determine when a program written for a weak system exhibits no data races. In this paper, we investigate the extension of dynamic data race detection techniques developed for sequentially consistent systems to weak systems. A potential problem is that in the presence of a data race, weak systems fail to guarantee sequential consistency and therefore dynamic techniques may not give meaningful results. However, we reason that in practice a weak system will preserve sequential consistency at least until the "first" data races since it cannot predict if a data race will occur. We formalize this condition and show that it allows data races to be dynamically detected. Further, since this condition is already obeyed by all proposed implementations of weak systems, the full performance of weak systems can be exploited.
DOI: 10.1145/115953.115976 · Published: 1991-04-01
Citations: 145
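Dynamic race detectors of the kind the abstract extends typically use a happens-before order: two accesses to the same location race if neither is ordered before the other and at least one is a write. A minimal vector-clock sketch of that check (illustrative only, not the paper's formalization):

```python
# Minimal happens-before race check: each access carries a vector clock;
# accesses race iff their clocks are concurrent and one of them writes.

def races(a, b):
    """a, b: (vector_clock_tuple, is_write) for two accesses to one location."""
    ca, wa = a
    cb, wb = b

    def hb(x, y):  # x happens-before y: componentwise <=, and not equal
        return all(xi <= yi for xi, yi in zip(x, y)) and x != y

    concurrent = not hb(ca, cb) and not hb(cb, ca)
    return concurrent and (wa or wb)

print(races(((1, 0), True), ((0, 1), True)))   # True: concurrent writes race
print(races(((1, 0), True), ((2, 1), False)))  # False: ordered by happens-before
```

The paper's contribution is showing this style of detection remains meaningful on weak memory systems, because implementations preserve sequential consistency at least up to the first race.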
Adaptive storage management for very large virtual/real storage systems
Toyohiko Kagimasa, Kikuo Takahashi, Toshiaki Mori, S. Yoshizumi
This paper describes the storage management methodology of a very large virtual/real storage system called the Super Terabyte System (STS). Advances in semiconductor technology make a vast amount of virtual/real storage possible in computer systems. One of the most serious problems in supporting large virtual/real storage is the increase in storage management overhead. Adaptive storage management methods, elastic page allocation in a multi-size paging architecture, partial analysis controls, partial swapping, and adaptive prepaging are STS's approaches to the problem. We have developed an experimental STS, which realizes virtual storage of 256 terabytes and real storage of 1.5 gigabytes. Evaluation of the system shows that STS prevents the storage management overhead from increasing in most workload environments, and that it can support real storage on the order of 10 gigabytes and virtual storage of more than 10 gigabytes.
DOI: 10.1145/115952.115989 · Published: 1991-04-01
Citations: 9
Comparative evaluation of latency reducing and tolerating techniques
Anoop Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, W. Weber
Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware coherent caches, (ii) relaxed memory consistency, (iii) software-controlled prefetching, and (iv) multiple-context support. While some studies of the benefits of the individual techniques have been done, no study evaluates all of the techniques within a consistent framework. This paper attempts to remedy this by providing a comprehensive evaluation of the benefits of the four techniques, both individually and in combinations, using a consistent set of architectural assumptions. The results in this paper have been obtained using detailed simulations of a large-scale shared-memory multiprocessor. Our results show that caches and relaxed consistency uniformly improve performance. The improvements due to prefetching and multiple contexts are sizeable, but are much more application-dependent. Combinations of the various techniques generally attain better performance than each one on its own. Overall, we show that using suitable combinations of the techniques, performance can be improved by 4 to 7 times.
DOI: 10.1145/115953.115978 · Published: 1991-04-01
Citations: 229
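Of the four techniques above, software-controlled prefetching is the easiest to model back-of-envelope: a prefetch issued far enough ahead of its use hides the memory latency behind useful work. The model below is a hypothetical first-order sketch, not the paper's simulation methodology:

```python
# First-order model of software-controlled prefetching: a loop that misses
# the cache every iteration stalls for the full memory latency, while a
# prefetch issued `distance` iterations early hides distance*work cycles.

def loop_cycles(n, work, latency, distance=0):
    """Total cycles for n iterations of `work` cycles each, given a
    per-iteration miss of `latency` cycles and a prefetch distance."""
    stall = max(0, latency - distance * work)  # latency not yet hidden
    return n * (work + stall)

print(loop_cycles(100, work=4, latency=40))               # 4400: no prefetching
print(loop_cycles(100, work=4, latency=40, distance=10))  # 400: latency fully hidden
```

The model also hints at why the paper finds prefetching application-dependent: the benefit collapses when the loop body is too short, or the access pattern too irregular, to sustain a useful prefetch distance.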
Dynamic base register caching: a technique for reducing address bus width
M. Farrens, A. Park
When address reference streams exhibit high degrees of spatial and temporal locality, many of the higher-order address lines carry redundant information. By caching the higher-order portions of address references in a set of dynamically allocated base registers, it becomes possible to transmit small register indices between the processor and memory instead of the high-order address bits themselves. Trace-driven simulations indicate that this technique can significantly reduce processor-to-memory address bus width without an appreciable loss in performance, thereby increasing available processor bandwidth. Our results imply that as much as 25% of the available I/O bandwidth of a processor is used less than 1% of the time.
DOI: 10.1145/115952.115966 · Published: 1991-04-01
Citations: 86
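The mechanism above is concrete enough to simulate: cache the high-order address bits in a small table of base registers, and on a hit drive only a register index plus the low-order bits onto the bus. The following sketch is illustrative (names, 32-bit addresses, and LRU replacement are assumptions, not details from the paper):

```python
# Illustrative model of dynamic base register caching: on a hit, only a
# small register index plus the low-order address bits cross the bus;
# on a miss, the full (assumed 32-bit) address is transmitted.

class BaseRegisterCache:
    def __init__(self, n_regs=4, low_bits=16):
        self.regs = []        # LRU-ordered list of cached high-order parts
        self.n_regs = n_regs
        self.low_bits = low_bits
        self.index_bits = (n_regs - 1).bit_length()

    def reference(self, addr):
        """Return the number of bits driven on the bus for this address."""
        hi = addr >> self.low_bits
        if hi in self.regs:
            self.regs.remove(hi)
            self.regs.append(hi)      # refresh LRU position
            return self.index_bits + self.low_bits
        if len(self.regs) == self.n_regs:
            self.regs.pop(0)          # evict least recently used
        self.regs.append(hi)
        return 32                     # miss: transmit the full address

brc = BaseRegisterCache()
print(brc.reference(0x12340010))  # 32: first reference misses
print(brc.reference(0x12340020))  # 18: same page hits (2-bit index + 16 low bits)
```

With high locality most references hit, so the effective bus width drops from 32 to index_bits + low_bits — the effect the trace-driven simulations in the paper quantify.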
IMPACT: an architectural framework for multiple-instruction-issue processors
P. Chang, S. Mahlke, William Y. Chen, N. Warter, Wen-mei W. Hwu
The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction-level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors have achieved solid speedup over a high-performance single-instruction-issue processor. To address architecture design issues, we ran experiments to characterize the engineering tradeoffs such as the code scheduling model, the instruction issue rate, the memory load latency, and the function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compiler to generate efficient code for multiple-instruction-issue processors. By supporting these architectural features, multiple-instruction-issue implementations of existing and new architectures receive immediate compilation support from the IMPACT-I C compiler.
DOI: 10.1145/285930.286000
Citations: 250