WaveScalar架构

IF 2 4区 计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS ACM Transactions on Computer Systems Pub Date : 2007-05-01 DOI:10.1145/1233307.1233308
S. Swanson, Andrew Schwerin, M. Kim, Andrew Petersen, Andrew Putnam, Ken Michelson, M. Oskin, S. Eggers
{"title":"WaveScalar架构","authors":"S. Swanson, Andrew Schwerin, M. Kim, Andrew Petersen, Andrew Putnam, Ken Michelson, M. Oskin, S. Eggers","doi":"10.1145/1233307.1233308","DOIUrl":null,"url":null,"abstract":"Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function.\n To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication.\n This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7--14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"7 1","pages":"4:1-4:54"},"PeriodicalIF":2.0000,"publicationDate":"2007-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"143","resultStr":"{\"title\":\"The WaveScalar architecture\",\"authors\":\"S. Swanson, Andrew Schwerin, M. Kim, Andrew Petersen, Andrew Putnam, Ken Michelson, M. Oskin, S. Eggers\",\"doi\":\"10.1145/1233307.1233308\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function.\\n To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication.\\n This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7--14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.\",\"PeriodicalId\":50918,\"journal\":{\"name\":\"ACM Transactions on Computer Systems\",\"volume\":\"7 1\",\"pages\":\"4:1-4:54\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2007-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"143\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Computer Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/1233307.1233308\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Computer Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/1233307.1233308","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 143

摘要

硅技术将继续为原始晶体管的可用性提供指数级增长。然而,将这种资源有效地转化为应用程序性能是一个公开的挑战,传统的超标量设计将无法满足这一挑战。我们将WaveScalar作为传统设计的可扩展替代方案。WaveScalar是一个数据流指令集和执行模型,专为可伸缩、低复杂度/高性能处理器设计。与以前的数据流机器不同,WaveScalar可以有效地提供命令式语言所需的顺序内存语义。为了让程序员能够轻松地表达并行性,WaveScalar支持线程风格的粗粒度多线程和数据流风格的细粒度线程。此外,它还允许在应用程序甚至单个函数中混合这两种样式。为了执行wavecular程序,我们设计了一个可扩展的、基于tile的处理器架构,叫做WaveCache。当程序执行时,WaveCache将程序的指令映射到它的处理元素数组(pe)上。对于许多调用,指令保持在它们的处理元素上,当指令的工作集发生变化时,WaveCache删除未使用的指令,并在它们的位置上映射新的指令。这些指令通过可扩展的、分层的片上互连直接相互通信,避免了对长线和广播通信的需要。本文介绍了WaveScalar指令集,并对基于当前技术的仿真实现进行了评估。对于单线程应用程序,WaveCache实现了与传统处理器相当的性能,但在更小的区域。对于粗粒度线程应用程序,WaveCache在最多64个线程的情况下实现了近乎线性的加速,并且在众所周知的细粒度线程版本的内核上每个周期可以维持7- 14次乘法累加。最后,我们将两种风格的线程应用于Spec2000中的equake,并将其速度提高9倍,与串行版本相比。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
The WaveScalar architecture
Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7--14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
ACM Transactions on Computer Systems
ACM Transactions on Computer Systems 工程技术-计算机:理论方法
CiteScore
4.00
自引率
0.00%
发文量
7
审稿时长
1 months
期刊介绍: ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized. TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.
期刊最新文献
PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation Trinity: High-Performance and Reliable Mobile Emulation through Graphics Projection Hardware-software Collaborative Tiered-memory Management Framework for Virtualization Diciclo: Flexible User-level Services for Efficient Multitenant Isolation Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1