Evaluation of a multithreaded architecture for cellular computing

Calin Cascaval, J. Castaños, L. Ceze, Monty Denneau, Manish Gupta, D. Lieber, J. Moreira, K. Strauss, H. S. Warren
DOI: 10.1109/HPCA.2002.995720
Published in: Proceedings Eighth International Symposium on High Performance Computer Architecture
Publication date: 2002-02-02
URL: https://doi.org/10.1109/HPCA.2002.995720
Citations: 59

Abstract

Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.
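
The bandwidth figures quoted above come from the STREAM benchmark. As a rough illustration of how such a figure is derived, the sketch below times a Triad-style kernel and converts the bytes moved into GB/s. It is not the authors' code and not the official STREAM source (which is written in C and Fortran); the names `triad`, `triad_bytes`, `bandwidth_gb_per_s`, and `run_triad` are hypothetical, and the sketch assumes 8-byte elements with two reads and one write counted per iteration.

```python
import time
from array import array


def triad(a, b, c, q):
    """Triad-style kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bytes(n, word_bytes=8):
    """Bytes counted for one pass: two reads (b, c) and one write (a) per element."""
    return 3 * n * word_bytes


def bandwidth_gb_per_s(bytes_moved, seconds):
    """Convert bytes moved in a given time into GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def run_triad(n=100_000, q=3.0, trials=3):
    """Run the kernel several times and report the best (shortest) timing."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - start)
    return a, bandwidth_gb_per_s(triad_bytes(n), best)


if __name__ == "__main__":
    _, gb_per_s = run_triad()
    print(f"Triad (interpreted Python): {gb_per_s:.3f} GB/s")
```

In pure Python the reported number mostly reflects interpreter overhead rather than hardware memory bandwidth, so only the accounting (bytes per iteration divided by elapsed time) carries over to the measurements described in the abstract.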