Improving Utilization of Dataflow Unit for Multi-Batch Processing.

IF 1.5 3区 计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE ACM Transactions on Architecture and Code Optimization Pub Date : 2023-12-18 DOI:10.1145/3637906
Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An
{"title":"Improving Utilization of Dataflow Unit for Multi-Batch Processing.","authors":"Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An","doi":"10.1145/3637906","DOIUrl":null,"url":null,"abstract":"<p>Dataflow architectures can achieve much better performance and higher efficiency than general-purpose core, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt to diverse algorithms and requirements efficiently. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster typologies as a way to accommodate different data-level parallelism. Second, we decouple threads within each DFG node into consecutive pipeline stages and provide architectural support. By time-multiplexing during these stages, dataflow hardware can achieve much higher utilization and performance. In addition, the task-based program model can also exploit multi-level parallelism and deploy applications efficiently. Evaluated in a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95 × energy efficiency (performance-per-watt) improvement over GPU (V100), and 2.01 × energy efficiency improvement over state-of-the-art dataflow architectures.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"4 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3637906","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

Abstract

Dataflow architectures can achieve much better performance and higher efficiency than general-purpose core, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt to diverse algorithms and requirements efficiently. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster typologies as a way to accommodate different data-level parallelism. Second, we decouple threads within each DFG node into consecutive pipeline stages and provide architectural support. By time-multiplexing during these stages, dataflow hardware can achieve much higher utilization and performance. In addition, the task-based program model can also exploit multi-level parallelism and deploy applications efficiently. Evaluated in a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95 × energy efficiency (performance-per-watt) improvement over GPU (V100), and 2.01 × energy efficiency improvement over state-of-the-art dataflow architectures.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
提高数据流单元在多批次处理中的利用率。
与通用内核相比,数据流架构可以实现更好的性能和更高的效率,在保持可编程性的同时,接近专用设计的性能。然而,先进的应用场景在跨域和多批次处理方面对硬件提出了更高的要求。在本文中,我们提出了一种统一的标度矢量架构,它可以在多种模式下工作,并能有效地适应不同的算法和要求。首先,我们提出了一种新颖的可重构互连结构,它可以将执行单元组织成不同的集群类型,以此来适应不同的数据级并行性。其次,我们将每个 DFG 节点内的线程解耦为连续的流水线阶段,并提供架构支持。通过在这些阶段进行时间复用,数据流硬件可以实现更高的利用率和性能。此外,基于任务的程序模型还能利用多级并行性,高效地部署应用程序。在数字信号处理算法、CNN 和科学计算算法等广泛的基准测试中,我们的设计与 GPU(V100)相比,能效(每瓦性能)提高了 11.95 倍,与最先进的数据流架构相比,能效提高了 2.01 倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization 工程技术-计算机:理论方法
CiteScore
3.60
自引率
6.20%
发文量
78
审稿时长
6-12 weeks
期刊介绍: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.
期刊最新文献
A Survey of General-purpose Polyhedral Compilers Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture Scythe: A Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1