Improving Utilization of Dataflow Unit for Multi-Batch Processing.

IF 1.8; Zone 3 (Computer Science); JCR Q4 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE); ACM Transactions on Architecture and Code Optimization; Pub Date: 2023-12-18; DOI: 10.1145/3637906

Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An


Citations: 0

Abstract

Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt efficiently to diverse algorithms and requirements. First, we propose a novel reconfigurable interconnection structure that can organize execution units into different cluster topologies to accommodate varying data-level parallelism. Second, we decouple threads within each dataflow graph (DFG) node into consecutive pipeline stages and provide architectural support. By time-multiplexing across these stages, dataflow hardware can achieve much higher utilization and performance. In addition, the task-based programming model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× energy efficiency (performance-per-watt) improvement over a GPU (V100) and 2.01× energy efficiency improvement over state-of-the-art dataflow architectures.
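The utilization argument in the abstract (decoupling a node's work into consecutive stages and overlapping them across batches) can be illustrated with a toy model. The sketch below is not the paper's design: the three stage latencies and the batch count are made-up values, and the pipelined timing assumes an idealized linear pipeline in which each stage unit handles one batch at a time with unlimited buffering between stages.

```python
from typing import Sequence


def serial_cycles(stage_latencies: Sequence[int], batches: int) -> int:
    """Cycles when each batch runs all stages before the next batch starts."""
    return batches * sum(stage_latencies)


def pipelined_cycles(stage_latencies: Sequence[int], batches: int) -> int:
    """Cycles when stages overlap across batches (idealized linear pipeline).

    The first batch pays the full latency; every later batch adds one
    interval of the slowest stage.
    """
    return sum(stage_latencies) + (batches - 1) * max(stage_latencies)


def utilization(stage_latencies: Sequence[int], batches: int, total_cycles: int) -> float:
    """Average fraction of time the stage units are busy."""
    busy = batches * sum(stage_latencies)
    return busy / (len(stage_latencies) * total_cycles)


# Hypothetical per-stage latencies in cycles (illustrative values only).
latencies = [2, 4, 2]
batches = 8

serial = serial_cycles(latencies, batches)
piped = pipelined_cycles(latencies, batches)
print(f"serial:    {serial} cycles, utilization {utilization(latencies, batches, serial):.2f}")
print(f"pipelined: {piped} cycles, utilization {utilization(latencies, batches, piped):.2f}")
```

In this toy setting, overlapping stages across batches shortens the schedule and raises average unit utilization. The gain is bounded by the slowest stage, which is why balanced stages matter in any such scheme.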




Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

3.60

Self-citation rate

6.20%

Articles published

Review time

6-12 weeks

Journal description: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.