Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV1.0 Compliant Open-Source Processor

arXiv (Cornell University) | Pub Date: 2023-11-13 | DOI: 10.48550/arxiv.2311.07493

Matteo Perotti, Matheus Cavalcante, Renzo Andri, Lukas Cavigelli, Luca Benini


Citations: 0

Abstract




Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ~40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.
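The issue-rate bound mentioned in the abstract can be illustrated with a toy model. This is a rough sketch and not the paper's own analysis: it assumes each lane processes one double-precision element per cycle, and the function name `cycles_per_vector_instruction` is hypothetical. Under that assumption, a vector instruction over a 32-element row keeps a wide unit busy for only a few cycles, so the scalar core has to issue the next vector instruction much sooner than it would for a narrow unit.

```python
import math


def cycles_per_vector_instruction(vector_length: int, lanes: int) -> int:
    """Cycles for which one vector instruction keeps the lanes busy.

    Assumption of this sketch: each lane processes one element per cycle.
    """
    return math.ceil(vector_length / lanes)


# Row length of the 32x32x32 matrix multiplication cited in the abstract.
VECTOR_LENGTH = 32

for lanes in (2, 4, 8, 16):
    busy = cycles_per_vector_instruction(VECTOR_LENGTH, lanes)
    # The fewer cycles each instruction covers, the faster the scalar core
    # must issue new vector instructions to avoid idle functional units.
    print(f"{lanes:2d} lanes: one vector instruction covers {busy} cycle(s)")
```

In this simplified view, a 16-lane unit needs a new vector instruction far more often than a 2-lane unit does for the same row length, which is consistent with the abstract's observation that several small vector cores, each with its own scalar core, can outperform one wide core on short vectors.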
