
Latest Articles in IEEE Computer Architecture Letters

2024 Reviewers List
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-22 | DOI: 10.1109/LCA.2025.3528619
Citations: 0
High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-08 | DOI: 10.1109/LCA.2025.3525970
Vardhana M;Rohan Pinto
Convolutional Neural Networks are mostly deployed on GPUs or CPUs. However, due to increasing architectural complexity and growing performance requirements, these platforms may not be suitable for deploying inference engines. ASIC and FPGA implementations are emerging as superior alternatives to software-based solutions for achieving the required performance. In this article, an efficient architecture for accelerating convolution using the Winograd transform is proposed and implemented on an FPGA. The proposed accelerator consumes 38% fewer resources than a conventional GEMM-based implementation. Analysis results indicate that our accelerator can achieve 3.5 TOP/s, 1.28 TOP/s, and 1.42 TOP/s for the VGG16, ResNet18, and MobileNetV2 CNNs, respectively, at 250 MHz. The proposed accelerator demonstrates the best energy efficiency compared with prior art.
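For readers unfamiliar with the transform this accelerator builds on, the 1D case F(2,3) below shows the idea in a few lines: two outputs of a 3-tap filter are computed with four multiplications instead of six. This is a generic sketch using the standard Winograd matrices, not the authors' design, which applies the same algebra to 2D convolution tiles in hardware.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices: Y = A^T [(G g) * (B^T d)],
# where * is elementwise. 4 multiplies replace the 6 of the schoolbook method.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 filter outputs."""
    U = G @ g            # filter transform (precomputable per filter)
    V = BT @ d           # input-tile transform
    return AT @ (U * V)  # 4 elementwise multiplies, then inverse transform
```

In a hardware pipeline the filter transform U is computed once offline, so only the input transform, the elementwise multiplies, and the output transform run per tile.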
Citations: 0
PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-12-25 | DOI: 10.1109/LCA.2024.3522777
Sepehr Tabrizchi;Mehrdad Morsali;David Pan;Shaahin Angizi;Arman Roohi
This letter introduces PINSim, a user-friendly and flexible framework for simulating emerging smart vision sensors in the early design stages. PINSim enables the realization of integrated sensing and processing near and in the sensor, effectively addressing challenges such as data movement and power-hungry analog-to-digital converters. The framework offers a flexible interface and a wide range of design options for customizing the efficiency and accuracy of processing-near/in-sensor-based accelerators using a hierarchical structure. Its organization spans from the device level upward to the algorithm level. PINSim realizes instruction-accurate evaluation of circuit-level performance metrics. PINSim achieves over 25,000× speed-up compared to SPICE simulation with less than a 4.1% error rate on average. Furthermore, it supports both multilayer perceptron (MLP) and convolutional neural network (CNN) models, with limitations determined by IoT budget constraints. By facilitating the exploration and optimization of various design parameters, PINSim empowers researchers and engineers to develop energy-efficient and high-performance smart vision sensors for a wide range of applications.
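To illustrate the kind of hierarchical roll-up such a model performs, the sketch below feeds per-device energies into an architecture-level comparison of near-sensor versus in-sensor processing. Every number and function name here is a made-up placeholder for illustration, not a PINSim parameter or API.

```python
# Hypothetical per-operation energies (joules); placeholders only.
E_SENSE = 0.5e-12   # per pixel read
E_ADC   = 2.0e-12   # per A/D conversion
E_MAC   = 0.1e-12   # per digital multiply-accumulate

def near_sensor_energy(pixels, macs_per_pixel):
    # Every pixel is digitized, then processed digitally next to the sensor.
    return pixels * (E_SENSE + E_ADC) + pixels * macs_per_pixel * E_MAC

def in_sensor_energy(pixels, outputs):
    # Analog-domain processing: only the reduced outputs are digitized,
    # which is how in-sensor designs sidestep the power-hungry ADCs.
    return pixels * E_SENSE + outputs * E_ADC
```

Under these placeholder numbers, digitizing 64 reduced outputs instead of 1024 raw pixels cuts energy severalfold, which is the trade-off space a simulator like this lets designers explore quantitatively.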
Citations: 0
ZoneBuffer: An Efficient Buffer Management Scheme for ZNS SSDs
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-12-16 | DOI: 10.1109/LCA.2024.3498103
Hongtao Wang;Peiquan Jin
The introduction of Zoned Namespace SSDs (ZNS SSDs) presents new challenges for existing buffer management schemes. In addition to traditional SSD characteristics such as read/write asymmetry and limited write endurance, ZNS SSDs possess unique constraints, such as requiring sequential writes within each zone. These features make conventional buffering policies incompatible with ZNS SSDs. This paper introduces ZoneBuffer, a novel buffering scheme designed specifically for ZNS SSDs. ZoneBuffer's innovation lies in two key aspects. First, it introduces a new buffer structure comprising a Work Region and a Priority Region. The Priority Region is further divided into a clean page queue and a zone cluster of dirty pages. By confining buffer replacement to the Priority Region, ZoneBuffer ensures optimization for ZNS SSDs. Second, ZoneBuffer incorporates a lifetime-based clustering algorithm to group dirty pages within the Priority Region, optimizing write operations. Preliminary experiments conducted on a real ZNS SSD demonstrate the effectiveness of ZoneBuffer. Compared with conventional schemes like LRU and CFLRU, the results indicate that ZoneBuffer significantly improves performance.
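A rough sketch of the eviction idea described above: keep a clean-page LRU queue and per-zone clusters of dirty pages, drop clean pages first (they cost nothing to evict), and when dirty pages must go, flush a whole zone cluster at once so writes to each zone remain sequential. Class and method names are hypothetical; the actual ZoneBuffer additionally maintains a Work Region and uses lifetime-based clustering.

```python
from collections import OrderedDict, defaultdict

class ZoneBufferSketch:
    def __init__(self, capacity, zone_of):
        self.capacity = capacity
        self.zone_of = zone_of          # page -> zone mapping (hypothetical)
        self.clean = OrderedDict()      # LRU queue of clean pages
        self.dirty = defaultdict(list)  # zone -> dirty pages, insertion order
        self.ndirty = 0
        self.flushed = []               # (zone, pages) sequential write-outs

    def size(self):
        return len(self.clean) + self.ndirty

    def access(self, page, write=False):
        if page in self.clean:
            if write:                   # clean hit turning dirty
                del self.clean[page]
                self.dirty[self.zone_of(page)].append(page)
                self.ndirty += 1
            else:
                self.clean.move_to_end(page)
            return
        if any(page in ps for ps in self.dirty.values()):
            return                      # dirty hit: nothing to do
        if self.size() >= self.capacity:
            self.evict()
        if write:
            self.dirty[self.zone_of(page)].append(page)
            self.ndirty += 1
        else:
            self.clean[page] = True

    def evict(self):
        if self.clean:                  # clean victims are free to drop
            self.clean.popitem(last=False)
        else:                           # flush the largest zone cluster whole
            zone = max(self.dirty, key=lambda z: len(self.dirty[z]))
            pages = self.dirty.pop(zone)
            self.ndirty -= len(pages)
            self.flushed.append((zone, pages))
```

Flushing dirty pages zone-by-zone is what reconciles buffer replacement with the sequential-write-per-zone constraint of ZNS devices.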
Citations: 0
Straw: A Stress-Aware WL-Based Read Reclaim Technique for High-Density NAND Flash-Based SSDs
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-12-12 | DOI: 10.1109/LCA.2024.3516205
Myoungjun Chun;Jaeyong Lee;Inhyuk Choi;Jisung Park;Myungsuk Kim;Jihong Kim
Although read disturbance has emerged as a major reliability concern, managing read disturbance in modern NAND flash memory has not been thoroughly investigated yet. From a device characterization study using real modern NAND flash memory, we observe that reading a page incurs heterogeneous reliability impacts on each WL, which makes the existing block-level read reclaim extremely inefficient. We propose a new WL-level read-reclaim technique, called Straw, which keeps track of the accumulated read-disturbance effect on each WL and reclaims only heavily-disturbed WLs. By avoiding unnecessary read-reclaim operations, Straw reduces read-reclaim-induced page writes by 83.6% with negligible storage overhead.
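The bookkeeping behind WL-granular reclaim can be sketched as follows: each read adds a disturbance weight to the word lines around the read page, and only word lines whose accumulated stress crosses a threshold are migrated, rather than the whole block. The threshold and the neighbor-weight model below are hypothetical placeholders, not the characterization data from the letter.

```python
from collections import defaultdict

THRESHOLD = 100_000                       # hypothetical reclaim trigger
NEIGHBOR_WEIGHTS = {-1: 3, 0: 1, 1: 3}    # hypothetical stress model:
                                          # adjacent WLs are disturbed more

class StrawSketch:
    def __init__(self):
        self.stress = defaultdict(int)    # (block, wl) -> accumulated stress
        self.reclaimed = []               # WLs migrated so far

    def read(self, block, wl):
        for off, w in NEIGHBOR_WEIGHTS.items():
            key = (block, wl + off)
            self.stress[key] += w
            if self.stress[key] >= THRESHOLD:
                self.reclaimed.append(key)  # migrate just this word line
                self.stress[key] = 0        # fresh data, stress resets
```

Because stress accumulates per WL, a read-hot page triggers reclaim only for its heavily disturbed neighbors, which is what avoids the page writes a block-level policy would incur.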
Citations: 0
Electra: Eliminating the Ineffectual Computations on Bitmap Compressed Matrices
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-12-12 | DOI: 10.1109/LCA.2024.3516057
Chaithanya Krishna Vadlamudi;Bahar Asgari
The primary computations in several applications, such as deep learning recommendation models, graph neural networks, and scientific computing, involve sparse matrix sparse matrix multiplications (SpMSpM). Unlike standard multiplications, SpMSpMs introduce ineffectual computations that can negatively impact performance. While several accelerators have been proposed to execute SpMSpM more efficiently, they often incur additional overhead in identifying the effectual arithmetic computations. To solve this issue, we propose Electra, a novel approach designed to reduce ineffectual computations in bitmap-compressed matrices. Electra achieves this by i) performing logical operations on the bitmap data to know whether the arithmetic computation has a zero or non-zero value, and ii) implementing finer granular scheduling of non-zero elements to arithmetic units. Our evaluations suggest that on average, Electra achieves a speedup of 1.27× over the state-of-the-art SpMSpM accelerator with a small area overhead of 64.92 mm² in a 45 nm process.
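The core bitmap trick generalizes easily to a scalar example. Below, each sparse vector is stored as a bitmap of occupied positions plus packed non-zero values; ANDing the two bitmaps identifies exactly the effectual products before any arithmetic is issued. This is a software analogue of the idea only; the scheduling of those products onto arithmetic units is the accelerator's contribution.

```python
def bitmap_compress(dense):
    """Dense list -> (occupancy bitmap, packed non-zero values)."""
    bitmap, vals = 0, []
    for i, v in enumerate(dense):
        if v != 0:
            bitmap |= 1 << i
            vals.append(v)
    return bitmap, vals

def sparse_dot(a, b):
    abm, avals = bitmap_compress(a)
    bbm, bvals = bitmap_compress(b)
    effectual = abm & bbm          # one logic op finds all nonzero*nonzero pairs
    total = 0
    for i in range(max(len(a), len(b))):
        if (effectual >> i) & 1:
            # Index into packed values: count set bits below position i.
            ai = bin(abm & ((1 << i) - 1)).count("1")
            bi = bin(bbm & ((1 << i) - 1)).count("1")
            total += avals[ai] * bvals[bi]
    return total
```

Every multiply issued here is effectual by construction; positions where either operand is zero never reach the arithmetic loop.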
Citations: 0
IntervalSim++: Enhanced Interval Simulation for Unbalanced Processor Designs
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-12-09 | DOI: 10.1109/LCA.2024.3514917
Haseung Bong;Nahyeon Kang;Youngsok Kim;Joonsung Kim;Hanhwi Jang
As processor microarchitectures grow more complex, an accurate analytic model becomes crucial for exploring a large processor design space within limited development time. Interval simulation is a widely used analytic model for early-stage processor design. However, it cannot accurately model modern microarchitectures, which have an unbalanced pipeline. In this work, we introduce IntervalSim++, an accurate analytic model for a modern microarchitecture design based on the interval simulation. We identify key components highly related to the unbalanced pipeline and propose new modeling techniques atop the interval simulation without incurring significant overheads. Our evaluations show IntervalSim++ accurately models a modern out-of-order processor with minimal overheads, showing 1% average CPI error and only 8.8% simulation time increase compared to the baseline interval simulation.
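For context, baseline interval analysis models execution as a smooth dispatch stream punctured by miss-event intervals, each charging a fixed penalty. The first-order form below is the textbook model that IntervalSim++ extends (its unbalanced-pipeline corrections are not shown); all the example counts and penalties are illustrative only.

```python
def interval_cycles(n_insts, width, events):
    """First-order interval model.

    n_insts: dynamic instruction count
    width:   dispatch width of the core
    events:  list of (occurrences, penalty_cycles) miss-event classes,
             e.g. branch mispredictions, long-latency cache misses.
    """
    base = n_insts / width                      # ideal steady-state dispatch
    return base + sum(c * p for c, p in events) # plus serialized miss intervals

# Illustrative numbers: 1M instructions on a 4-wide core, with 2k branch
# mispredicts (15 cycles each) and 5k L2 misses (100 cycles each).
cycles = interval_cycles(1_000_000, 4, [(2_000, 15), (5_000, 100)])
cpi = cycles / 1_000_000
```

The model's appeal is that each miss-event class contributes an additive, separately analyzable term, which is also why pipeline effects that violate the balanced-interval assumption need the kind of corrections this letter proposes.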
Citations: 0
SCALES: SCALable and Area-Efficient Systolic Accelerator for Ternary Polynomial Multiplication
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-25 | DOI: 10.1109/LCA.2024.3505872
Samuel Coulon;Tianyou Bao;Jiafeng Xie
Polynomial multiplication is a key component in many post-quantum cryptography and homomorphic encryption schemes. One recurring variation, ternary polynomial multiplication over the ring Z_q/(x^n + 1), where one input polynomial has ternary coefficients {−1, 0, 1} and the other has large integer coefficients {0, q−1}, has recently drawn significant attention from various communities. Following this trend, this paper presents a novel SCALable and area-Efficient Systolic (SCALES) accelerator for ternary polynomial multiplication. In total, we have carried out three layers of coherent interdependent efforts. First, we have rigorously derived a novel block-processing strategy and algorithm based on the schoolbook method for polynomial multiplication. Then, we have innovatively implemented the proposed algorithm as the SCALES accelerator with the help of a number of field-programmable gate array (FPGA)-oriented optimization techniques. Lastly, we have conducted a thorough implementation analysis to showcase the efficiency of the proposed accelerator. The comparison demonstrated that the SCALES accelerator has at least 19.0% and 23.8% less equivalent area-time product (eATP) than the state-of-the-art designs. We hope this work can stimulate continued research in the field.
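The arithmetic setting is easy to state in code. In Z_q/(x^n + 1), x^n wraps around with a sign flip (negacyclic reduction), and when one operand is ternary every partial product is just an add, a subtract, or a skip, never a true multiplication. The sketch below shows only this schoolbook loop nest; the letter's contribution is the block-processed systolic mapping of it, which is not reproduced here.

```python
def ternary_negacyclic_mul(a, b, q):
    """Multiply a*b mod (x^n + 1, q).

    a: ternary coefficients in {-1, 0, 1}
    b: integer coefficients in [0, q)
    Both lists have length n.
    """
    n = len(a)
    c = [0] * n
    for i, ai in enumerate(a):
        if ai == 0:
            continue                              # zero coefficient: skip
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q       # ordinary x^k term
            else:
                c[k - n] = (c[k - n] - ai * bj) % q  # x^n = -1: wrap negated
    return c
```

Since ai is ±1, the inner update is purely a modular addition or subtraction of bj, which is what makes multiplier-free hardware datapaths attractive for this workload.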
Citations: 0
A Case for Hardware Memoization in Server CPUs
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-22 | DOI: 10.1109/LCA.2024.3505075
Farid Samandi;Natheesan Ratnasegar;Michael Ferdman
Server applications exhibit a high degree of code repetition because they handle many similar requests. In turn, repeated execution of the same code, often with identical inputs, highlights an inefficiency in the execution of server software and suggests memoization as a way to improve performance. Memoization has been extensively explored in software, and several hardware- and hardware-assisted memoization schemes have been proposed in the literature. However, these works targeted memoization of mathematical or algorithmic processing, whereas server applications call for a different approach. We observe that the opportunity for memoization in servers arises not from eliminating the repetition of complex computation, but from eliminating the repetition of software orchestration code. This work studies hardware memoization in servers, ultimately focusing on one pattern, instruction sequences starting with indirect jumps. We explore how an out-of-order pipeline can be extended to support memoization of these instruction sequences, demonstrating the potential of hardware memoization for servers. Using 26 applications to make our case (3 CloudSuite workloads and 23 vSwarm serverless functions), we show how targeting just this one pattern of instruction sequences can memoize over 10% (up to 15.6%) of the dynamically executed instructions in these server applications.
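The software analogue of the observation is familiar: when identical inputs drive the same orchestration code, the result can be served from a cache instead of re-executed. The letter's proposal does this transparently in hardware for instruction sequences starting at indirect jumps; the snippet below is only the software-level intuition, with `route` a hypothetical stand-in for a pure dispatch handler.

```python
import functools

CALLS = {"count": 0}   # counts how often the real work actually runs

@functools.lru_cache(maxsize=None)
def route(method, path):
    """Hypothetical request-dispatch handler; assumed side-effect-free."""
    CALLS["count"] += 1             # real work happens only on a cache miss
    return f"{method}:{path}:handler"
```

Repeating a request re-uses the cached result, so the handler body runs once per distinct input; hardware memoization aims for the same effect without any software change.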
Citations: 0
Characterization and Analysis of the 3D Gaussian Splatting Rendering Pipeline
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-21 | DOI: 10.1109/LCA.2024.3504579
Jiwon Lee;Yunjae Lee;Youngeun Kwon;Minsoo Rhu
Novel view synthesis, a task generating a 2D image frame from a specific viewpoint within a 3D object or scene, plays a crucial role in 3D rendering. Neural Radiance Field (NeRF) emerged as a prominent method for implementing novel view synthesis, but 3D Gaussian Splatting (3DGS) recently began to emerge as a viable alternative. Despite the tremendous interest from both academia and industry, there has been a lack of research to identify the computational bottlenecks of 3DGS, which is critical for its deployment in real-world products. In this work, we present a comprehensive end-to-end characterization of the 3DGS rendering pipeline, identifying the alpha blending stage within the tile-based rasterizer as causing a significant performance bottleneck. Based on our findings, we discuss several future research directions aiming to inspire continued exploration within this burgeoning application domain.
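The stage identified as the bottleneck, alpha blending in the tile-based rasterizer, has a simple scalar form: Gaussians sorted front-to-back are composited per pixel as C = Σ c_i·α_i·Π_{j<i}(1−α_j), with early termination once the remaining transmittance is negligible. A minimal per-pixel sketch (not the authors' CUDA implementation):

```python
def alpha_blend(gaussians, eps=1e-4):
    """gaussians: front-to-back list of (color, alpha) covering one pixel."""
    color, transmittance = 0.0, 1.0
    for c, a in gaussians:
        color += c * a * transmittance   # this splat's contribution
        transmittance *= 1.0 - a         # light remaining for splats behind
        if transmittance < eps:
            break                        # remaining splats cannot contribute
    return color
```

The serialized dependence on the running transmittance, per pixel and per sorted splat, is what makes this loop hard to parallelize and a natural target for the architectural optimizations the letter suggests.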
Citations: 0