
IEEE Computer Architecture Letters: Latest Publications

A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-28 | DOI: 10.1109/LCA.2025.3546811
Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu
Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state of the art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, research efforts have applied the Transformer architecture to recommendation systems, and the Hierarchical Sequential Transduction Unit (HSTU), an encoder architecture, has been proposed to address these scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate that the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.
{"title":"A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit","authors":"Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu","doi":"10.1109/LCA.2025.3546811","DOIUrl":"https://doi.org/10.1109/LCA.2025.3546811","url":null,"abstract":"Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state-of-the-art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, there have been research efforts applying Transformer architecture to recommendation systems, and Hierarchical Sequential Transaction Unit (HSTU), an encoder architecture, has been proposed to address scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"85-88"},"PeriodicalIF":1.4,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
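The characterization's headline finding is that attention dominates HSTU inference. As a rough back-of-the-envelope illustration (not taken from the paper; the hidden size and sequence lengths below are assumptions), the sketch compares the quadratic attention term against the pointwise feed-forward term and shows how attention overtakes the rest of the block as user-interaction sequences grow.

```python
# Illustrative FLOP comparison: attention vs. pointwise feed-forward in one
# encoder block. All sizes are assumed for illustration, not from the paper.

def attention_flops(seq_len, d_model):
    # Q/K/V/output projections: 4 * seq_len * d_model^2 MACs -> 8 * ... FLOPs
    proj = 8 * seq_len * d_model * d_model
    # QK^T and attention*V: 2 * seq_len^2 * d_model MACs -> 4 * ... FLOPs
    attn = 4 * seq_len * seq_len * d_model
    return proj + attn

def pointwise_flops(seq_len, d_model, expansion=4):
    # Two dense layers of the feed-forward block.
    return 4 * seq_len * d_model * expansion * d_model

if __name__ == "__main__":
    d_model = 512  # assumed hidden size
    for seq_len in (512, 2048, 8192):
        a = attention_flops(seq_len, d_model)
        p = pointwise_flops(seq_len, d_model)
        print(f"seq_len={seq_len:5d}  attention={a/1e9:8.2f} GFLOP  "
              f"pointwise={p/1e9:8.2f} GFLOP  ratio={a/p:.2f}")
```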
Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-27 | DOI: 10.1109/LCA.2025.3544989
Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz
The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for the large matrix multiplications common in machine learning (ML) models, image processing, and other computation-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring knowledge of confidence scores, privileged access, or physical proximity. Our attack method can fully recover the sparsity of the weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.
{"title":"Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX","authors":"Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3544989","DOIUrl":"https://doi.org/10.1109/LCA.2025.3544989","url":null,"abstract":"The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computational-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
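As a loosely related illustration of the measurement side of such attacks, the sketch below shows a simple median-based distinguisher that labels a secret operand as sparse or dense from repeated latency samples. This is not the authors' AMX probe, and the latency distributions are entirely made up; it only illustrates the statistical step that value-dependent timing attacks rely on.

```python
# Toy distinguisher for a value-dependent timing channel. Synthetic timings;
# a real attack would collect them from the hardware probe, not reproduced here.
import random
import statistics

def classify(samples, threshold_ns):
    """Label a batch of latency samples as 'sparse' or 'dense'."""
    return "sparse" if statistics.median(samples) < threshold_ns else "dense"

if __name__ == "__main__":
    random.seed(0)
    # Assumed (made-up) latency distributions, in nanoseconds.
    sparse_probe = [random.gauss(500, 15) for _ in range(1000)]
    dense_probe = [random.gauss(540, 15) for _ in range(1000)]
    threshold = (statistics.median(sparse_probe) + statistics.median(dense_probe)) / 2
    print(classify(sparse_probe, threshold))  # expected: sparse
    print(classify(dense_probe, threshold))   # expected: dense
```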
A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-26 | DOI: 10.1109/LCA.2025.3545799
Omer Khan
Graph-based neural networks have seen tremendous adoption for complex predictive analytics on massive real-world graphs. Hardware-acceleration efforts have identified significant challenges in harnessing graph locality and in managing the workload imbalance caused by ultra-sparse, irregular matrix computations at massively parallel scale. State-of-the-art hardware accelerators rely on massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy-efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in graph-based matrix operators, enabling a low-latency data-access paradigm in hardware and robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher, delivering at the 1000-core scale an average 20× performance-per-watt advantage over a state-of-the-art NVIDIA GPU.
{"title":"A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks","authors":"Omer Khan","doi":"10.1109/LCA.2025.3545799","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545799","url":null,"abstract":"Graphs-based neural networks have seen tremendous adoption to perform complex predictive analytics on massive real-world graphs. The trend in hardware acceleration has identified significant challenges with harnessing graph locality and workload imbalance due to ultra-sparse and irregular matrix computations at a massively parallel scale. State-of-the-art hardware accelerators utilize massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in the graphs-based matrix operators to achieve a low-latency data access paradigm in hardware to achieve robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher at the 1000 cores scale to deliver an average 20× performance per watt advantage over state-of-the-art NVIDIA GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"73-76"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143698183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
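The letter's key idea is a graph-aware data prefetcher per core. The sketch below is a purely software model of the basumed basic behavior (an assumption for illustration, not the paper's hardware design): while vertex v is being aggregated, the prefetcher walks a few rows ahead in the CSR row-pointer array and issues the neighbor-list ranges that will be needed next.

```python
# Software model of a graph-aware prefetcher over a CSR graph (illustrative).

def csr_prefetch_trace(row_ptr, col_idx, lookahead=4):
    """Yield (vertex, neighbors, prefetch targets) as the core walks the graph."""
    n = len(row_ptr) - 1
    for v in range(n):
        neighbors = col_idx[row_ptr[v]:row_ptr[v + 1]]
        # Neighbor-list ranges the prefetcher would issue while v is processed.
        pf = [(u, row_ptr[u], row_ptr[u + 1])
              for u in range(v + 1, min(v + 1 + lookahead, n))]
        yield v, neighbors, pf

if __name__ == "__main__":
    # Tiny example graph in CSR form (assumed for illustration).
    row_ptr = [0, 2, 4, 5, 7]
    col_idx = [1, 3, 0, 2, 1, 0, 2]
    for v, nbrs, pf in csr_prefetch_trace(row_ptr, col_idx):
        print(f"vertex {v}: neighbors {nbrs}, prefetch rows for {[p[0] for p in pf]}")
```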
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-26 | DOI: 10.1109/LCA.2025.3541961
Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi
We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. Models struggle most noticeably on questions about memory systems and interconnection networks. Fine-tuning with QuArch improves small-model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are accessible at https://quarch.ai/.
{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture","authors":"Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi","doi":"10.1109/LCA.2025.3541961","DOIUrl":"https://doi.org/10.1109/LCA.2025.3541961","url":null,"abstract":"We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at <uri>https://quarch.ai/</uri>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"105-108"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
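For readers unfamiliar with this style of benchmark, the sketch below shows a minimal exact-match evaluation loop over question-answer pairs. The data format and the `ask_model` stub are assumptions for illustration, not the QuArch schema; see https://quarch.ai/ for the actual dataset and leaderboard.

```python
# Minimal exact-match accuracy loop over (question, answer) pairs (illustrative).

def ask_model(question: str) -> str:
    # Placeholder "model": a real evaluation would call an LLM here.
    return "B"

def exact_match_accuracy(qa_pairs):
    correct = sum(1 for q, gold in qa_pairs
                  if ask_model(q).strip().lower() == gold.strip().lower())
    return correct / len(qa_pairs)

if __name__ == "__main__":
    qa_pairs = [  # made-up examples in the spirit of the dataset
        ("Which cache write policy updates memory only on eviction? (A) write-through (B) write-back", "B"),
        ("Does a TLB cache virtual-to-physical translations? (A) yes (B) no", "A"),
    ]
    print(f"accuracy = {exact_match_accuracy(qa_pairs):.2f}")
```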
A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-24 | DOI: 10.1109/LCA.2025.3545145
Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong
Precision-scalable convolutional neural networks (CNNs) offer a promising way to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the small, fine-grained multiplications required by precision-scalable (PS) networks have seen limited exploration on FPGA platforms. Deploying PS accelerators faces two challenges: LUT-based multiply-accumulate (MAC) units fail to make full use of DSPs, while DSP-based MACs support only limited precision combinations and cannot utilize DSPs efficiently. This brief therefore proposes a DSP-based precision-scalable MAC with a hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluated on a mixed 4b/8b VGG16, the proposed accelerator achieves a 3.97× performance improvement over the 8b baseline with only a 0.37% accuracy degradation. Compared with state-of-the-art accelerators, it achieves 1.20×–2.69× higher DSP efficiency and 1.63×–6.34× higher LUT efficiency.
{"title":"A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator","authors":"Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3545145","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545145","url":null,"abstract":"Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSP, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSP. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluating on mixed 4 b/8b VGG16, compared with 8b baseline, the proposed accelerator achieves 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves 1.20 × −2.69× improvement in DSP efficiency and 1.63 × −6.34× improvement in LUT efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"65-68"},"PeriodicalIF":1.4,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
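The core idea behind a precision-scalable MAC is that wider products can be assembled from narrower partial products, so one datapath can serve several precision combinations. The sketch below illustrates that decomposition in plain Python; it is an illustration of the general principle only, not the paper's DSP mapping or hybrid dataflow.

```python
# An 8b x 8b product assembled from four 4b x 4b partial products, the bit-level
# identity that lets the same 4b multipliers serve either four 4b MACs or one 8b MAC.

def mul8_from_4bit_parts(a: int, b: int) -> int:
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # (16*a_hi + a_lo) * (16*b_hi + b_lo)
    return ((a_hi * b_hi) << 8) + ((a_hi * b_lo + a_lo * b_hi) << 4) + (a_lo * b_lo)

if __name__ == "__main__":
    for a, b in [(0x7F, 0x3C), (255, 255), (18, 201)]:
        assert mul8_from_4bit_parts(a, b) == a * b
    print("8b products reconstructed correctly from 4b partial products")
```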
Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-18 | DOI: 10.1109/LCA.2025.3540058
Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten
Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.
{"title":"Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference","authors":"Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten","doi":"10.1109/LCA.2025.3540058","DOIUrl":"https://doi.org/10.1109/LCA.2025.3540058","url":null,"abstract":"Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"49-52"},"PeriodicalIF":1.4,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
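To see why raising the stack count matters, the sketch below does the rough capacity and bandwidth arithmetic for an HBM memory system as stacks are added. The per-stack figures are generic HBM3-class assumptions, not numbers from the letter, and optical-link overheads are ignored.

```python
# Back-of-the-envelope scaling of an HBM memory system with stack count.
# Per-stack values are assumed HBM3-class figures, not from the paper.

STACK_CAPACITY_GB = 24   # assumed capacity per HBM stack
STACK_BW_GBPS = 819      # assumed peak bandwidth per HBM stack (GB/s)

def memory_system(num_stacks):
    return num_stacks * STACK_CAPACITY_GB, num_stacks * STACK_BW_GBPS

if __name__ == "__main__":
    for stacks in (6, 16, 32):  # e.g., on-package limit vs. optically attached modules
        cap, bw = memory_system(stacks)
        print(f"{stacks:2d} stacks -> {cap:5d} GB capacity, {bw/1000:5.1f} TB/s aggregate bandwidth")
```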
Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-13 | DOI: 10.1109/LCA.2025.3532682
Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park
Processing-in-Memory (PIM) is emerging as a promising next-generation hardware platform for addressing memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs' large footprint makes accelerating them on PIM challenging, as the available area is limited. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures suffer performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations, 1) scale cascading and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and incurs only a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.
{"title":"Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization","authors":"Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park","doi":"10.1109/LCA.2025.3532682","DOIUrl":"https://doi.org/10.1109/LCA.2025.3532682","url":null,"abstract":"Processing-in-Memory (PIM) is emerging as a promising next-generation hardware to address memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs’ large footprint poses significant challenges for accelerating them on PIM due to limited available space. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures experience performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations: 1) scale cascading, and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and achieves a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"53-56"},"PeriodicalIF":1.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10886951","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
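Group-wise weight quantization itself is simple to state: each group of (say) 128 weights shares one scale, weights are stored as 4-bit integers, and dequantization multiplies each group by its scale before the GEMV. The NumPy sketch below shows that per-group bookkeeping; a plain symmetric round-to-nearest scheme is assumed, and this is a functional illustration rather than the letter's PIM datapath.

```python
# Minimal group-wise 4-bit quantize/dequantize reference (illustrative only).
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1                       # symmetric signed range, e.g. [-7, 7]
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard against all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)               # one FP16 scale per group

def dequantize_groupwise(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1024).astype(np.float32)
    q, s = quantize_groupwise(w)
    w_hat = dequantize_groupwise(q, s)
    print("max abs error:", np.abs(w - w_hat).max())
```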
Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-06 | DOI: 10.1109/LCA.2025.3539371
Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller
Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. A comprehensive understanding of the improvements such modifications bring is, however, imperative for devising GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance-improvement potential of a series of architectural modifications. We find that aggregation's low locality degrades performance as thread-level parallelism increases, and that memory-access optimizations deliver significant gains which remain effective even when combined with software optimizations. Our analysis provides insights for hardware optimizations that can significantly improve GNN aggregation on GPUs.
{"title":"Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs","authors":"Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller","doi":"10.1109/LCA.2025.3539371","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539371","url":null,"abstract":"Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
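For context, the aggregation phase being characterized is essentially a sparse gather-and-reduce over neighbor feature rows. The sketch below shows that access pattern on a tiny CSR graph (graph and feature sizes are illustrative assumptions); the data-dependent indexing into the feature matrix is what makes the phase memory-bound and locality-poor.

```python
# Reference sum-aggregation over a CSR graph: gather neighbor feature rows, reduce.
import numpy as np

def aggregate_sum(row_ptr, col_idx, features):
    out = np.zeros_like(features)
    for v in range(len(row_ptr) - 1):
        nbrs = col_idx[row_ptr[v]:row_ptr[v + 1]]
        if len(nbrs):
            out[v] = features[nbrs].sum(axis=0)  # irregular gather + reduce
    return out

if __name__ == "__main__":
    row_ptr = np.array([0, 2, 4, 5, 7])
    col_idx = np.array([1, 3, 0, 2, 1, 0, 2])
    feats = np.arange(4 * 8, dtype=np.float32).reshape(4, 8)
    print(aggregate_sum(row_ptr, col_idx, feats))
```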
Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-06 | DOI: 10.1109/LCA.2025.3539282
Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh
Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (HWSFs) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called security helper chiplets. We study the cost implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.
{"title":"Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring","authors":"Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh","doi":"10.1109/LCA.2025.3539282","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539282","url":null,"abstract":"Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (<italic>HWSFs</i>) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called <italic>security helper chiplets</i>. We study the <italic>cost</i> implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"61-64"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3535470
Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.
{"title":"RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models","authors":"Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim","doi":"10.1109/LCA.2025.3535470","DOIUrl":"https://doi.org/10.1109/LCA.2025.3535470","url":null,"abstract":"The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce <monospace>RoPIM</monospace>, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. <monospace>RoPIM</monospace> achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, <monospace>RoPIM</monospace> proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that <monospace>RoPIM</monospace> achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
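The RoPE operation that RoPIM targets is itself compact: each even/odd pair of query or key features is rotated by a position-dependent angle. The NumPy sketch below is a functional reference of the standard formulation (base 10000, interleaved pairing), not the paper's PIM mapping or data layout.

```python
# Reference rotary positional embedding (RoPE) for a single vector.
import numpy as np

def rope(x, position, base=10000.0):
    """Apply RoPE to a 1-D vector `x` of even length at token index `position`."""
    d = x.shape[-1]
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)   # theta_i = base^(-2i/d)
    angles = position * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin         # rotate each (even, odd) pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

if __name__ == "__main__":
    q = np.arange(8, dtype=np.float32)
    print(rope(q, position=5))
```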