
Latest Publications in IEEE Computer Architecture Letters

QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-26 | DOI: 10.1109/LCA.2025.3541961
Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi
We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at https://quarch.ai/.
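As a rough illustration of how accuracy on a multiple-choice QA benchmark of this kind can be computed, the sketch below scores a prediction callback over a list of question records. The record fields and the toy data are assumptions made for illustration, not the actual QuArch schema.

```python
def score_qa(records, predict):
    """Accuracy of `predict` over multiple-choice QA records.

    Each record is assumed (hypothetically) to look like:
    {"question": str, "choices": [str, ...], "answer": int}
    """
    correct = sum(
        1 for rec in records
        if predict(rec["question"], rec["choices"]) == rec["answer"]
    )
    return correct / len(records)


# Toy records standing in for human-validated architecture QA pairs.
toy = [
    {"question": "Which cache write policy defers writes to main memory?",
     "choices": ["write-through", "write-back"], "answer": 1},
    {"question": "Which memory technology requires periodic refresh?",
     "choices": ["SRAM", "DRAM"], "answer": 1},
]
# A trivial "model" that always picks the second choice scores 1.0 here.
print(score_qa(toy, lambda question, choices: 1))
```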
{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture","authors":"Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi","doi":"10.1109/LCA.2025.3541961","DOIUrl":"https://doi.org/10.1109/LCA.2025.3541961","url":null,"abstract":"We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at <uri>https://quarch.ai/</uri>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"105-108"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-24 | DOI: 10.1109/LCA.2025.3545145
Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong
Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSP, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSP. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluating on mixed 4b/8b VGG16, compared with the 8b baseline, the proposed accelerator achieves a 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves a 1.20×–2.69× improvement in DSP efficiency and a 1.63×–6.34× improvement in LUT efficiency.
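The core trick behind DSP-based precision-scalable MACs, packing several narrow multiplications into a single wide multiplier, can be sketched in software. The bit widths, the shared-weight assumption, and the 12-bit guard-band offset below are illustrative choices, not the exact mapping used by this letter's accelerator.

```python
def packed_mac_4b(a0, a1, w, acc=0):
    """Two unsigned 4-bit activations multiplied by one 4-bit weight using a
    single wide multiplication, mimicking how one DSP slice can serve several
    low-precision MACs per cycle."""
    assert all(0 <= x < 16 for x in (a0, a1, w))
    packed = a0 | (a1 << 12)      # 12-bit offset keeps the 8-bit partial products apart
    wide = packed * w             # one wide multiply on the packed operand
    p0 = wide & 0xFF              # recovers a0 * w
    p1 = (wide >> 12) & 0xFF      # recovers a1 * w
    return acc + p0 + p1


# Sanity check against the unpacked computation.
assert packed_mac_4b(5, 9, 7) == 5 * 7 + 9 * 7
print(packed_mac_4b(5, 9, 7))     # 98
```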
{"title":"A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator","authors":"Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3545145","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545145","url":null,"abstract":"Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSP, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSP. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluating on mixed 4 b/8b VGG16, compared with 8b baseline, the proposed accelerator achieves 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves 1.20 × −2.69× improvement in DSP efficiency and 1.63 × −6.34× improvement in LUT efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"65-68"},"PeriodicalIF":1.4,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-18 | DOI: 10.1109/LCA.2025.3540058
Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten
Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.
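The capacity and bandwidth argument can be made concrete with simple arithmetic. The per-stack figures below are generic HBM-class numbers chosen for illustration, not the parameters modeled in the letter.

```python
def hbm_system(num_stacks, gb_per_stack=24, gbps_per_stack=819):
    """Aggregate capacity (GB) and bandwidth (GB/s) for a given stack count."""
    return num_stacks * gb_per_stack, num_stacks * gbps_per_stack


# A package-limited baseline (a handful of stacks beside the compute die)
# versus an optically connected system that moves additional stacks off-package.
for stacks in (6, 16, 32):
    capacity, bandwidth = hbm_system(stacks)
    print(f"{stacks:2d} stacks: {capacity:5d} GB, {bandwidth:6d} GB/s aggregate")
```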
{"title":"Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference","authors":"Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten","doi":"10.1109/LCA.2025.3540058","DOIUrl":"https://doi.org/10.1109/LCA.2025.3540058","url":null,"abstract":"Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"49-52"},"PeriodicalIF":1.4,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-13 | DOI: 10.1109/LCA.2025.3532682
Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park
Processing-in-Memory (PIM) is emerging as a promising next-generation hardware to address memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs’ large footprint poses significant challenges for accelerating them on PIM due to limited available space. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures experience performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations: 1) scale cascading, and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and achieves a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.
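For readers unfamiliar with group-wise weight quantization, the sketch below shows the dequantization that the letter's scale cascading and INT2FP converter accelerate in hardware: INT4 weights are converted to FP16 and multiplied by a per-group FP16 scale, with a group size of 128 as in the evaluation. The symmetric quantization scheme itself is a generic assumption, not the paper's exact recipe.

```python
import numpy as np

GROUP = 128  # group size used in the letter's 4-bit GWQ evaluation


def quantize_groupwise(w_fp16, group=GROUP):
    """Symmetric 4-bit group-wise quantization with one FP16 scale per group."""
    w = w_fp16.reshape(-1, group).astype(np.float32)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0      # INT4 range [-8, 7]
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)


def dequantize_groupwise(q, scales):
    """INT4 -> FP16 conversion followed by the per-group scale multiply."""
    return (q.astype(np.float16) * scales).reshape(-1)


w = np.random.randn(1024).astype(np.float16)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s)
print("max abs error:", float(np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max()))
```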
{"title":"Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization","authors":"Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park","doi":"10.1109/LCA.2025.3532682","DOIUrl":"https://doi.org/10.1109/LCA.2025.3532682","url":null,"abstract":"Processing-in-Memory (PIM) is emerging as a promising next-generation hardware to address memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs’ large footprint poses significant challenges for accelerating them on PIM due to limited available space. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures experience performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations: 1) scale cascading, and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and achieves a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"53-56"},"PeriodicalIF":1.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10886951","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-06 | DOI: 10.1109/LCA.2025.3539371
Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller
Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.
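A minimal sketch of the aggregation phase in question: sum-aggregation over edges, where each edge gathers a source-node row and scatter-adds it into a destination-node row. The irregular row indices are what make this phase memory-bound; the NumPy version below only illustrates the access pattern, not a GPU implementation.

```python
import numpy as np

def aggregate_sum(features, edges):
    """Sum-aggregation over a graph's edges (the memory-bound GNN phase).

    features: (num_nodes, dim) node feature matrix
    edges:    (num_edges, 2) array of (src, dst) index pairs
    """
    out = np.zeros_like(features)
    # Gather source rows, scatter-add into destination rows.
    np.add.at(out, edges[:, 1], features[edges[:, 0]])
    return out


rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 64)).astype(np.float32)
edges = rng.integers(0, 1000, size=(5000, 2))
print(aggregate_sum(feats, edges).shape)   # (1000, 64)
```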
{"title":"Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs","authors":"Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller","doi":"10.1109/LCA.2025.3539371","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539371","url":null,"abstract":"Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-06 | DOI: 10.1109/LCA.2025.3539282
Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh
Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (HWSFs) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called security helper chiplets. We study the cost implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.
{"title":"Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring","authors":"Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh","doi":"10.1109/LCA.2025.3539282","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539282","url":null,"abstract":"Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (<italic>HWSFs</i>) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called <italic>security helper chiplets</i>. We study the <italic>cost</i> implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"61-64"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3528276
Sudhanva Gurumurthi;Mattan Erez
{"title":"Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters","authors":"Sudhanva Gurumurthi;Mattan Erez","doi":"10.1109/LCA.2025.3528276","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528276","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"iii-iv"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856691","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3535470
Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.
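For reference, here is a minimal NumPy sketch of the rotary embedding itself, in the split-half formulation commonly used by LLaMA-style implementations: pairs of channels are rotated by position-dependent angles so that relative offsets are captured in the query/key dot product. This is only the mathematical operation RoPIM targets, not its PIM mapping or data layout.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embedding to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2                                        # head_dim must be even
    inv_freq = base ** (-np.arange(half) / half)           # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


q = np.random.randn(8, 64)   # 8 tokens, head dimension 64
print(rope(q).shape)         # (8, 64)
```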
{"title":"RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models","authors":"Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim","doi":"10.1109/LCA.2025.3535470","DOIUrl":"https://doi.org/10.1109/LCA.2025.3535470","url":null,"abstract":"The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce <monospace>RoPIM</monospace>, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. <monospace>RoPIM</monospace> achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, <monospace>RoPIM</monospace> proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that <monospace>RoPIM</monospace> achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hardware-Accelerated Kernel-Space Memory Compression Using Intel QAT
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3534831
Qirong Xia;Houxiang Ji;Yang Zhou;Nam Sung Kim
Data compression has been widely used by datacenters to decrease the consumption of not only the memory and storage capacity but also the interconnect bandwidth. Nonetheless, the CPU cycles consumed for data compression notably contribute to the overall datacenter taxes. To provide a cost-efficient data compression capability for datacenters, Intel has introduced QuickAssist Technology (QAT), a PCIe-attached data-compression accelerator. In this work, we first comprehensively evaluate the compression/decompression performance of the latest on-chip QAT accelerator and then compare it with that of the previous-generation off-chip QAT accelerator. Subsequently, as a compelling application for QAT, we take a Linux memory optimization kernel feature: compressed cache for swap pages (zswap), re-implement it to use QAT efficiently, and then compare the performance of QAT-based zswap with that of CPU-based zswap. Our evaluation shows that the deployment of CPU-based zswap increases the tail latency of a co-running latency-sensitive application, Redis, by 3.2-12.1×, while that of QAT-based zswap does not notably increase the tail latency compared to no deployment of zswap.
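The trade-off being measured, compression ratio versus the time spent compressing 4 KiB swap pages, can be illustrated with a software stand-in. The sketch below uses Python's zlib purely as a proxy for the compression engine; it says nothing about QAT's actual throughput or the kernel's zswap code path.

```python
import os
import time
import zlib

def compress_pages(pages, level=1):
    """Compress a batch of 4 KiB page-sized buffers; return (ratio, seconds)."""
    start = time.perf_counter()
    compressed = [zlib.compress(page, level) for page in pages]
    elapsed = time.perf_counter() - start
    in_bytes = sum(len(p) for p in pages)
    out_bytes = sum(len(c) for c in compressed)
    return in_bytes / out_bytes, elapsed


# Mostly-zero pages compress very well; random pages barely compress at all.
zero_like = [bytes(4096) for _ in range(256)]
random_like = [os.urandom(4096) for _ in range(256)]
for name, pages in (("zero-like", zero_like), ("random", random_like)):
    ratio, seconds = compress_pages(pages)
    print(f"{name:10s}: {ratio:6.1f}x ratio in {seconds * 1e3:.2f} ms")
```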
{"title":"Hardware-Accelerated Kernel-Space Memory Compression Using Intel QAT","authors":"Qirong Xia;Houxiang Ji;Yang Zhou;Nam Sung Kim","doi":"10.1109/LCA.2025.3534831","DOIUrl":"https://doi.org/10.1109/LCA.2025.3534831","url":null,"abstract":"Data compression has been widely used by datacenters to decrease the consumption of not only the memory and storage capacity but also the interconnect bandwidth. Nonetheless, the CPU cycles consumed for data compression notably contribute to the overall datacenter taxes. To provide a cost-efficient data compression capability for datacenters, Intel has introduced QuickAssist Technology (QAT), a PCIe-attached data-compression accelerator. In this work, we first comprehensively evaluate the compression/decompression performance of the latest <italic>on-chip</i> QAT accelerator and then compare it with that of the previous-generation <italic>off-chip</i> QAT accelerator. Subsequently, as a compelling application for QAT, we take a Linux memory optimization kernel feature: compressed cache for swap pages (<monospace>zswap</monospace>), re-implement it to use QAT efficiently, and then compare the performance of QAT-based <monospace>zswap</monospace> with that of CPU-based <monospace>zswap</monospace>. Our evaluation shows that the deployment of CPU-based <monospace>zswap</monospace> increases the tail latency of a co-running latency-sensitive application, Redis by 3.2-12.1×, while that of QAT-based <monospace>zswap</monospace> does not notably increase the tail latency compared to no deployment of <monospace>zswap</monospace>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"57-60"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856688","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143619076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Scalable RDMA Through Resource Prefetching
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-27 | DOI: 10.1109/LCA.2025.3534188
Zhenlong Ma;Ning Kang;Fan Yang;Chongyang Hong;Jing Xu;Guojun Yuan;Peiheng Zhang;Zhan Wang;Ninghui Sun
RDMA network is being widely deployed in data centers, high-performance computing, and AI clusters. By offloading the network processing protocol stack to hardware, RDMA bypasses the operating system kernel, thereby enabling high performance and low CPU overhead. However, the protocol processing demands substantial communication resources, and due to the limited hardware resources, commercial NICs (Network Interface Cards) experience a significant number of cache misses in large-scale connection scenarios. This results in performance degradation, indicating that RDMA lacks scalability. In this paper, we first analyze the characteristics of resource access in RDMA. Based on these characteristics, we propose a resource access prediction and prefetching mechanism in the hardware, which preemptively fetches the resources required by the protocol processing pipeline to the on-chip cache. This mechanism increases the NIC’s cache hit ratio. Evaluation results demonstrate that our approach improves throughput by 125% and reduces latency by 17.9% under large-scale communication scenarios.
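A toy model of why prefetching connection state helps: with far more connections than on-NIC cache entries, a round-robin service order defeats LRU entirely, while an idealized, always-correct prefetch of the next connection's context converts those misses into hits. The cache size, workload, and prefetch oracle below are illustrative assumptions, not the letter's hardware design.

```python
from collections import OrderedDict

class ConnectionCache:
    """Toy LRU cache over per-connection (queue pair) context entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, qp):
        if qp in self.lru:
            self.hits += 1
            self.lru.move_to_end(qp)
        else:
            self.misses += 1
            self.insert(qp)

    def insert(self, qp):
        self.lru[qp] = True
        self.lru.move_to_end(qp)
        if len(self.lru) > self.capacity:
            self.lru.popitem(last=False)   # evict the least recently used entry


# Round-robin over 1024 connections with only 256 cache entries.
workload = [qp for _ in range(100) for qp in range(1024)]
plain, prefetched = ConnectionCache(256), ConnectionCache(256)
for i, qp in enumerate(workload):
    plain.access(qp)
    prefetched.access(qp)
    if i + 1 < len(workload):
        prefetched.insert(workload[i + 1])   # idealized, always-correct prefetch
print("hit ratio without prefetch:", plain.hits / len(workload))
print("hit ratio with prefetch   :", prefetched.hits / len(workload))
```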
{"title":"Toward Scalable RDMA Through Resource Prefetching","authors":"Zhenlong Ma;Ning Kang;Fan Yang;Chongyang Hong;Jing Xu;Guojun Yuan;Peiheng Zhang;Zhan Wang;Ninghui Sun","doi":"10.1109/LCA.2025.3534188","DOIUrl":"https://doi.org/10.1109/LCA.2025.3534188","url":null,"abstract":"RDMA network is being widely deployed in data centers, high-performance computing, and AI clusters. By offloading the network processing protocol stack to hardware, RDMA bypasses the operating system kernel, thereby enabling high performance and low CPU overhead. However, the protocol processing demands substantial communication resources, and due to the limited hardware resources, commercial NICs (Network Interface Cards) experience a significant number of cache misses in large-scale connection scenarios. This results in performance degradation, indicating that RDMA lacks scalability. In this paper, we first analyze the characteristics of resource access in RDMA. Based on these characteristics, we propose a resource access prediction and prefetching mechanism in the hardware, which preemptively fetches the resources required by the protocol processing pipeline to the on-chip cache. This mechanism increases the NIC’s cache hit ratio. Evaluation results demonstrate that our approach improves throughput by 125% and reduces latency by 17.9% under large-scale communication scenarios.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"77-80"},"PeriodicalIF":1.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0