
Latest Publications in IEEE Computer Architecture Letters

SPGPU: Spatially Programmed GPU
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-14 | DOI: 10.1109/LCA.2024.3499339 | Vol. 23(2), pp. 223–226
Shizhuo Zhu;Illia Shkirko;Jacob Levinson;Zhengrong Wang;Tony Nowatzki
Communication is a critical bottleneck for GPUs, manifesting as energy and performance overheads due to network-on-chip (NoC) delay and congestion. While many algorithms exhibit locality among thread blocks and accessed data, modern GPUs lack the interface to exploit this locality: GPU thread blocks are mapped to cores obliviously. In this work, we explore a simple extension to the conventional GPU programming interface to enable control over the spatial placement of data and threads, yielding new opportunities for aggressive locality optimizations within a GPU kernel. Across 7 workloads that can take advantage of these optimizations, for a 32 (or 128) SM GPU, we achieve a 1.28× (1.54×) speedup and a 35% (44%) reduction in NoC traffic compared to baseline non-spatial GPUs.
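As a concrete illustration of the idea (not the paper's actual interface), here is a minimal Python sketch of a spatial placement hint: thread blocks are mapped to the SM that owns the data tile they touch, rather than obliviously. The names `NUM_SMS`, `sm_for_tile`, and `spatial_schedule` are inventions for this sketch.

```python
# Hypothetical sketch: co-locating thread blocks with the data tiles they touch.
# None of these names come from the SPGPU paper; they only illustrate what a
# spatial placement interface could express.

NUM_SMS = 32  # SMs on the modeled GPU

def sm_for_tile(tile_x, tile_y, tiles_per_row):
    """Map a data tile to the SM that owns it (simple round-robin placement)."""
    return (tile_y * tiles_per_row + tile_x) % NUM_SMS

def spatial_schedule(grid_dim, tiles_per_row):
    """Build a block -> SM placement so each block runs near its tile."""
    placement = {}
    for bx in range(grid_dim[0]):
        for by in range(grid_dim[1]):
            placement[(bx, by)] = sm_for_tile(bx, by, tiles_per_row)
    return placement

# A baseline GPU would ignore `placement`; a spatially programmed one honors it.
placement = spatial_schedule(grid_dim=(8, 8), tiles_per_row=8)
print(placement[(3, 2)])  # SM that block (3, 2) should run on
```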
Citations: 0
Quantum Assertion Scheme for Assuring Qudit Robustness
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-04 | DOI: 10.1109/LCA.2024.3483840 | Vol. 23(2), pp. 247–250
Navnil Choudhury;Chao Lu;Kanad Basu
Noisy Intermediate-Scale Quantum (NISQ) computers are impeded by constraints such as limited qubit count and susceptibility to noise, hindering the progression towards fault-tolerant quantum computing for intricate and practical applications. To augment the computational capabilities of quantum computers, research is gravitating towards qudits featuring more than two energy levels. This paper presents the inaugural examination of the repercussions of errors in qudit circuits. Subsequently, we introduce an innovative qudit-based assertion framework aimed at automatically detecting and reporting errors and warnings during the quantum circuit design and compilation process. Our proposed framework, when subjected to evaluation on existing quantum computing platforms, can detect both new and existing bugs with up to 100% coverage of the bugs mentioned in this paper.
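As an illustration only (the paper's framework is not reproduced in this listing), here is a toy Python assertion that a qudit state vector has the expected dimension d and is normalized — the kind of invariant an assertion framework might check during circuit design and compilation:

```python
import numpy as np

# Toy sketch of a qudit-state assertion, in the spirit of (but not taken from)
# the paper's framework: check dimension and normalization of a d-level state.

def assert_valid_qudit_state(state, d, atol=1e-9):
    """Raise if `state` is not a plausible single-qudit state of dimension d."""
    state = np.asarray(state, dtype=complex)
    if state.shape != (d,):
        raise AssertionError(f"expected a {d}-level state, got shape {state.shape}")
    norm = np.sum(np.abs(state) ** 2)
    if abs(norm - 1.0) > atol:
        raise AssertionError(f"state not normalized: |amplitudes|^2 sums to {norm:.6f}")

# A qutrit (d = 3) in an equal superposition passes; a truncated state would not.
assert_valid_qudit_state(np.ones(3) / np.sqrt(3), d=3)
```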
Citations: 0
ONNXim: A Fast, Cycle-Level Multi-Core NPU Simulator
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-10-22 | DOI: 10.1109/LCA.2024.3484648 | Vol. 23(2), pp. 219–222
Hyungkyu Ham;Wonhyuk Yang;Yunseon Shin;Okkyun Woo;Guseul Heo;Sangyeop Lee;Jongse Park;Gwangsun Kim
As DNNs (Deep Neural Networks) impose increasingly high compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) has become more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models in the ONNX graph format generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from SRAM with deterministic compute latency, we model computation accurately with an event-driven approach, avoiding the overhead of modeling cycle-level activities. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly model contention among multiple cores that can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to 365× over Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow speed and/or lack of functionality.
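The key modeling idea — deterministic per-tile compute latency lets computation be simulated as discrete events rather than cycle by cycle — can be sketched in a few lines of Python. This illustrates the general event-driven technique under assumed inputs, not ONNXim's actual code:

```python
import heapq

# Event-driven sketch: because each tile's compute latency is deterministic,
# the simulator can jump from event to event instead of ticking every cycle.

def simulate_tiles(tiles, compute_latency):
    """tiles: list of (dma_arrival_cycle, tile_id). Returns finish cycles."""
    events = list(tiles)
    heapq.heapify(events)
    core_free_at = 0          # cycle at which the core becomes idle
    finish = {}
    while events:
        arrival, tid = heapq.heappop(events)
        start = max(arrival, core_free_at)   # wait for the tile DMA and the core
        core_free_at = start + compute_latency
        finish[tid] = core_free_at
    return finish

print(simulate_tiles([(0, "t0"), (5, "t1"), (6, "t2")], compute_latency=10))
# t0 finishes at 10, t1 at 20, t2 at 30 -- no per-cycle loop was needed.
```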
Citations: 0
A Flexible Hybrid Interconnection Design for High-Performance and Energy-Efficient Chiplet-Based Systems
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-10-09 | DOI: 10.1109/LCA.2024.3477253 | Vol. 23(2), pp. 215–218
Md Tareq Mahmud;Ke Wang
Chiplet-based multi-die integration has prevailed in modern computing system designs as it provides an agile solution for improving processing power with reduced manufacturing costs. In chiplet-based implementations, complete electronic systems are created by integrating individual hardware components through interconnection networks that consist of intra-chiplet network-on-chips (NoCs) and an inter-chiplet silicon interposer. Unfortunately, current interconnection designs have become the limiting factor in further scaling performance and energy efficiency. Specifically, inter-chiplet communication through silicon interposers is expensive due to limited throughput. Existing wired Network-on-Chip (NoC) designs are ill-suited to multicast and broadcast communication: limited bandwidth, high hop counts, and limited hardware resources lead to high overhead, latency, and power consumption. Wireless components, on the other hand, can help with multicast/broadcast communication, but their high setup latency makes them unsuitable for one-to-one communication. In this paper, we propose a hybrid interconnection design for high-performance and low-power communication in chiplet-based systems. The proposed design consists of both wired and wireless interconnects that can adapt to diverse communication patterns and requirements. A dynamic control policy is proposed to maximize performance and minimize power consumption by allocating traffic to wireless or wired hardware components based on the communication pattern. Evaluation results show that the proposed hybrid design achieves 8% to 46% lower average end-to-end delay and 0.93 to 2.7× energy savings over existing designs with minimized overhead.
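A toy Python sketch of the kind of dynamic policy the abstract describes — steering each message to wired or wireless hardware based on its fan-out. The cost constants and function names are illustrative assumptions, not values from the paper:

```python
# Illustrative traffic-steering policy, not the paper's actual controller:
# one-to-one messages take the wired NoC (low per-message latency), while
# wide multicasts amortize the wireless setup cost across many receivers.

WIRELESS_SETUP_CYCLES = 20   # assumed one-time cost to acquire the channel
WIRED_HOP_CYCLES = 2         # assumed per-hop wired latency

def choose_medium(num_destinations, avg_hops):
    wired_cost = num_destinations * avg_hops * WIRED_HOP_CYCLES
    wireless_cost = WIRELESS_SETUP_CYCLES  # one broadcast reaches everyone
    return "wireless" if wireless_cost < wired_cost else "wired"

print(choose_medium(num_destinations=1, avg_hops=4))   # -> wired
print(choose_medium(num_destinations=8, avg_hops=4))   # -> wireless
```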
Citations: 0
GCStack: A GPU Cycle Accounting Mechanism for Providing Accurate Insight Into GPU Performance
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-10-09 | DOI: 10.1109/LCA.2024.3476909 | Vol. 23(2), pp. 235–238
Hanna Cha;Sungchul Lee;Yeonan Ha;Hanhwi Jang;Joonsung Kim;Youngsok Kim
Cycles Per Instruction (CPI) stacks help computer architects gain insight into the performance of their target architectures and applications. To bring the benefits of CPI stacks to Graphics Processing Units (GPUs), prior studies have proposed GPU cycle accounting mechanisms that can identify the stall cycles and their stall events on GPU architectures. Unfortunately, these prior studies cannot provide accurate insight into GPU performance due to their coarse-grained, priority-driven, and issue-centric cycle accounting mechanisms. In this letter, we present GCStack, a fine-grained GPU cycle accounting mechanism that constructs accurate CPI stacks and accurately identifies primary GPU performance bottlenecks. GCStack first exposes all the stall events of the outstanding warps of a warp scheduler, most of which are hidden by existing mechanisms. Then, GCStack defers the classification of structural stalls, which existing mechanisms cannot correctly identify with their issue-stage-centric stall classification, to the later stages of the GPU pipeline. We implement GCStack on Accel-Sim and show that GCStack provides more accurate CPI stacks and GPU performance insight than GSI, the state-of-the-art GPU cycle accounting mechanism, whose primary focus is on characterizing memory-related stalls.
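To make the idea of a CPI stack concrete, here is a small Python sketch that buckets per-cycle stall events into a stack. It is a generic illustration of cycle accounting, not GCStack's mechanism:

```python
from collections import Counter

# Generic CPI-stack sketch (not GCStack itself): each cycle either issues an
# instruction or is charged to one stall bucket; the normalized buckets form
# the CPI stack, which sums to the overall CPI.

def build_cpi_stack(cycle_events, instructions):
    """cycle_events: one label per cycle, e.g. 'issue', 'memory', 'structural'."""
    buckets = Counter(cycle_events)
    total_cycles = len(cycle_events)
    cpi = total_cycles / instructions
    return {label: cpi * count / total_cycles for label, count in buckets.items()}

trace = ["issue", "memory", "memory", "issue", "structural", "issue"]
print(build_cpi_stack(trace, instructions=3))
# {'issue': 1.0, 'memory': 0.667, 'structural': 0.333} -- components sum to CPI = 2.0
```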
Citations: 0
Characterization and Analysis of Text-to-Image Diffusion Models
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/LCA.2024.3466118 | Vol. 23(2), pp. 227–230
Eunyeong Cho;Jehyeon Bang;Minsoo Rhu
Diffusion models have rapidly emerged as a prominent AI model for image generation. Despite their importance, however, little has been understood within the computer architecture community regarding this emerging AI algorithm. We conduct a workload characterization of the inference process of diffusion models using Stable Diffusion. Our characterization uncovers several critical performance bottlenecks of diffusion models, whose computational overhead grows as image size increases. We also discuss several performance optimization opportunities that leverage approximation and sparsity, which help alleviate the diffusion model's computational complexity. These findings highlight the need for domain-specific hardware that reaps the benefits of our proposal, paving the way for accelerated image generation.
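The scaling claim can be made concrete with quick back-of-the-envelope arithmetic in Python: self-attention cost grows quadratically with the number of image tokens, so it rises steeply with resolution. The patch size and model width below are assumed for illustration, not figures from the paper:

```python
# Back-of-the-envelope: attention FLOPs vs. image size (assumed parameters,
# not measurements from the paper). Tokens grow with image area, and attention
# cost grows with tokens squared.

def attention_flops(image_px, patch_px=16, dim=512):
    tokens = (image_px // patch_px) ** 2
    return 2 * tokens * tokens * dim       # QK^T plus attention-times-V

for size in (256, 512, 1024):
    print(f"{size}px: {attention_flops(size) / 1e9:.1f} GFLOPs per attention layer")
# 256px -> 0.1, 512px -> 1.1, 1024px -> 17.2 (quadratic in token count)
```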
Citations: 0
Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-03 | DOI: 10.1109/LCA.2024.3454490 | Vol. 23(2), pp. 195–198
Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath
Lattice-based cryptography offers a promising alternative to traditional cryptographic schemes due to its resistance against quantum attacks. Discrete Gaussian sampling plays a crucial role in lattice-based cryptographic algorithms such as Ring Learning with Errors (R-LWE), where it generates the coefficients of the polynomials. The Knuth Yao sampler is a widely used discrete Gaussian sampling technique in lattice-based cryptography. On the other hand, lattice-based cryptography involves resource-intensive, complex computation. Due to their inherent parallelism and field programmability, Field Programmable Gate Array (FPGA)-based reconfigurable hardware is a good platform for implementing lattice-based cryptographic algorithms. In this work, an efficient implementation of the Knuth Yao sampler on reconfigurable hardware is proposed that not only reduces resource utilization but also speeds up the sampling operation. The proposed method reduces look-up table (LUT) requirements by almost 29% and improves speed by almost 17× compared to the method proposed in (Sinha Roy et al., 2014).
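For readers unfamiliar with the sampler itself, here is a short Python sketch of the classic Knuth-Yao random walk over the bit matrix of the target probabilities. This illustrates the textbook algorithm in software, not the paper's FPGA design:

```python
import random

# Toy Knuth-Yao sampler (illustrative Python, not the paper's hardware).
# prob_bits[r][j] is the j-th binary digit (weight 2^-(j+1)) of P(outcome r).

def knuth_yao_sample(prob_bits, rng=random.random):
    num_rows = len(prob_bits)
    num_cols = len(prob_bits[0])
    while True:                     # restart if probability truncation loses the tail
        d = 0                       # distance to the rightmost internal tree node
        for col in range(num_cols):
            d = 2 * d + (1 if rng() < 0.5 else 0)   # one random bit per tree level
            for row in range(num_rows - 1, -1, -1):
                d -= prob_bits[row][col]            # peel off terminal nodes
                if d == -1:
                    return row

# P(0) = 3/4 = 0.11 in binary, P(1) = 1/4 = 0.01 in binary
bits = [[1, 1], [0, 1]]
counts = [0, 0]
for _ in range(10000):
    counts[knuth_yao_sample(bits)] += 1
print(counts)   # roughly [7500, 2500]
```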
Citations: 0
SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-02 | DOI: 10.1109/LCA.2024.3452699 | Vol. 23(2), pp. 199–202
Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang
Recent studies have revealed that, during inference on generative AI models such as transformers, the importance of different weights exhibits substantial context-dependent variation. This suggests a promising potential for adaptively configuring weight quantization to improve generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support for variable-precision arithmetic in modern GPUs and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to let CXL memory controllers play an active role in supporting and exploiting runtime-configurable weight quantization. Using a transformer as a representative generative AI model, we carried out experiments that demonstrate the effectiveness of the proposed design solution.
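A rough Python sketch of the underlying idea — picking a per-group bit width from context-dependent importance before fetching weights, so fewer bytes move for the unimportant majority. The thresholds, bit widths, and function names are illustrative assumptions, not SmartQuant's controller logic:

```python
import numpy as np

# Illustrative runtime-configurable quantization (not SmartQuant's design):
# weight groups judged more important in the current context are fetched at
# higher precision; the rest are fetched at lower precision.

def quantize(weights, bits):
    """Symmetric uniform quantization of a weight group to `bits` bits."""
    scale = np.max(np.abs(weights)) / (2 ** (bits - 1) - 1)
    q = np.round(weights / scale).astype(np.int32)
    return q * scale  # dequantized values the compute engine would see

def fetch_weights(groups, importance, hi_bits=8, lo_bits=4, threshold=0.5):
    """Pick a bit width per weight group from its context-dependent importance."""
    out, bytes_moved = [], 0
    for w, score in zip(groups, importance):
        bits = hi_bits if score > threshold else lo_bits
        out.append(quantize(w, bits))
        bytes_moved += len(w) * bits // 8
    return out, bytes_moved

groups = [np.random.randn(64) for _ in range(4)]
_, traffic = fetch_weights(groups, importance=[0.9, 0.2, 0.1, 0.7])
print(f"bytes moved: {traffic}")   # 2 groups at 8 bits + 2 at 4 bits = 192 bytes
```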
Citations: 0
Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-28 | DOI: 10.1109/LCA.2024.3445948 | Vol. 23(2), pp. 203–206
Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim
Deep learning recommendation model (DLRM) is an important class of deep learning networks that are commonly used in many applications. DLRM presents unique challenges, especially for scale-out training: it not only has compute- and memory-intensive components, but the communication between multiple GPUs is also on the critical path. In this work, we show how cold data in DLRM embedding tables can be exploited to enable proactive embedding. In particular, proactive embedding performs embedding table accesses in advance, reducing the impact of memory access latency by overlapping the embedding access with communication. Our analysis of proactive embedding demonstrates that it can improve overall training performance by 46%.
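A small Python sketch of the overlap pattern — issuing the next batch's embedding lookup while the current batch's communication is in flight. The helper names and sleep-based stand-ins are inventions for this sketch, not the paper's system:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Sketch of proactive embedding (invented helpers, not the paper's code):
# the next batch's embedding-table read is issued early so its latency hides
# behind the current batch's inter-GPU communication.

def embedding_lookup(batch):          # stand-in for a slow embedding-table read
    time.sleep(0.05)
    return f"embeddings({batch})"

def all_to_all_exchange(batch):       # stand-in for inter-GPU communication
    time.sleep(0.05)
    return f"exchanged({batch})"

def train(batches):
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(embedding_lookup, batches[0])
        for i, batch in enumerate(batches):
            emb = pending.result()                     # ready, or nearly so
            if i + 1 < len(batches):                   # start the next lookup early
                pending = prefetcher.submit(embedding_lookup, batches[i + 1])
            all_to_all_exchange(batch)                 # communication hides the lookup
            print(f"step {i}: trained with {emb}")

train(["b0", "b1", "b2"])
```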
Citations: 0
Octopus: A Cycle-Accurate Cache System Simulator
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-12 | DOI: 10.1109/LCA.2024.3441941 | Vol. 23(2), pp. 191–194
Mohamed Hossam;Salah Hessien;Mohamed Hassan
This paper introduces Octopus, an open-source cycle-accurate cache system simulator with flexible interconnect models. Octopus meticulously simulates various cache system and interconnect components, including controllers, data arrays, coherence protocols, and arbiters. Being cycle-accurate enables Octopus to precisely model the behavior of target systems while monitoring every memory request cycle by cycle. The design approach of Octopus distinguishes it from existing cache memory simulators: it does not enforce a fixed memory system architecture but instead offers flexibility in configuring component connections and parameters, enabling simulation of diverse memory architectures. Moreover, the simulator provides two modes of operation, standalone and full-system simulation, attaining the best of both worlds: fast simulation and high accuracy.
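To illustrate what cycle-accurate means here, a minimal Python sketch of a per-cycle tick loop over a pluggable component — a generic simulation pattern, not Octopus's architecture:

```python
# Generic cycle-accurate tick loop (an illustrative pattern, not Octopus's code):
# every component advances one cycle at a time, so each memory request can be
# observed cycle by cycle as it moves through the system.

class Latch:
    """Fixed-delay channel between two components, e.g. an interconnect hop."""
    def __init__(self, delay):
        self.delay = delay
        self.in_flight = []                 # (ready_cycle, request) pairs

    def push(self, req, now):
        self.in_flight.append((now + self.delay, req))

    def pop_ready(self, now):
        ready = [r for t, r in self.in_flight if t <= now]
        self.in_flight = [(t, r) for t, r in self.in_flight if t > now]
        return ready

bus = Latch(delay=3)                        # a 3-cycle hop, as an example
bus.push("load A", now=0)                   # scripted request injections
bus.push("load B", now=2)
for cycle in range(8):                      # the per-cycle tick loop
    for req in bus.pop_ready(cycle):
        print(f"cycle {cycle}: {req} reaches the cache controller")
# load A arrives at cycle 3, load B at cycle 5
```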
Citations: 0