
Latest Publications in IEEE Computer Architecture Letters

Inter-Temperature Bandwidth Reduction in Cryogenic QAOA Machines
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-10-09 | DOI: 10.1109/LCA.2023.3322700
Yosuke Ueno;Yuna Tomida;Teruo Tanimoto;Masamitsu Tanaka;Yutaka Tabuchi;Koji Inoue;Hiroshi Nakamura
The bandwidth limit between cryogenic and room-temperature environments is a critical bottleneck in superconducting noisy intermediate-scale quantum computers. This paper presents the first trial of algorithm-aware system-level optimization to solve this issue by targeting the quantum approximate optimization algorithm. Our counter-based cryogenic architecture using single-flux quantum logic shows exponential bandwidth reduction and decreases heat inflow and peripheral power consumption of inter-temperature cables, which contributes to the scalability of superconducting quantum computers.
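One intuition for why on-chip counters shrink inter-temperature readout traffic (a back-of-the-envelope sketch with assumed parameters, not the paper's exact readout scheme): if the cryogenic side accumulates measurement statistics in SFQ counters, only the final totals cross the temperature boundary, so traffic grows with log2 of the shot count instead of linearly with it.

# Illustrative sketch (assumed parameters, not taken from the paper):
# inter-temperature readout traffic for per-shot bitstrings vs. cryogenic counters.
import math

def per_shot_bits(n_qubits: int, n_shots: int) -> int:
    # Room temperature receives one n_qubits-bit string per shot.
    return n_qubits * n_shots

def counter_bits(n_counters: int, n_shots: int) -> int:
    # Counters accumulate inside the cryostat; only the totals are sent out,
    # each wide enough to hold the shot count.
    width = math.ceil(math.log2(n_shots + 1))
    return n_counters * width

n_qubits, n_shots = 64, 100_000
naive = per_shot_bits(n_qubits, n_shots)
counted = counter_bits(n_qubits, n_shots)  # e.g., one counter per qubit (assumption)
print(f"per-shot readout: {naive} bits, counter readout: {counted} bits "
      f"({naive / counted:.0f}x less)")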
Citations: 0
NoHammer: Preventing Row Hammer With Last-Level Cache Management
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-29 | DOI: 10.1109/LCA.2023.3320670
Seunghak Lee;Ki-Dong Kang;Gyeongseo Park;Nam Sung Kim;Daehoon Kim
Row Hammer (RH) is a circuit-level phenomenon in which repetitive activation of a DRAM row causes bit-flips in adjacent rows. Prior studies that rely on extra refreshes to mitigate the RH vulnerability demonstrate that bit-flips can be prevented effectively. However, implementation is challenging because of the significant performance degradation and energy overhead caused by the extra refreshes issued for RH mitigation. To overcome these challenges, some studies propose techniques that mitigate the RH attack without relying on extra refreshes, either by delaying the activation of an aggressor row for a certain amount of time or by swapping an aggressor row with another row to isolate it from victim rows. Although such techniques do not require extra refreshes, the activation-delaying technique can cause severe performance degradation in false-positive cases, and the swapping technique requires a high storage overhead to track swap information. We propose NoHammer, an efficient RH mitigation technique that prevents the bit-flips caused by the RH attack through Last-Level Cache (LLC) management. NoHammer temporarily extends the associativity of the targeted cache set by utilizing another cache set as an extended set, and keeps the cache lines of aggressor rows in the extended set under an eviction-based RH attack. Together with a modified LLC replacement policy, NoHammer ensures that the aggressor row's cache lines are not evicted from the LLC under the RH attack. In our evaluation, we demonstrate that NoHammer delivers 6% higher performance than a baseline without any RH mitigation technique by turning the excessive cache misses caused by the RH attack into LLC hits through sophisticated LLC management, while requiring 45% less storage than prior proposals.
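A toy model of the mechanism (our reading of the abstract, with made-up cache parameters): when a set comes under an eviction-based attack, borrowing ways from a partner set lets the aggressor row's lines stay resident, so the attacker's accesses hit in the LLC instead of repeatedly activating the DRAM row.

# Toy model of the extended-set idea (our reading of the abstract; parameters
# are made up). An attacker thrashes one 16-way set with 17 distinct lines so
# that every access misses and re-activates the aggressor's DRAM row; giving
# the set 16 borrowed ways from a partner set turns those misses into hits.
from collections import OrderedDict

class ToySet:
    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()          # LRU order: oldest entry first

    def access(self, tag) -> bool:
        """Return True on hit; on a miss, insert with LRU eviction."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = True
        return False

def dram_activations(cache: ToySet, rounds: int = 1000, footprint: int = 17) -> int:
    acts = 0
    for _ in range(rounds):
        for tag in range(footprint):
            if not cache.access(tag):
                acts += 1                   # LLC miss -> DRAM row activation
    return acts

print("16-way set under attack    :", dram_activations(ToySet(ways=16)))
print("same set + 16 borrowed ways:", dram_activations(ToySet(ways=32)))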
Citations: 0
Hungarian Qubit Assignment for Optimized Mapping of Quantum Circuits on Multi-Core Architectures
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-25 | DOI: 10.1109/LCA.2023.3318857
Pau Escofet;Anabel Ovide;Carmen G. Almudever;Eduard Alarcón;Sergi Abadal
Modular quantum computing architectures offer a promising alternative to monolithic designs for overcoming the scaling limitations of current quantum computers. To achieve scalability beyond small prototypes, quantum architectures are expected to adopt a modular approach, featuring clusters of tightly connected quantum bits with sparser connections between these clusters. Efficiently distributing qubits across multiple processing cores is critical for improving quantum computing systems’ performance and scalability. To address this challenge, we propose the Hungarian Qubit Assignment (HQA) algorithm, which leverages the Hungarian algorithm to improve qubit-to-core assignment. The HQA algorithm considers the interactions between qubits over the entire circuit, enabling fine-grained partitioning and enhanced qubit utilization. We compare the HQA algorithm with state-of-the-art alternatives through comprehensive experiments using both real-world quantum algorithms and random quantum circuits. The results demonstrate the superiority of our proposed approach, outperforming existing methods, with an average improvement of 1.28×.
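The Hungarian algorithm is a standard solver for the linear assignment problem, so the core step can be illustrated in a few lines (the cost matrix below is hypothetical; HQA derives its costs from the qubit interactions across the whole circuit rather than from fixed numbers like these):

# Minimal sketch of assigning qubit groups to cores with the Hungarian
# algorithm. The 3x3 cost matrix is hypothetical, not the paper's cost model.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: penalty of mapping qubit group i onto core j,
# e.g. the inter-core communication that placement would cause.
cost = np.array([
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
])

groups, cores = linear_sum_assignment(cost)     # minimizes the summed cost
for g, c in zip(groups, cores):
    print(f"qubit group {g} -> core {c} (cost {cost[g, c]})")
print("total assignment cost:", cost[groups, cores].sum())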
Citations: 1
Hardware-Assisted Code-Pointer Tagging for Forward-Edge Control-Flow Integrity
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-22 | DOI: 10.1109/LCA.2023.3306326
Yonghae Kim;Anurag Kar;Jaewon Lee;Jaekyu Lee;Hyesoon Kim
Software attacks typically operate by overwriting control data, such as a return address or a function pointer, and hijacking the control flow of a program. To prevent such attacks, a number of control-flow integrity (CFI) solutions have been proposed. Nevertheless, most prior work struggles to serve two ends at once: performance and security. In particular, protecting forward edges, i.e., indirect calls, remains challenging without trading one off for the other. In this work, we propose Code-Pointer Tagging (CPT), a novel dynamic CFI solution combined with cryptographic protection. Our key observation is that a pointer's message authentication code (MAC) can be associated with the pointer's CFI label used for CFI checks. We find that such an approach not only enables space-efficient control-flow graph (CFG) storage but also achieves highly efficient CFI checks performed alongside implicit pointer authentication. To enable CPT, we implement lightweight compiler and hardware support. We prototype our design on an FPGA-accelerated RISC-V hardware simulation platform and conduct full-system-level evaluations. Our results show that CPT incurs a 1.2% average slowdown on the SPEC CPU C/C++ benchmarks while providing effective layered hardening for forward-edge CFI.
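A software analogue of the key observation (illustrative only; CPT computes and checks the tag in hardware, in the spirit of pointer authentication): the MAC packed into a code pointer's unused upper bits is keyed on both the pointer and its CFI label, so authenticating the pointer at an indirect call simultaneously enforces which call-site class may use it. The key, tag width, and bit layout below are assumptions for the sketch.

# Software analogue of code-pointer tagging (illustrative sketch only; the
# real design computes the MAC in hardware and stores it in spare pointer bits).
import hmac, hashlib

KEY = b"per-process secret key"
TAG_BITS = 16                                     # assume 16 unused upper bits

def tag(ptr: int, cfi_label: int) -> int:
    msg = ptr.to_bytes(8, "little") + cfi_label.to_bytes(4, "little")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little") & ((1 << TAG_BITS) - 1)

def sign(ptr: int, cfi_label: int) -> int:
    return ptr | (tag(ptr, cfi_label) << 48)      # pack MAC into bits 48..63

def authenticate(tagged_ptr: int, cfi_label: int) -> int:
    ptr = tagged_ptr & ((1 << 48) - 1)
    if (tagged_ptr >> 48) != tag(ptr, cfi_label):
        raise RuntimeError("CFI violation: bad code-pointer tag")
    return ptr

fn_ptr = 0x00007F1234567000
signed = sign(fn_ptr, cfi_label=7)                # when the pointer is created
print(hex(authenticate(signed, cfi_label=7)))     # legitimate indirect call
try:
    authenticate(signed, cfi_label=3)             # mismatched call-site class
except RuntimeError as err:
    print(err)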
Citations: 0
Fast Performance Prediction for Efficient Distributed DNN Training
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-18 | DOI: 10.1109/LCA.2023.3316452
Yugyoung Yun;Eunhyeok Park
Training large-scale DNN models requires parallel distributed training using hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the possible combinations of schemes become enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically unfeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, promoting efficient and fast searches for the ideal parallel plan, even when resources are limited. Significantly, this work is pioneering in explaining the expensive nature of searching for an optimal plan and addressing it using intuitive performance estimations based on real device evaluations. Our experiments demonstrate the effectiveness of the MPE, showing that it accelerates the optimization process by up to 126× (36.4× on average) over the existing state-of-the-art baseline, Alpa.
Citations: 0
Balancing Performance Against Cost and Sustainability in Multi-Chip-Module GPUs
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-08 | DOI: 10.1109/LCA.2023.3313203
Shiqing Zhang;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout
MCM-GPUs scale performance by integrating multiple chiplets within the same package. How to partition the aggregate compute resources across chiplets poses a fundamental trade-off between performance on one hand and cost and sustainability on the other. We propose the Performance Per Wafer (PPW) metric to explore this trade-off, and we find that while performance is maximized with a few large chiplets and cost and environmental footprint are minimized with many small chiplets, the optimum balance is achieved with a moderate number of medium-sized chiplets. The optimum number of chiplets depends on the workload and increases with increased inter-chiplet bandwidth.
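The abstract does not spell the metric out, but one plausible formalization (our reading, using the standard negative-binomial die-yield model, so the symbols below are assumptions rather than the paper's definitions) makes the trade-off visible: with N chiplets of area A each, the wafer area actually consumed per GPU is inflated by the per-chiplet yield Y(A), which drops as chiplets grow.

\mathrm{PPW} \;\approx\; \frac{\mathrm{Perf}(N,\, A,\, BW_{\mathrm{inter}})}{N \cdot A \,/\, Y(A)},
\qquad
Y(A) = \left(1 + \frac{A\, D_0}{\alpha}\right)^{-\alpha}

Here D_0 is the defect density and \alpha the defect-clustering parameter. A few large chiplets raise Perf but shrink Y(A); many small chiplets do the opposite; so PPW peaks at a moderate chiplet count, and that peak shifts toward larger N as inter-chiplet bandwidth improves.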
Citations: 0
LV: Latency-Versatile Floating-Point Engine for High-Performance Deep Neural Networks
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-08-25 | DOI: 10.1109/LCA.2023.3287096
Yun-Chen Lo;Yu-Chih Tsai;Ren-Shuo Liu
Computing latency is an important system metric for Deep Neural Network (DNN) accelerators. To reduce latency, this work proposes LV, a latency-versatile floating-point engine (FP-PE) with the following key contributions: 1) an approximate bit-versatile multiplier-and-accumulate (BV-MAC) unit with an early shifter and 2) an on-demand fixed-point-to-floating-point conversion (FXP2FP) unit. Extensive experimental results show that LV achieves up to 2.12× and 1.3× speedup over the baseline FP-PE and a redundancy-aware FP-PE, respectively, in TSMC 40-nm technology, while achieving comparable accuracy on ImageNet classification tasks.
Citations: 0
A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-08-16 | DOI: 10.1109/LCA.2023.3305668
Lingfei Lu;Yudi Qiu;Shiyan Yi;Yibo Fan
Personalized recommendation systems (RS) are widely used in industry and occupy much of the compute time in AI computing centers. A critical component of RS is the embedding layer, which consists of sparse embedding lookups and is memory-bound. Recent works have proposed near-memory processing (NMP) architectures that exploit the high internal memory bandwidth to speed up embedding lookups. These NMP works divide embedding vectors either horizontally or vertically. However, the effectiveness of horizontal or vertical partitioning is hard to guarantee across different memory configurations or embedding vector sizes. To address this issue, we propose FeaNMP, a flexible embedding-aware NMP architecture that accelerates the inference phase of RS. We explore different partitioning strategies in detail and design a flexible way to select the optimal one depending on the embedding dimensions and DDR configuration. As a result, compared to the state-of-the-art rank-level NMP work RecNMP, our work achieves up to 11.1× speedup for embedding layers under mixed-dimension workloads.
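One way to picture the horizontal/vertical distinction (our interpretation of the abstract; the 4-rank system and table dimensions below are made up): horizontal partitioning spreads whole embedding rows across ranks, so each lookup is served by a single rank, whereas vertical partitioning splits every vector's dimensions across ranks, so all ranks serve a slice of every lookup. Which of the two balances per-rank traffic better depends on the embedding dimension and the DDR configuration.

# Illustrative sketch of "horizontal vs. vertical" embedding partitioning
# (our reading of the abstract; rank count and dimensions are made up).
import numpy as np

NUM_RANKS = 4
table = np.random.rand(1024, 64)        # 1024 embedding entries, dim 64

# Horizontal: each rank holds a contiguous slice of rows (whole vectors).
horizontal = np.array_split(table, NUM_RANKS, axis=0)

# Vertical: each rank holds a slice of columns (a chunk of every vector).
vertical = np.array_split(table, NUM_RANKS, axis=1)

def lookup_horizontal(indices):
    # Each index is served entirely by the single rank that owns its row.
    rows_per_rank = table.shape[0] // NUM_RANKS
    return np.stack([horizontal[i // rows_per_rank][i % rows_per_rank]
                     for i in indices])

def lookup_vertical(indices):
    # Every rank contributes a partial vector for every index; the host
    # (or a reduction unit) concatenates the pieces.
    return np.concatenate([part[indices] for part in vertical], axis=1)

idx = [3, 900, 512, 17]
assert np.allclose(lookup_horizontal(idx), table[idx])
assert np.allclose(lookup_vertical(idx), table[idx])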
Citations: 0
Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-08-15 | DOI: 10.1109/LCA.2023.3305386
Jaewan Choi;Jaehyun Park;Kwanhee Kyung;Nam Sung Kim;Jung Ho Ahn
Transformer-based generative models, such as GPT, summarize an input sequence by generating key/value (KV) matrices through attention and generate the corresponding output sequence by utilizing these matrices once per token of the sequence. Both input and output sequences tend to get longer, which improves the understanding of contexts and conversation quality. These models are also typically batched for inference to improve the serving throughput. All these trends enable the models’ weights to be reused effectively, increasing the relative importance of sequence generation, especially in processing KV matrices through attention. We identify that the conventional computing platforms (e.g., GPUs) are not efficient at handling this attention part for inference because each request generates different KV matrices, it has a low operation per byte ratio regardless of the batch size, and the aggregate size of the KV matrices can even surpass that of the entire model weights. This motivates us to propose AttAcc, which exploits the fact that the KV matrices are written once during summarization but used many times (proportional to the output sequence length), each multiplied by the embedding vector corresponding to an output token. The volume of data entering/leaving AttAcc could be more than orders of magnitude smaller than what should be read internally for attention. We design AttAcc with multiple processing-in-memory devices, each multiplying the embedding vector with the portion of the KV matrices within the devices, saving external (inter-device) bandwidth and energy consumption.
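The low operations-per-byte claim is easy to check with a small calculation (illustrative FP16 numbers and a GPT-like head dimension, not figures from the paper): during generation, attention multiplies each request's own K/V matrices by a single vector, a GEMV, so each KV byte fetched supports roughly one multiply-accumulate, and batching does not raise the ratio the way it does for shared weight GEMMs.

# Arithmetic intensity of the attention-score GEMV in the generation phase.
# Illustrative numbers only (FP16 operands, head_dim = 128).
BYTES_PER_ELEM = 2            # FP16

def attention_gemv_intensity(seq_len: int, head_dim: int) -> float:
    flops = 2 * seq_len * head_dim                             # one MAC per K element
    bytes_moved = (seq_len * head_dim + head_dim + seq_len) * BYTES_PER_ELEM
    return flops / bytes_moved

for seq_len in (512, 2048, 8192):
    print(f"seq_len={seq_len:5d}: {attention_gemv_intensity(seq_len, 128):.2f} FLOP/byte")
# The ratio stays near 1 FLOP/byte regardless of sequence length, and batching
# does not help because every request carries its own K/V, unlike weight GEMMs
# whose intensity grows with batch size.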
Citations: 0
Characterizing and Understanding Defense Methods for GNNs on GPUs
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-08-15 | DOI: 10.1109/LCA.2023.3304638
Meng Wu;Mingyu Yan;Xiaocheng Yang;Wenming Li;Zhimin Zhang;Xiaochun Ye;Dongrui Fan
Graph neural networks (GNNs) are widely deployed in many vital fields, but suffer from adversarial attacks, which seriously compromise the security in these fields. Plenty of defense methods have been proposed to mitigate the impact of these attacks, however, they have introduced extra time-consuming stages into the execution of GNNs. These extra stages need to be accelerated because the end-to-end acceleration is essential for GNNs to achieve fast development and deployment. To disclose the performance bottlenecks, execution patterns, execution semantics, and overheads of the defense methods for GNNs, we characterize and explore these extra stages on GPUs. Given the characterization and exploration, we provide several useful guidelines for both software and hardware optimizations to accelerate the defense methods for GNNs.
Citations: 0