
Latest Publications from IEEE Computer Architecture Letters

SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-28 · DOI: 10.1109/LCA.2024.3406038
Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris
The increasing popularity of LLM-based chatbots combined with their reliance on power-hungry GPU infrastructure forms a critical challenge for providers: minimizing energy consumption under Service-Level Objectives (SLOs) that ensure optimal user experience. Traditional energy optimization methods fall short for LLM inference due to their autoregressive architecture, which renders them incapable of meeting a predefined SLO without energy overprovisioning. This autoregressive nature, however, allows for iteration-level adjustments, enabling continuous fine-tuning of the system throughout the inference process. In this letter, we propose a solution based on iteration-level GPU Dynamic Voltage Frequency Scaling (DVFS), aiming to reduce the energy impact of LLM serving, an approach that has the potential for more than 22.8% and up to 45.5% energy gains when tested in real-world situations under varying SLO constraints. Our approach works on top of existing LLM hosting services, requires minimal profiling, and needs no intervention in the inference serving system.
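The letter itself gives no code; purely as an illustration of the iteration-level idea, the Python sketch below picks, before every decoding iteration, the lowest GPU frequency whose profiled per-token latency still fits the remaining SLO budget. The frequency/latency table and the set_gpu_frequency() hook are invented placeholders, not the authors' profiling data or control interface.

```python
# Hypothetical sketch of iteration-level, SLO-aware DVFS for autoregressive decoding.
# The frequency/latency table and set_gpu_frequency() are illustrative placeholders,
# not the paper's profiling data or control interface.
import time

# Assumed profile: GPU core frequency (MHz) -> per-token decode latency (seconds).
LATENCY_AT_FREQ = {1980: 0.022, 1600: 0.027, 1200: 0.036, 900: 0.048}

def set_gpu_frequency(freq_mhz: int) -> None:
    """Placeholder for a driver call that locks GPU clocks; prints for illustration."""
    print(f"[dvfs] locking GPU clocks at {freq_mhz} MHz")

def pick_frequency(tokens_left: int, time_left_s: float) -> int:
    """Choose the lowest frequency that can still finish the remaining tokens in time."""
    budget_per_token = time_left_s / max(tokens_left, 1)
    feasible = [f for f, lat in LATENCY_AT_FREQ.items() if lat <= budget_per_token]
    # The lowest feasible frequency saves the most energy; fall back to the maximum.
    return min(feasible) if feasible else max(LATENCY_AT_FREQ)

def decode_with_slo(max_new_tokens: int, slo_s: float) -> None:
    deadline = time.monotonic() + slo_s
    for produced in range(max_new_tokens):
        tokens_left = max_new_tokens - produced
        freq = pick_frequency(tokens_left, deadline - time.monotonic())
        set_gpu_frequency(freq)
        time.sleep(LATENCY_AT_FREQ[freq])  # stand-in for one decode iteration

if __name__ == "__main__":
    decode_with_slo(max_new_tokens=32, slo_s=1.5)
```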
{"title":"SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving","authors":"Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris","doi":"10.1109/LCA.2024.3406038","DOIUrl":"10.1109/LCA.2024.3406038","url":null,"abstract":"The increasing popularity of LLM-based chatbots combined with their reliance on power-hungry GPU infrastructure forms a critical challenge for providers: minimizing energy consumption under Service-Level Objectives (SLOs) that ensure optimal user experience. Traditional energy optimization methods fall short for LLM inference due to their autoregressive architecture, which renders them incapable of meeting a predefined SLO without \u0000<italic>energy overprovisioning</i>\u0000. This autoregressive nature however, allows for iteration-level adjustments, enabling continuous fine-tuning of the system throughout the inference process. In this letter, we propose a solution based on iteration-level GPU Dynamic Voltage Frequency Scaling (DVFS), aiming to reduce the energy impact of LLM serving, an approach that has the potential for more than 22.8% and up to 45.5% energy gains when tested in real world situations under varying SLO constraints. Our approach works on top of existing LLM hosting services, requires minimal profiling and no intervention to the inference serving system.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"150-153"},"PeriodicalIF":1.4,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141188853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Dramaton: A Near-DRAM Accelerator for Large Number Theoretic Transforms
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-27 · DOI: 10.1109/LCA.2024.3381452
Yongmo Park;Subhankar Pal;Aporva Amarnath;Karthik Swaminathan;Wei D. Lu;Alper Buyuktosunoglu;Pradip Bose
With the rising popularity of post-quantum cryptographic schemes, realizing practical implementations for real-world applications remains a major challenge. A major bottleneck in such schemes is the fetching and processing of large polynomials in the Number Theoretic Transform (NTT), which makes non-von Neumann paradigms, such as near-memory processing, a viable option. We, therefore, propose a novel near-DRAM NTT accelerator design, called Dramaton. Additionally, we introduce a conflict-free mapping algorithm that enables Dramaton to process large NTTs with minimal hardware overhead using a fixed-permutation network. Dramaton achieves a 5-207× speedup in latency over the state-of-the-art and a 97× improvement in EDP over a recent near-memory NTT accelerator.
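For context on the kernel being accelerated, here is a plain iterative radix-2 NTT over Z_q (Cooley-Tukey butterflies) in Python; it only shows the strided butterfly accesses that make large NTTs memory-hungry and is unrelated to Dramaton's conflict-free mapping or fixed-permutation network. The small modulus and root of unity are chosen for readability.

```python
# Minimal iterative radix-2 number theoretic transform (NTT) over Z_q, shown only to
# illustrate the butterfly access pattern a near-DRAM accelerator must feed.

Q = 17          # small prime with Q - 1 divisible by N
N = 8           # transform size (power of two)
OMEGA = 9       # primitive N-th root of unity mod Q (9**8 % 17 == 1)

def bit_reverse(a):
    """In-place bit-reversal permutation of a list whose length is a power of two."""
    n, j = len(a), 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    return a

def ntt(a):
    a = bit_reverse(list(a))
    length = 2
    while length <= N:
        w_len = pow(OMEGA, N // length, Q)        # root of unity for this stage
        for start in range(0, N, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % Q
                a[k] = (u + v) % Q                # butterfly over a strided pair
                a[k + length // 2] = (u - v) % Q
                w = w * w_len % Q
        length <<= 1
    return a

if __name__ == "__main__":
    print(ntt([1, 2, 3, 4, 0, 0, 0, 0]))          # a[0] equals the sum of inputs mod Q
```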
{"title":"Dramaton: A Near-DRAM Accelerator for Large Number Theoretic Transforms","authors":"Yongmo Park;Subhankar Pal;Aporva Amarnath;Karthik Swaminathan;Wei D. Lu;Alper Buyuktosunoglu;Pradip Bose","doi":"10.1109/LCA.2024.3381452","DOIUrl":"10.1109/LCA.2024.3381452","url":null,"abstract":"With the rising popularity of post-quantum cryptographic schemes, realizing practical implementations for real-world applications is still a major challenge. A major bottleneck in such schemes is the fetching and processing of large polynomials in the Number Theoretic Transform (NTT), which makes non Von Neumann paradigms, such as near-memory processing, a viable option. We, therefore, propose a novel near-DRAM NTT accelerator design, called \u0000<sc>Dramaton</small>\u0000. Additionally, we introduce a conflict-free mapping algorithm that enables \u0000<sc>Dramaton</small>\u0000 to process large NTTs with minimal hardware overhead using a fixed-permutation network. \u0000<sc>Dramaton</small>\u0000 achieves 5–207× speedup in latency over the state-of-the-art and 97× improvement in EDP over a recent near-memory NTT accelerator.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"108-111"},"PeriodicalIF":2.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Characterizing Machine Learning-Based Runtime Prefetcher Selection
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-27 · DOI: 10.1109/LCA.2024.3404887
Erika S. Alcorta;Mahesh Madhav;Richard Afoakwa;Scott Tetrick;Neeraja J. Yadwadkar;Andreas Gerstlauer
Modern computer designs support composite prefetching, where multiple prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can sometimes hurt performance, especially in many-core systems where cache and other resources are limited. Recent work has proposed mitigating this issue by selectively enabling and disabling prefetcher components at runtime. Formulating the problem with machine learning (ML) methods is promising, but efficient and effective solutions in terms of cost and performance are not well understood. This work studies fundamental characteristics of the composite prefetcher selection problem through the lens of ML to inform future prefetcher selection designs. We show that prefetcher decisions do not have significant temporal dependencies, that a phase-based rather than sample-based definition of ground truth yields patterns that are easier to learn, and that prefetcher selection can be formulated as a workload-agnostic problem requiring little to no training at runtime.
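As a toy illustration of the phase-based ground-truth definition discussed above, the sketch below aggregates per-sample IPC readings for several hypothetical prefetcher configurations into fixed-size phases and labels each phase with the configuration that maximizes mean IPC. The configuration names and all numbers are invented, not the letter's data.

```python
# Toy phase-based ground-truth labeling for prefetcher selection: per-sample IPC
# values for each prefetcher configuration are grouped into fixed-size phases, and
# each phase is labeled with the configuration that maximizes mean IPC over it.
from statistics import mean

CONFIGS = ["all_off", "stride_only", "stride+spatial", "all_on"]

def label_phases(ipc_per_config, phase_len):
    """ipc_per_config: dict config -> list of per-sample IPC values (same length)."""
    n_samples = len(next(iter(ipc_per_config.values())))
    labels = []
    for start in range(0, n_samples, phase_len):
        # Best configuration for this phase = highest mean IPC across its samples.
        best = max(CONFIGS,
                   key=lambda c: mean(ipc_per_config[c][start:start + phase_len]))
        labels.append(best)
    return labels

if __name__ == "__main__":
    samples = {
        "all_off":        [1.0, 1.1, 0.9, 1.0, 1.5, 1.6, 1.5, 1.6],
        "stride_only":    [1.2, 1.3, 1.2, 1.1, 1.4, 1.4, 1.5, 1.5],
        "stride+spatial": [1.4, 1.5, 1.4, 1.3, 1.2, 1.1, 1.2, 1.1],
        "all_on":         [1.3, 1.4, 1.3, 1.2, 1.0, 0.9, 1.0, 1.0],
    }
    # One label per 4-sample phase; note the second phase prefers prefetching off.
    print(label_phases(samples, phase_len=4))
```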
{"title":"Characterizing Machine Learning-Based Runtime Prefetcher Selection","authors":"Erika S. Alcorta;Mahesh Madhav;Richard Afoakwa;Scott Tetrick;Neeraja J. Yadwadkar;Andreas Gerstlauer","doi":"10.1109/LCA.2024.3404887","DOIUrl":"10.1109/LCA.2024.3404887","url":null,"abstract":"Modern computer designs support composite prefetching, where multiple prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can sometimes hurt performance, especially in many-core systems where cache and other resources are limited. Recent work has proposed mitigating this issue by selectively enabling and disabling prefetcher components at runtime. Formulating the problem with machine learning (ML) methods is promising, but efficient and effective solutions in terms of cost and performance are not well understood. This work studies fundamental characteristics of the composite prefetcher selection problem through the lens of ML to inform future prefetcher selection designs. We show that prefetcher decisions do not have significant temporal dependencies, that a phase-based rather than sample-based definition of ground truth yields patterns that are easier to learn, and that prefetcher selection can be formulated as a workload-agnostic problem requiring little to no training at runtime.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"146-149"},"PeriodicalIF":1.4,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-24 · DOI: 10.1109/LCA.2024.3397747
Hyungyo Kim;Gaohan Ye;Nachuan Wang;Amir Yazdanbakhsh;Nam Sung Kim
The ever-increasing number of parameters in Large Language Models (LLMs) demands many expensive GPUs for both inference and training. This is because even a high-end GPU such as the NVIDIA A100 can store only a subset of the parameters due to its limited memory capacity. To reduce the number of required GPUs, especially for inference, we may exploit the large memory capacity of the (host) CPU to store not only all the model parameters but also intermediate outputs, which also require a substantial memory capacity. However, this necessitates frequent data transfers between CPU and GPU over the slow PCIe interface, creating a bottleneck that hinders the accomplishment of both low latency and high throughput in inference. To address such a challenge, we first propose CPU-GPU cooperative computing that exploits the Advanced Matrix Extensions (AMX) capability of the latest Intel CPU, codenamed Sapphire Rapids (SPR). Second, we propose an adaptive model partitioning policy that determines the layers of a given LLM to be run on CPU and GPU, respectively, based on their memory capacity requirement and arithmetic intensity. As the CPU executes the layers with large memory capacity requirements but low arithmetic intensity, the amount of data transferred through the PCIe interface is significantly reduced, thereby improving LLM inference performance. Our evaluation demonstrates that CPU-GPU cooperative computing, based on this policy, delivers 12.1× lower latency and 5.4× higher throughput than GPU-only computing for OPT-30B inference when both CPU-GPU and GPU-only computing store the model in CPU memory.
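A minimal sketch of an arithmetic-intensity-driven partitioning policy in the spirit described above: layers whose FLOPs-per-byte ratio falls below a threshold (and whose weights fit in host memory) go to the CPU, the rest to the GPU. The threshold and the toy layer statistics are assumptions, not the paper's actual policy or measured values.

```python
# Illustrative arithmetic-intensity-based layer partitioning (assumed numbers only).

def partition_layers(layers, intensity_threshold=50.0, cpu_mem_gb=512.0):
    """layers: list of dicts with 'name', 'flops', 'bytes_moved', 'weight_gb'."""
    placement, cpu_used = {}, 0.0
    for layer in layers:
        intensity = layer["flops"] / layer["bytes_moved"]   # FLOPs per byte
        fits_on_cpu = cpu_used + layer["weight_gb"] <= cpu_mem_gb
        if intensity < intensity_threshold and fits_on_cpu:
            placement[layer["name"]] = "cpu"                # memory-bound layer
            cpu_used += layer["weight_gb"]
        else:
            placement[layer["name"]] = "gpu"                # compute-bound layer
    return placement

if __name__ == "__main__":
    toy_llm = [
        {"name": "attention_kv",  "flops": 2e9,  "bytes_moved": 1e9, "weight_gb": 4.0},
        {"name": "attention_qkv", "flops": 8e10, "bytes_moved": 4e8, "weight_gb": 2.0},
        {"name": "ffn_up",        "flops": 3e11, "bytes_moved": 2e9, "weight_gb": 6.0},
        {"name": "lm_head",       "flops": 5e10, "bytes_moved": 2e9, "weight_gb": 1.5},
    ]
    print(partition_layers(toy_llm))   # low-intensity layers land on the CPU
```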
{"title":"Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference","authors":"Hyungyo Kim;Gaohan Ye;Nachuan Wang;Amir Yazdanbakhsh;Nam Sung Kim","doi":"10.1109/LCA.2024.3397747","DOIUrl":"10.1109/LCA.2024.3397747","url":null,"abstract":"The ever-increasing number of parameters in Large Language Models (LLMs) demands many expensive GPUs for both inference and training. This is because even such a high-end GPU such as NVIDIA A100 can store only a subset of parameters due to its limited memory capacity. To reduce the number of required GPUs, especially for inference, we may exploit the large memory capacity of (host) CPU to store not only all the model parameters but also intermediate outputs which also require a substantial memory capacity. However, this necessitates frequent data transfers between CPU and GPU over the slow PCIe interface, creating a bottleneck that hinders the accomplishment of both low latency and high throughput in inference. To address such a challenge, we first propose CPU-GPU cooperative computing that exploits the Advanced Matrix Extensions (AMX) capability of the latest Intel CPU, codenamed Sapphire Rapids (SPR). Second, we propose an adaptive model partitioning policy that determines the layers of a given LLM to be run on CPU and GPU, respectively, based on their memory capacity requirement and arithmetic intensity. As CPU executes the layers with large memory capacity but low arithmetic intensity, the amount of data transferred through the PCIe interface is significantly reduced, thereby improving the LLM inference performance. Our evaluation demonstrates that CPU-GPU cooperative computing, based on this policy, delivers 12.1× lower latency and 5.4× higher throughput than GPU-only computing for OPT-30B inference when both CPU-GPU and GPU-only computing store the model in CPU memory.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"117-120"},"PeriodicalIF":2.3,"publicationDate":"2024-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10538369","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141146488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-19 · DOI: 10.1109/LCA.2024.3379002
Mrinmay Sasmal;Tresa Joseph;Bindiya T. S.
This letter introduces an innovative approximate multiplier (AM) architecture that leverages stochastically generated bit streams through the Linear Feedback Shift Register (LFSR). The AM is applied to matrix-vector multiplication (MVM) in Neural Networks (NNs). The hardware implementations in 90 nm CMOS technology demonstrate superior power and area efficiency compared to state-of-the-art designs. Additionally, the study explores applying stochastic computing to LSTM NNs, showcasing improved energy efficiency and speed.
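The letter's contribution is a hardware architecture, but the underlying stochastic-computing principle can be sketched in a few lines: encode each operand as a bit stream whose density of 1s equals its value, generate the streams by comparing LFSR outputs against the operand, and AND the two streams so the output density approximates the product. The 8-bit tap polynomials, seeds, and stream length below are common textbook choices, not the letter's design parameters.

```python
# Textbook stochastic-computing multiplication with LFSR-generated bit streams.

def lfsr8(seed, taps):
    """Infinite generator of 8-bit pseudo-random states from a Fibonacci LFSR."""
    state = seed & 0xFF
    while True:
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & 0xFF

def to_stream(value, rng, length):
    """Comparator: emit 1 whenever the LFSR sample falls below value * 256."""
    threshold = int(value * 256)
    return [1 if next(rng) < threshold else 0 for _ in range(length)]

def stochastic_multiply(a, b, length=255):
    # Different seeds *and* tap polynomials keep the two streams decorrelated.
    sa = to_stream(a, lfsr8(seed=0xB5, taps=(7, 5, 4, 3)), length)
    sb = to_stream(b, lfsr8(seed=0x2D, taps=(7, 3, 2, 1)), length)
    ones = sum(x & y for x, y in zip(sa, sb))     # AND gate followed by a counter
    return ones / length

if __name__ == "__main__":
    print("0.50 * 0.75 ~", stochastic_multiply(0.50, 0.75))  # exact product is 0.375
```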
{"title":"Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI","authors":"Mrinmay Sasmal;Tresa Joseph;Bindiya T. S.","doi":"10.1109/LCA.2024.3379002","DOIUrl":"10.1109/LCA.2024.3379002","url":null,"abstract":"This letter introduces an innovative approximate multiplier (AM) architecture that leverages stochastically generated bit streams through the Linear Feedback Shift Register (LFSR). The AM is applied to matrix-vector multiplication (MVM) in Neural Networks (NNs). The hardware implementations in 90 nm CMOS technology demonstrate superior power and area efficiency compared to state-of-the-art designs. Additionally, the study explores applying stochastic computing to LSTM NNs, showcasing improved energy efficiency and speed.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"91-94"},"PeriodicalIF":2.3,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140169772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Hashing ATD Tags for Low-Overhead Safe Contention Monitoring
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-15 · DOI: 10.1109/LCA.2024.3401570
Pablo Andreu;Pedro Lopez;Carles Hernandez
Increasing the performance of safety-critical systems by introducing multicore processors is becoming the norm. However, when multiple cores access a shared cache, inter-core evictions become a relevant source of interference that must be appropriately controlled. To solve this issue, one can statically partition caches and remove the interference. Unfortunately, this comes at the expense of less flexibility and, in some cases, worse performance. In this context, enabling more flexible cache allocation policies requires additional monitoring support. This paper proposes HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG enables a low-overhead implementation of an Auxiliary Tag Directory to determine inter-core evictions. Our results show that HashTAG cannot underpredict inter-task interference while providing a 44% reduction in ATD area, with only 1.14% median overprediction.
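A conceptual sketch of the hashed auxiliary-tag-directory idea: storing a short hash of each tag shrinks the ATD, and a hash collision can only turn a "would have missed" into a "would have hit", so the measured inter-core interference is an upper bound (overprediction, never underprediction). The hash function, geometry, and LRU policy below are assumptions for illustration, not HashTAG's actual design.

```python
# Conceptual auxiliary tag directory (ATD) storing short hashed tags instead of
# full tags; interference = accesses that miss in the shared cache but hit in the
# (hashed) private-equivalent ATD, which collisions can only over-count.

class HashedATD:
    def __init__(self, num_sets=64, ways=8, hash_bits=6):
        self.num_sets, self.ways, self.mask = num_sets, ways, (1 << hash_bits) - 1
        self.sets = [[] for _ in range(num_sets)]   # per-set LRU list of hashed tags
        self.interference = 0

    def _hash(self, tag):
        return (tag ^ (tag >> 7) ^ (tag >> 13)) & self.mask   # cheap XOR-fold hash

    def access(self, addr, shared_cache_hit):
        line = addr >> 6                            # 64-byte cache lines
        idx, htag = line % self.num_sets, self._hash(line // self.num_sets)
        s = self.sets[idx]
        hit = htag in s
        if hit:
            s.remove(htag)                          # refresh LRU position
        elif len(s) >= self.ways:
            s.pop(0)                                # evict the LRU hashed tag
        s.append(htag)
        if hit and not shared_cache_hit:
            self.interference += 1                  # upper bound on inter-core evictions
        return hit

if __name__ == "__main__":
    atd = HashedATD()
    atd.access(0x1000, shared_cache_hit=True)
    atd.access(0x1000, shared_cache_hit=False)      # counted as interference
    print("interference events:", atd.interference)
```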
{"title":"Hashing ATD Tags for Low-Overhead Safe Contention Monitoring","authors":"Pablo Andreu;Pedro Lopez;Carles Hernandez","doi":"10.1109/LCA.2024.3401570","DOIUrl":"10.1109/LCA.2024.3401570","url":null,"abstract":"Increasing the performance of safety-critical systems via introducing multicore processors is becoming the norm. However, when multiple cores access a shared cache, inter-core evictions become a relevant source of interference that must be appropriately controlled. To solve this issue, one can statically partition caches and remove the interference. Unfortunately, this comes at the expense of less flexibility and, in some cases, worse performance. In this context, enabling more flexible cache allocation policies requires additional monitoring support. This paper proposes HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG enables a low-overhead implementation of an Auxiliary Tag Directory to determine inter-core evictions. Our results show that no inter-task interference underprediction is possible with HashTAG while providing a 44% reduction in ATD area with only 1.14% median overprediction.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"166-169"},"PeriodicalIF":1.4,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10530895","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Case for In-Memory Random Scatter-Gather for Fast Graph Processing
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-13 · DOI: 10.1109/LCA.2024.3376680
Changmin Shin;Taehee Kwon;Jaeyong Song;Jae Hyung Ju;Frank Liu;Yeonkyu Choi;Jinho Lee
Because of the widely recognized memory wall issue, modern DRAMs are increasingly being assigned innovative functionalities beyond the basic read and write operations. Often referred to as "function-in-memory", these techniques are crafted to leverage the abundant internal bandwidth available within the DRAM. However, these techniques face several challenges, including requiring large areas for arithmetic units and the necessity of splitting a single word into multiple pieces. These challenges severely limit the practical application of these function-in-memory techniques. In this paper, we present Piccolo, an efficient design of random scatter-gather memory. Our method achieves significant improvements with minimal overhead. By demonstrating our technique on a graph processing accelerator, we show that Piccolo and the proposed accelerator achieve a 1.2-3.1× speedup compared to the prior art.
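To make the targeted access pattern concrete, the sketch below runs one pull-style gather iteration (PageRank-like) over a tiny CSR graph; the indices[] lookups are the random, fine-grained reads that an in-memory scatter-gather mechanism such as Piccolo would serve from inside DRAM. The graph and damping factor are illustrative only.

```python
# The irregular gather pattern of graph processing: each vertex reads the values of
# its (randomly located) in-neighbors through a CSR index array.

def pagerank_gather(indptr, indices, ranks, out_degree, damping=0.85):
    """One pull iteration: gather rank/out_degree from every in-neighbor."""
    n = len(ranks)
    new_ranks = [0.0] * n
    for v in range(n):
        acc = 0.0
        for e in range(indptr[v], indptr[v + 1]):
            u = indices[e]            # random-looking index -> scattered DRAM reads
            acc += ranks[u] / out_degree[u]
        new_ranks[v] = (1.0 - damping) / n + damping * acc
    return new_ranks

if __name__ == "__main__":
    # 4-vertex graph in CSR form; indices[] lists the in-neighbors of each vertex.
    indptr = [0, 2, 3, 5, 6]
    indices = [1, 2, 0, 0, 3, 2]
    out_degree = [2, 1, 2, 1]
    ranks = [0.25] * 4
    print(pagerank_gather(indptr, indices, ranks, out_degree))
```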
{"title":"A Case for In-Memory Random Scatter-Gather for Fast Graph Processing","authors":"Changmin Shin;Taehee Kwon;Jaeyong Song;Jae Hyung Ju;Frank Liu;Yeonkyu Choi;Jinho Lee","doi":"10.1109/LCA.2024.3376680","DOIUrl":"10.1109/LCA.2024.3376680","url":null,"abstract":"Because of the widely recognized memory wall issue, modern DRAMs are increasingly being assigned innovative functionalities beyond the basic read and write operations. Often referred to as “function-in-memory”, these techniques are crafted to leverage the abundant internal bandwidth available within the DRAM. However, these techniques face several challenges, including requiring large areas for arithmetic units and the necessity of splitting a single word into multiple pieces. These challenges severely limit the practical application of these function-in-memory techniques. In this paper, we present Piccolo, an efficient design of random scatter-gather memory. Our method achieves significant improvements with minimal overhead. By demonstrating our technique on a graph processing accelerator, we show that Piccolo and the proposed accelerator achieves \u0000<inline-formula><tex-math>$1.2-3.1 times$</tex-math></inline-formula>\u0000 speedup compared to the prior art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"73-77"},"PeriodicalIF":2.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140124987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-07 · DOI: 10.1109/LCA.2024.3397730
Saurav Maji;Kyungmi Lee;Anantha P. Chandrakasan
This letter explores security vulnerabilities in sparsity-aware optimizations for Neural Network (NN) platforms, specifically focusing on timing side-channel attacks introduced by optimizations such as skipping sparse multiplications. We propose a classification prediction attack that utilizes this timing side-channel information to mimic the NN's prediction outcomes. Our techniques were demonstrated for CIFAR-10, MNIST, and biomedical classification tasks using diverse dataflows and processing loads in timing models. The demonstrated results could predict the original classification decision with high accuracy.
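A minimal illustration of why sparsity-aware skipping leaks: a zero-skipping dot product performs work proportional to the number of non-zero activations, so its latency correlates with activation density and, indirectly, with the input class. The weights and inputs below are invented, and multiplication counts stand in for measured wall-clock time.

```python
# Why skipping zero operands leaks timing: latency of a zero-skipping MAC loop
# depends on the data, so an attacker who can time the layer learns the
# activation density of the victim's input.

def sparse_dot(weights, activations):
    """Zero-skipping MAC; returns (result, number_of_multiplies_performed)."""
    acc, mults = 0.0, 0
    for w, a in zip(weights, activations):
        if a != 0.0:                 # the data-dependent branch that leaks timing
            acc += w * a
            mults += 1
    return acc, mults

if __name__ == "__main__":
    weights = [0.5, -1.2, 0.3, 0.8, -0.7, 0.1, 0.9, -0.4]
    dense_input  = [1.0, 0.2, -0.5, 0.7, 0.9, -0.3, 0.4, 0.6]
    sparse_input = [0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.4, 0.0]
    for name, x in [("dense", dense_input), ("sparse", sparse_input)]:
        _, work = sparse_dot(weights, x)
        print(f"{name}: {work} multiplies -> distinguishable latency")
```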
{"title":"SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information","authors":"Saurav Maji;Kyungmi Lee;Anantha P. Chandrakasan","doi":"10.1109/LCA.2024.3397730","DOIUrl":"10.1109/LCA.2024.3397730","url":null,"abstract":"This letter explores security vulnerabilities in sparsity-aware optimizations for Neural Network (NN) platforms, specifically focusing on timing side-channel attacks introduced by optimizations such as skipping sparse multiplications. We propose a classification prediction attack that utilizes this timing side-channel information to mimic the NN's prediction outcomes. Our techniques were demonstrated for CIFAR-10, MNIST, and biomedical classification tasks using diverse dataflows and processing loads in timing models. The demonstrated results could predict the original classification decision with high accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"133-136"},"PeriodicalIF":2.3,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140925787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-06 · DOI: 10.1109/LCA.2024.3373760
Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry
In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs, which can cause performance and correctness challenges. Using run-time monitoring tools is one approach to mitigating these challenges. However, these tools maintain metadata for every byte of application data they monitor, which precipitates performance overheads from additional metadata accesses. We propose Address Scaling, a new hardware framework that performs fine-grained metadata management to reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied granularities. Our key insight is to maintain the data and its corresponding metadata within the same cache line, to preserve locality. Address Scaling improves the performance of Memcheck, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns, respectively, compared to state-of-the-art systems that store the metadata in a memory region that is separate from the data.
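A toy arithmetic sketch of the co-location idea: if each 64-byte line is split into a data region and a metadata region (the 56+8 split and one-metadata-byte-per-seven-data-bytes ratio are assumptions, not the paper's layout), then scaling a logical data address into that layout guarantees the value and its metadata land in the same cache line.

```python
# Toy address-scaling arithmetic: co-locate each data byte with its metadata byte
# inside one 64-byte cache line, under an assumed 56-data / 8-metadata split.

LINE = 64
META_PER_LINE = 8
DATA_PER_LINE = LINE - META_PER_LINE    # 56 data bytes per line

def scale(logical_addr):
    """Map a logical data byte to its physical location and its metadata location."""
    line_no = logical_addr // DATA_PER_LINE
    offset = logical_addr % DATA_PER_LINE
    data_addr = line_no * LINE + offset
    meta_addr = line_no * LINE + DATA_PER_LINE + offset // 7  # 1 meta byte / 7 data bytes
    return data_addr, meta_addr

if __name__ == "__main__":
    for logical in (0, 55, 56, 1000):
        d, m = scale(logical)
        same_line = (d // LINE) == (m // LINE)
        print(f"logical {logical:4d} -> data @{d:5d}, meta @{m:5d}, same line: {same_line}")
```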
{"title":"Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management","authors":"Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry","doi":"10.1109/LCA.2024.3373760","DOIUrl":"10.1109/LCA.2024.3373760","url":null,"abstract":"In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs which can cause performance and correctness challenges. Using run-time monitoring tools is one approach to mitigate these challenges. However, these tools maintain metadata for every byte of application data they monitor, which precipitates performance overheads from additional metadata accesses. We propose \u0000<italic>Address Scaling</i>\u0000, a new hardware framework that performs fine-grained metadata management to reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied granularities. Our key insight is to maintain the data and its corresponding metadata within the same cache line, to preserve locality. \u0000<italic>Address Scaling</i>\u0000 improves the performance of \u0000<monospace>Memcheck</monospace>\u0000, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns respectively, compared to the state-of-the-art systems that store the metadata in a memory region that is separate from the data.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"69-72"},"PeriodicalIF":2.3,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140057254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Exploiting Direct Memory Operands in GPU Instructions
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-05 · DOI: 10.1109/LCA.2024.3371062
Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad
GPUs are widely used for diverse applications, particularly data-parallel tasks like machine learning and scientific computing. However, their efficiency is hindered by architectural limitations, inherited from historical RISC processors, in handling memory loads, which causes high register file contention. We observe that a significant number (around 26%) of the values present in the register file are typically used only once, contributing to more than 25% of the total register file bank conflicts, on average. This paper addresses the challenge of single-use memory values in the GPU register file (i.e., data values used only once), which waste space and increase latency. To this end, we introduce a novel mechanism inspired by CISC architectures. It replaces single-use loads with direct memory operands in arithmetic operations. Our approach improves performance by 20% and reduces energy consumption by 18%, on average, with negligible (<1%) hardware overhead.
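A toy version of the measurement behind the roughly 26% observation: walk an instruction trace and count, for each value written to a register, how many times it is read before the register is overwritten; values read exactly once are the candidates for folding their producing load into a direct memory operand. The mini-ISA and trace are invented.

```python
# Count how many register-file values are read exactly once before being
# overwritten (single-use values), using a made-up instruction trace.
from collections import defaultdict

def count_single_use(trace):
    reads = {}                         # reg -> read count of the value currently in it
    use_histogram = defaultdict(int)
    for dst, srcs in trace:            # each instruction: (destination reg, source regs)
        for s in srcs:
            if s in reads:
                reads[s] += 1
        if dst is not None:
            if dst in reads:
                use_histogram[reads[dst]] += 1   # the previous value's lifetime ends here
            reads[dst] = 0
    for count in reads.values():                 # flush values still live at trace end
        use_histogram[count] += 1
    return dict(use_histogram)

if __name__ == "__main__":
    trace = [
        ("r1", []),               # load r1 <- [a]
        ("r2", []),               # load r2 <- [b]
        ("r3", ["r1", "r2"]),     # r3 = r1 + r2   (r1 and r2 each used once)
        ("r1", []),               # r1 overwritten: its old value was single-use
        ("r4", ["r3", "r3"]),     # r3 used twice
        (None, ["r4"]),           # store [c] <- r4
    ]
    print("uses -> #values:", count_single_use(trace))
```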
{"title":"Exploiting Direct Memory Operands in GPU Instructions","authors":"Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad","doi":"10.1109/LCA.2024.3371062","DOIUrl":"10.1109/LCA.2024.3371062","url":null,"abstract":"GPUs are widely used for diverse applications, particularly data-parallel tasks like machine learning and scientific computing. However, their efficiency is hindered by architectural limitations, inherited from historical RISC processors, in handling memory loads causing high register file contention. We observe that a significant number (around 26%) of values present in the register file are typically used only once, contributing to more than 25% of the total register file bank conflicts, on average. This paper addresses the challenge of single-use memory values in the GPU register file (i.e. data values used only once) which wastes space and increases latency. To this end, we introduce a novel mechanism inspired by CISC architectures. It replaces single-use loads with direct memory operands in arithmetic operations. Our approach improves performance by 20% and reduces energy consumption by 18%, on average, with negligible (<1%) hardware overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"162-165"},"PeriodicalIF":1.4,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140047828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0