
Latest Publications in IEEE Computer Architecture Letters

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-24 | DOI: 10.1109/LCA.2024.3397747
Hyungyo Kim;Gaohan Ye;Nachuan Wang;Amir Yazdanbakhsh;Nam Sung Kim
The ever-increasing number of parameters in Large Language Models (LLMs) demands many expensive GPUs for both inference and training. This is because even a high-end GPU such as the NVIDIA A100 can store only a subset of the parameters due to its limited memory capacity. To reduce the number of required GPUs, especially for inference, we may exploit the large memory capacity of the (host) CPU to store not only all the model parameters but also the intermediate outputs, which also require substantial memory capacity. However, this necessitates frequent data transfers between CPU and GPU over the slow PCIe interface, creating a bottleneck that prevents inference from achieving both low latency and high throughput. To address this challenge, we first propose CPU-GPU cooperative computing that exploits the Advanced Matrix Extensions (AMX) capability of the latest Intel CPU, codenamed Sapphire Rapids (SPR). Second, we propose an adaptive model partitioning policy that determines which layers of a given LLM should run on the CPU and which on the GPU, based on their memory capacity requirements and arithmetic intensity. As the CPU executes the layers with large memory capacity requirements but low arithmetic intensity, the amount of data transferred over the PCIe interface is significantly reduced, thereby improving LLM inference performance. Our evaluation demonstrates that CPU-GPU cooperative computing, based on this policy, delivers 12.1× lower latency and 5.4× higher throughput than GPU-only computing for OPT-30B inference when both configurations store the model in CPU memory.
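The partitioning policy can be illustrated with a small sketch. This is a hypothetical model of the idea, not the paper's implementation: the threshold value and the per-layer FLOP/byte profiles below are invented for illustration.

```python
# Hypothetical sketch of an adaptive partitioning policy in the spirit of the
# paper: place each layer on CPU or GPU by its arithmetic intensity
# (FLOPs per byte moved). All numbers below are illustrative, not measured.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of data touched."""
    return flops / bytes_moved

def partition_layers(layers, intensity_threshold):
    """Memory-heavy, low-intensity layers go to the CPU; the rest to the GPU."""
    placement = {}
    for name, (flops, bytes_moved) in layers.items():
        ai = arithmetic_intensity(flops, bytes_moved)
        placement[name] = "CPU" if ai < intensity_threshold else "GPU"
    return placement

# Toy per-layer profiles (FLOPs, bytes): KV-cache attention is memory-bound
# at small batch sizes, while the MLP blocks are compute-dense.
layers = {
    "attention_kv": (2e9, 4e9),    # intensity 0.5 -> CPU
    "mlp_up":       (8e10, 2e8),   # intensity 400 -> GPU
    "mlp_down":     (8e10, 2e8),
}
print(partition_layers(layers, intensity_threshold=10.0))
```

Layers kept on the CPU stay next to the parameters in host memory, so only small activations cross PCIe, which is the mechanism behind the reported latency and throughput gains.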
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 117–120. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10538369. Citations: 0
Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-19 | DOI: 10.1109/LCA.2024.3379002
Mrinmay Sasmal;Tresa Joseph;Bindiya T. S.
This letter introduces an innovative approximate multiplier (AM) architecture that leverages bit streams generated stochastically by a Linear Feedback Shift Register (LFSR). The AM is applied to matrix-vector multiplication (MVM) in Neural Networks (NNs). Hardware implementations in 90 nm CMOS technology demonstrate superior power and area efficiency compared to state-of-the-art designs. Additionally, the study explores applying stochastic computing to LSTM NNs, showcasing improved energy efficiency and speed.
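The core idea of LFSR-driven stochastic multiplication can be sketched in a few lines. This is a software model only; the tap sets and stream length are standard textbook choices for maximal-length 8-bit LFSRs, not the letter's actual design.

```python
def lfsr_stream(seed, taps, n, width=8):
    """Software model of a Fibonacci LFSR; returns n successive states."""
    state, mask, out = seed, (1 << width) - 1, []
    for _ in range(n):
        out.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1           # XOR the tapped bits
        state = ((state << 1) | fb) & mask   # shift in the feedback bit
    return out

def stochastic_multiply(x, y, n=255):
    """Approximate x*y for x, y in [0,1]: compare two independent LFSR
    sequences against thresholds to get unipolar bit streams, then AND
    them. The fraction of 1s in the ANDed stream estimates the product."""
    sa = lfsr_stream(0b00000001, (7, 5, 4, 3), n)  # maximal-length taps 8,6,5,4
    sb = lfsr_stream(0b01001101, (7, 6, 5, 0), n)  # maximal-length taps 8,7,6,1
    bits = [(1 if a < x * 255 else 0) & (1 if b < y * 255 else 0)
            for a, b in zip(sa, sb)]
    return sum(bits) / n

print(stochastic_multiply(0.5, 0.5))  # roughly 0.25
```

The hardware appeal is that the "multiplier" is a single AND gate per bit; accuracy trades off against stream length, which is what makes the scheme attractive for edge inference.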
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 91–94. Citations: 0
Hashing ATD Tags for Low-Overhead Safe Contention Monitoring
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-15 | DOI: 10.1109/LCA.2024.3401570
Pablo Andreu;Pedro Lopez;Carles Hernandez
Increasing the performance of safety-critical systems by introducing multicore processors is becoming the norm. However, when multiple cores access a shared cache, inter-core evictions become a relevant source of interference that must be appropriately controlled. To solve this issue, one can statically partition the cache and remove the interference. Unfortunately, this comes at the expense of less flexibility and, in some cases, worse performance. In this context, enabling more flexible cache allocation policies requires additional monitoring support. This paper proposes HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG enables a low-overhead implementation of an Auxiliary Tag Directory (ATD) to determine inter-core evictions. Our results show that no inter-task interference underprediction is possible with HashTAG, while providing a 44% reduction in ATD area with only 1.14% median overprediction.
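The key property, that a truncated tag hash can only over-report contention and never miss a genuinely tracked line, can be modelled with a toy sketch. The set/way geometry, hash width, and LRU replacement below are assumptions for illustration, not HashTAG's actual design.

```python
class HashedATD:
    """Toy Auxiliary Tag Directory that stores a short hash instead of the
    full tag. A hash collision can make a miss look like a hit (a safe
    over-prediction of contention), but a line that really is tracked is
    never reported absent, so interference is never under-predicted."""

    def __init__(self, num_sets, ways, hash_bits=6):
        self.num_sets, self.ways = num_sets, ways
        self.mask = (1 << hash_bits) - 1
        self.sets = [[] for _ in range(num_sets)]

    def _hashed_tag(self, addr):
        tag = addr // self.num_sets
        return hash(tag) & self.mask          # truncated tag hash

    def access(self, addr):
        s = addr % self.num_sets
        h = self._hashed_tag(addr)
        hit = h in self.sets[s]
        if hit:
            self.sets[s].remove(h)            # LRU: move to the back
        elif len(self.sets[s]) >= self.ways:
            self.sets[s].pop(0)               # evict the oldest entry
        self.sets[s].append(h)
        return hit

atd = HashedATD(num_sets=4, ways=2)
print(atd.access(0x100), atd.access(0x100))  # cold miss, then hit
```

Storing a few hash bits instead of a full tag is where the reported 44% ATD area saving comes from; the cost is the small, bounded over-prediction rate.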
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 166–169. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10530895. Citations: 0
A Case for In-Memory Random Scatter-Gather for Fast Graph Processing
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-13 | DOI: 10.1109/LCA.2024.3376680
Changmin Shin;Taehee Kwon;Jaeyong Song;Jae Hyung Ju;Frank Liu;Yeonkyu Choi;Jinho Lee
Because of the widely recognized memory wall issue, modern DRAMs are increasingly being assigned innovative functionalities beyond the basic read and write operations. Often referred to as “function-in-memory”, these techniques are crafted to leverage the abundant internal bandwidth available within the DRAM. However, these techniques face several challenges, including the large areas required for arithmetic units and the necessity of splitting a single word into multiple pieces. These challenges severely limit the practical application of function-in-memory techniques. In this paper, we present Piccolo, an efficient design for random scatter-gather memory. Our method achieves significant improvements with minimal overhead. Demonstrating our technique on a graph processing accelerator, we show that Piccolo and the proposed accelerator achieve a 1.2–3.1× speedup compared to the prior art.
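The access pattern Piccolo targets is the random gather/scatter at the heart of graph workloads. The sketch below is a plain software emulation of those semantics on a tiny invented CSR graph, not the in-DRAM mechanism itself.

```python
def gather(array, indices):
    """Random gather: fetch array[i] for each index (pull-style reads)."""
    return [array[i] for i in indices]

def scatter_add(array, indices, values):
    """Random scatter with accumulation (push-style rank updates)."""
    for i, v in zip(indices, values):
        array[i] += v

# One push-style PageRank-like sweep over a 3-node CSR graph:
# edges 0->1, 0->2, 1->2, 2->0.
indptr, indices = [0, 2, 3, 4], [1, 2, 2, 0]
rank, out_deg = [1.0, 1.0, 1.0], [2, 1, 1]

new_rank = [0.0, 0.0, 0.0]
for v in range(3):
    neigh = gather(indices, range(indptr[v], indptr[v + 1]))  # edge slice
    scatter_add(new_rank, neigh, [rank[v] / out_deg[v]] * len(neigh))
print(new_rank)
```

Every `array[i]` above is a random DRAM access in a real system; performing the gather/scatter inside the memory device is what removes the external-bandwidth bottleneck.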
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 73–77. Citations: 0
SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-07 | DOI: 10.1109/LCA.2024.3397730
Saurav Maji;Kyungmi Lee;Anantha P. Chandrakasan
This letter explores security vulnerabilities in sparsity-aware optimizations for Neural Network (NN) platforms, specifically focusing on timing side channels introduced by optimizations such as skipping sparse multiplications. We propose a classification prediction attack that utilizes this timing side-channel information to mimic the NN's prediction outcomes. We demonstrated our techniques on CIFAR-10, MNIST, and biomedical classification tasks using diverse dataflows and processing loads in timing models; the attack predicts the original classification decision with high accuracy.
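The leak itself is easy to see in a software model: when a kernel skips zero operands, its operation count (a proxy for execution time) reveals the input's sparsity. The toy vectors below are illustrative, not from the letter.

```python
def sparse_dot(weights, activations):
    """Sparsity-aware dot product that skips zero activations. The number
    of multiplies performed, observable as execution time, leaks how many
    activations were non-zero."""
    acc, ops = 0.0, 0
    for w, a in zip(weights, activations):
        if a != 0:          # the optimization that opens the side channel
            acc += w * a
            ops += 1
    return acc, ops

# Two inputs of identical length but different sparsity are
# distinguishable purely by the op count (i.e., by timing).
_, ops_dense = sparse_dot([1.0] * 4, [3, 1, 2, 2])
_, ops_sparse = sparse_dot([1.0] * 4, [0, 1, 0, 2])
print(ops_dense, ops_sparse)
```

An attacker who can time many inferences can correlate these per-layer timings with class-dependent activation patterns, which is the basis of the proposed classification prediction attack.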
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 133–136. Citations: 0
Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-06 | DOI: 10.1109/LCA.2024.3373760
Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry
In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs, which can cause performance and correctness challenges. Using run-time monitoring tools is one approach to mitigating these challenges. However, these tools maintain metadata for every byte of application data they monitor, which incurs performance overheads from the additional metadata accesses. We propose Address Scaling, a new hardware framework that performs fine-grained metadata management to reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied granularities. Our key insight is to maintain the data and its corresponding metadata within the same cache line, to preserve locality. Address Scaling improves the performance of Memcheck, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns respectively, compared to state-of-the-art systems that store the metadata in a memory region separate from the data.
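The stated key insight, co-locating data and its metadata in one cache line, implies a simple address transformation. The sketch below assumes a 56B-data/8B-metadata split per 64B line; the actual ratio is tool-dependent and not taken from the paper.

```python
LINE = 64           # cache-line size in bytes
META = 8            # metadata bytes reserved per line (assumed ratio)
DATA = LINE - META  # data bytes that remain per line

def scale_address(data_offset):
    """Map a flat data offset into the interleaved layout, where every
    64B line carries 56B of data followed by 8B of metadata."""
    line, off = divmod(data_offset, DATA)
    return line * LINE + off

def metadata_address(data_offset):
    """The metadata for a byte lives in the tail of the *same* cache
    line, so one line fill brings in both (the locality being exploited).
    Here each metadata byte covers 7 data bytes (56/8)."""
    line, off = divmod(data_offset, DATA)
    return line * LINE + DATA + off // (DATA // META)

# Data byte 56 starts a new line; its metadata shares that line.
print(scale_address(56), metadata_address(56))
```

Because the mapping is a fixed arithmetic function of the address, it is cheap to compute in hardware on every access, which is what makes the scheme practical for always-on monitoring tools.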
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 69–72. Citations: 0
Exploiting Direct Memory Operands in GPU Instructions
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-05 | DOI: 10.1109/LCA.2024.3371062
Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad
GPUs are widely used for diverse applications, particularly data-parallel tasks like machine learning and scientific computing. However, their efficiency is hindered by an architectural limitation, inherited from historical RISC processors, in handling memory loads, which causes high register file contention. We observe that a significant number (around 26%) of the values present in the register file are typically used only once, contributing to more than 25% of the total register file bank conflicts on average. This paper addresses the challenge of single-use memory values in the GPU register file (i.e., data values used only once), which waste space and increase latency. To this end, we introduce a novel mechanism inspired by CISC architectures: it replaces single-use loads with direct memory operands in arithmetic operations. Our approach improves performance by 20% and reduces energy consumption by 18%, on average, with negligible (<1%) hardware overhead.
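The single-use observation can be reproduced in spirit with a small def-use analysis over an instruction trace. The trace format and the example instructions are invented for illustration.

```python
def single_use_defs(trace):
    """trace: list of (dest_reg, src_regs) tuples in program order.
    Returns the indices of instructions whose produced value is consumed
    exactly once: when the producer is a load, it is a candidate for
    replacement by a direct memory operand in its lone consumer."""
    live_def = {}   # reg name -> index of the live defining instruction
    uses = {}       # defining index -> number of consumptions
    for i, (dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in live_def:
                uses[live_def[s]] = uses.get(live_def[s], 0) + 1
        live_def[dest] = i   # this def shadows any earlier def of dest
        uses.setdefault(i, 0)
    return [i for i, c in uses.items() if c == 1]

# ld r1,[a]; ld r2,[b]; add r3,r1,r2; add r4,r3,r3
trace = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r4", ["r3", "r3"])]
print(single_use_defs(trace))  # the two loads: each value is read once
```

In the example, folding the two loads into the `add` as memory operands would free their register-file entries entirely, which is the source of the reported bank-conflict and energy reductions.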
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 162–165. Citations: 0
Achieving Forward Progress Guarantee in Small Hardware Transactions
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-28 | DOI: 10.1109/LCA.2024.3370992
Mahita Nagabhiru;Gregory T. Byrd
Hardware transactional memory (HTM) continues to pique interest from academia and industry alike because of its potential to ease concurrent programming without compromising performance. It offers a simple “all-or-nothing” idea to the programmer, making a piece of code appear atomic in hardware. Despite this, and despite many elegant HTM implementations in research, only best-effort HTM is available commercially. Best-effort HTM lacks a forward progress guarantee, making it harder for the programmer to create a concurrent, scalable fallback path; this has limited HTM's adoption. Aiming to support a myriad of applications, HTMs trade off design and verification complexity against forward progress guarantees. In this letter, we argue that limiting the scope of applications helps HTM attain guaranteed forward progress. We support lock-free programs by using HTM as multi-word atomics and demonstrate strategic design choices to achieve lock-freedom completely in hardware. We use lfbench, a lock-free micro-benchmark suite, and Arm's best-effort HTM (ARM_TME) on the gem5 simulator as our base. We demonstrate the performance tradeoffs among deferral-based, NACK-based, and NACK-with-backoff approaches, and show that NACK-with-backoff performs better than the others without compromising scalability for both read- and write-intensive applications.
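The NACK-with-backoff policy boils down to a retry loop with randomized exponential delays. The thread-level sketch below is a software analogy only: real HTM aborts are raised by hardware (here a `ConflictAbort` exception stands in for them), and the delay constants are invented.

```python
import random
import time

class ConflictAbort(Exception):
    """Stand-in for a hardware transaction abort caused by a conflict."""

def run_transaction(txn, max_attempts=8):
    """Retry an aborting transaction with randomized exponential backoff,
    reducing the chance that the same contending threads collide again."""
    for attempt in range(max_attempts):
        try:
            return txn()                     # commit path
        except ConflictAbort:
            # back off for a random, exponentially growing interval
            time.sleep(random.uniform(0, 0.0005 * (2 ** attempt)))
    raise RuntimeError("no forward progress after %d attempts" % max_attempts)

# A transaction that conflicts twice, then commits:
state = {"tries": 0}
def flaky_txn():
    state["tries"] += 1
    if state["tries"] < 3:
        raise ConflictAbort()
    return "committed"

print(run_transaction(flaky_txn), state["tries"])
```

Backoff alone still only makes progress probabilistic; the letter's point is that for small multi-word-atomic transactions, hardware NACKing plus backoff can be engineered into an actual lock-freedom guarantee.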
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 53–56. Citations: 0
FullPack: Full Vector Utilization for Sub-Byte Quantized Matrix-Vector Multiplication on General Purpose CPUs
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-27 | DOI: 10.1109/LCA.2024.3370402
Hossein Katebi;Navidreza Asadi;Maziar Goudarzi
Sub-byte quantization on popular vector ISAs wastes much of the vector as well as memory bandwidth. The latest methods pack a number of quantized values in one vector but have to pad them with empty bits to avoid overflow into neighbours. We remove even these empty bits and fully utilize the vector and memory bandwidth with our data-layout/compute co-design scheme. We implemented FullPack on TFLite for vector-matrix multiplication and showed up to 6.7× speedup, 2.75× on average, on single layers, which translated to a 1.56–2.11× end-to-end speedup on DeepSpeech.
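The storage side of the idea, two 4-bit values per byte with zero wasted bits, is easy to sketch. Note this shows only the dense layout and a scalar round-trip, not the paper's vectorized compute scheme that avoids guard-bit padding; the INT4 width is one instance of the sub-byte widths the letter covers.

```python
def pack4(vals):
    """Pack unsigned 4-bit values two per byte, low nibble first;
    no padding bits, so memory bandwidth is fully utilized."""
    out = bytearray()
    for i in range(0, len(vals), 2):
        lo = vals[i] & 0xF
        hi = (vals[i + 1] & 0xF) if i + 1 < len(vals) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack4(packed, n):
    """Recover the first n 4-bit values from the packed byte string."""
    vals = []
    for b in packed:
        vals.extend((b & 0xF, b >> 4))
    return vals[:n]

weights = [1, 2, 15, 7]
packed = pack4(weights)
print(len(packed), unpack4(packed, len(weights)))  # 2 bytes for 4 values
```

Earlier vector methods would store each 4-bit value in a wider lane with guard bits so lane-wise multiplies cannot overflow into a neighbour; FullPack's co-designed layout and compute eliminate that padding and hence the wasted bandwidth.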
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 142–145. Citations: 0
JANM-IK: Jacobian Argumented Nelder-Mead Algorithm for Inverse Kinematics and its Hardware Acceleration
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-26 | DOI: 10.1109/LCA.2024.3369940
Yuxin Yang;Xiaoming Chen;Yinhe Han
Inverse kinematics is one of the core calculations in robotic applications and has strong performance requirements. Previous hardware acceleration work paid little attention to joint constraints, which can lead to computational failures. We propose a new inverse kinematics algorithm JANM-IK. It uses a hardware-friendly design, optimizes the Jacobian-based method and Nelder-Mead method, realizes the processing of joint constraints, and has a high convergence speed. We further designed its acceleration architecture to achieve high-performance computing through sufficient parallelism and hardware optimization. Finally, after experimental verification, JANM-IK can achieve a very high success rate and obtain certain performance improvements.
{"title":"JANM-IK: Jacobian Argumented Nelder-Mead Algorithm for Inverse Kinematics and its Hardware Acceleration","authors":"Yuxin Yang;Xiaoming Chen;Yinhe Han","doi":"10.1109/LCA.2024.3369940","DOIUrl":"10.1109/LCA.2024.3369940","url":null,"abstract":"Inverse kinematics is one of the core calculations in robotic applications and has strong performance requirements. Previous hardware acceleration work paid little attention to joint constraints, which can lead to computational failures. We propose a new inverse kinematics algorithm JANM-IK. It uses a hardware-friendly design, optimizes the Jacobian-based method and Nelder-Mead method, realizes the processing of joint constraints, and has a high convergence speed. We further designed its acceleration architecture to achieve high-performance computing through sufficient parallelism and hardware optimization. Finally, after experimental verification, JANM-IK can achieve a very high success rate and obtain certain performance improvements.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"45-48"},"PeriodicalIF":2.3,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139979255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
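JANM-IK itself augments Nelder-Mead with Jacobian-based steps and a custom accelerator, neither of which is reproduced here. As a rough illustration of the problem the abstract describes — derivative-free inverse kinematics under joint constraints — the sketch below solves IK for a hypothetical 2-link planar arm with a plain textbook Nelder-Mead loop, enforcing joint limits by clamping plus a penalty term. The link lengths, limit ranges, coefficients, and all names are assumptions, not the paper's design.

```python
import numpy as np

# --- hypothetical 2-link planar arm (lengths and limits are assumptions) ---
L1, L2 = 1.0, 1.0
LIMITS = np.array([[-np.pi, np.pi],   # joint 1 range
                   [0.0,    np.pi]])  # joint 2 range (elbow bends one way)

def forward(theta):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def ik_cost(theta, target):
    """Squared position error, with joint limits enforced by clamping plus a
    penalty that pushes the search back into the feasible range."""
    clamped = np.clip(theta, LIMITS[:, 0], LIMITS[:, 1])
    return (np.sum((forward(clamped) - target) ** 2)
            + np.sum((theta - clamped) ** 2))

def nelder_mead(f, x0, step=0.5, iters=500):
    """Textbook Nelder-Mead: reflection / expansion / contraction / shrink."""
    x0 = np.asarray(x0, float)
    simplex = [x0] + [x0 + step * e for e in np.eye(len(x0))]
    for _ in range(iters):
        simplex.sort(key=f)                               # best first
        centroid = np.mean(simplex[:-1], axis=0)
        xr = centroid + (centroid - simplex[-1])          # reflect worst point
        if f(simplex[0]) <= f(xr) < f(simplex[-2]):
            simplex[-1] = xr
        elif f(xr) < f(simplex[0]):                       # try expanding
            xe = centroid + 2.0 * (xr - centroid)
            simplex[-1] = xe if f(xe) < f(xr) else xr
        else:                                             # contract
            xc = centroid + 0.5 * (simplex[-1] - centroid)
            if f(xc) < f(simplex[-1]):
                simplex[-1] = xc
            else:                                         # shrink toward best
                simplex = [simplex[0] + 0.5 * (p - simplex[0]) for p in simplex]
    return min(simplex, key=f)

target = np.array([1.2, 0.8])             # reachable: |target| < L1 + L2
sol = np.clip(nelder_mead(lambda th: ik_cost(th, target), [0.3, 0.5]),
              LIMITS[:, 0], LIMITS[:, 1])
print(forward(sol))                       # close to target, within joint limits
```

The clamp-plus-penalty trick is one simple way to keep a derivative-free search inside joint limits; the paper's contribution is a different, hardware-friendly combination of Jacobian and Nelder-Mead updates.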