
IEEE Computer Architecture Letters: Latest Publications

Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-06 · DOI: 10.1109/LCA.2024.3373760 · Vol. 23, No. 1, pp. 69-72
Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry
In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs, which can cause performance and correctness challenges. Using run-time monitoring tools is one approach to mitigate these challenges. However, these tools maintain metadata for every byte of application data they monitor, which incurs performance overheads from the additional metadata accesses. We propose Address Scaling, a new hardware framework that performs fine-grained metadata management to reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied granularities. Our key insight is to keep each piece of data and its corresponding metadata within the same cache line, preserving locality. Address Scaling improves the performance of Memcheck, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns respectively, compared to state-of-the-art systems that store metadata in a memory region separate from the data.
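To make the co-location idea concrete, here is a minimal C++ sketch of one possible scaled layout; the 64-byte split and the scale() helper are our own illustrative assumptions, not the paper's exact scheme. With one metadata byte per G data bytes, each line holds D data bytes followed by D/G metadata bytes, so a data byte and its metadata always share a line.

#include <cstdint>
#include <cstdio>

constexpr uint64_t kLineBytes = 64;

struct ScaledAddr {
    uint64_t data;      // scaled address of the data byte
    uint64_t metadata;  // address of its metadata, in the same line
};

ScaledAddr scale(uint64_t addr, uint64_t g) {
    uint64_t d = (kLineBytes * g) / (g + 1);  // data bytes per line
    uint64_t line = addr / d;                 // which scaled line
    uint64_t off = addr % d;                  // offset within the data region
    uint64_t base = line * kLineBytes;
    return {base + off, base + d + off / g};
}

int main() {
    ScaledAddr s = scale(/*addr=*/1000, /*g=*/8);
    printf("data @ %llu, metadata @ %llu, same line: %d\n",
           (unsigned long long)s.data, (unsigned long long)s.metadata,
           (int)(s.data / kLineBytes == s.metadata / kLineBytes));
}

For g = 8 this packs 56 data bytes and 7 metadata bytes into each 64-byte line, so the metadata access that follows a data access is served from the already-resident line.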
Citations: 0
Exploiting Direct Memory Operands in GPU Instructions
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-03-05 · DOI: 10.1109/LCA.2024.3371062 · Vol. 23, No. 2, pp. 162-165
Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad
GPUs are widely used for diverse applications, particularly data-parallel tasks like machine learning and scientific computing. However, their efficiency is hindered by an architectural limitation inherited from early RISC processors: memory loads must pass through the register file, causing high register-file contention. We observe that a significant fraction (around 26%) of the values present in the register file are typically used only once, contributing more than 25% of total register-file bank conflicts on average. This paper addresses the challenge of single-use memory values in the GPU register file (i.e., data values used only once), which waste space and increase latency. To this end, we introduce a novel mechanism inspired by CISC architectures: it replaces single-use loads with direct memory operands in arithmetic operations. Our approach improves performance by 20% and reduces energy consumption by 18%, on average, with negligible (<1%) hardware overhead.
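A compiler-level C++ sketch of the fusion idea on a toy three-address IR of our own invention (the letter's mechanism is in hardware): a load whose destination register is read exactly once is folded into its consumer as a memory operand, and the load disappears.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Inst {
    std::string op, dst;
    std::vector<std::string> src;  // register names or memory operands like "[r5]"
};

void fuseSingleUseLoads(std::vector<Inst>& code) {
    std::unordered_map<std::string, int> uses, loadIdx;
    for (int i = 0; i < (int)code.size(); ++i) {
        if (code[i].op == "ld") loadIdx[code[i].dst] = i;
        for (const auto& s : code[i].src) ++uses[s];
    }
    for (auto& in : code)
        for (auto& s : in.src) {
            auto it = loadIdx.find(s);
            if (it != loadIdx.end() && uses[s] == 1) {
                code[it->second].op = "nop";  // load no longer needed
                s = code[it->second].src[0];  // e.g., r1 -> [r5]
            }
        }
}

int main() {
    std::vector<Inst> code = {
        {"ld",  "r1", {"[r5]"}},       // single-use load
        {"add", "r2", {"r1", "r3"}},
    };
    fuseSingleUseLoads(code);
    for (const auto& in : code) {
        std::cout << in.op << " " << in.dst;
        for (const auto& s : in.src) std::cout << " " << s;
        std::cout << "\n";             // prints "nop r1 [r5]" then "add r2 [r5] r3"
    }
}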
Citations: 0
Achieving Forward Progress Guarantee in Small Hardware Transactions
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-28 · DOI: 10.1109/LCA.2024.3370992 · Vol. 23, No. 1, pp. 53-56
Mahita Nagabhiru;Gregory T. Byrd
Hardware transactional memory (HTM) continues to pique interest from academia and industry alike because of its potential to ease concurrent programming without compromising performance. It offers the programmer a simple "all-or-nothing" abstraction, making a piece of code appear atomic in hardware. Despite this, and despite many elegant HTM implementations in research, only best-effort HTM is available commercially. Best-effort HTM lacks a forward-progress guarantee, making it harder for the programmer to create a concurrent, scalable fallback path, which has limited HTM's adoption. Because they aim to support a myriad of applications, HTMs trade forward-progress guarantees against design and verification complexity. In this letter, we argue that limiting the scope of applications helps HTM attain guaranteed forward progress. We support lock-free programs by using HTM as multi-word atomics and demonstrate strategic design choices that achieve lock-freedom entirely in hardware. We use lfbench, a lock-free micro-benchmark suite, and Arm's best-effort HTM (ARM_TME) on the gem5 simulator as our base. We demonstrate the performance trade-offs among deferral-based, NACK-based, and NACK-with-backoff designs, and show that NACK-with-backoff outperforms the others without compromising scalability for both read- and write-intensive applications.
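The multi-word-atomic usage model with retry and backoff looks roughly like this in software form; htm_begin/htm_end below are stand-ins that emulate a best-effort transaction with a try-lock, since the letter's NACK-with-backoff policy is implemented in hardware.

#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>

static std::mutex htm_stub;                       // emulation only
bool htm_begin() { return htm_stub.try_lock(); }  // may "abort" under contention
void htm_end()   { htm_stub.unlock(); }

// Move `amount` between two words as one all-or-nothing step, retrying
// with exponential backoff on abort (the NACK-with-backoff shape).
bool atomicTransfer(uint64_t* from, uint64_t* to, uint64_t amount) {
    for (int attempt = 0; attempt < 16; ++attempt) {
        if (htm_begin()) {
            if (*from < amount) { htm_end(); return false; }
            *from -= amount;
            *to += amount;
            htm_end();
            return true;
        }
        std::this_thread::sleep_for(std::chrono::nanoseconds(1u << attempt));
    }
    return false;  // unreachable under a true forward-progress guarantee
}

int main() {
    uint64_t a = 100, b = 0;
    return atomicTransfer(&a, &b, 40) && a == 60 && b == 40 ? 0 : 1;
}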
Citations: 0
FullPack: Full Vector Utilization for Sub-Byte Quantized Matrix-Vector Multiplication on General Purpose CPUs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-27 · DOI: 10.1109/LCA.2024.3370402 · Vol. 23, No. 2, pp. 142-145
Hossein Katebi;Navidreza Asadi;Maziar Goudarzi
Sub-byte quantization on popular vector ISAs wastes much of the vector width as well as memory bandwidth. The latest methods pack several quantized values into one vector but must pad them with empty bits to prevent overflow into neighboring elements. Our data-layout/compute co-design scheme removes even these empty bits, fully utilizing vector and memory bandwidth. We implemented FullPack on TFLite for vector-matrix multiplication and observed up to 6.7× speedup, 2.75× on average, on single layers, which translated to 1.56-2.11× end-to-end speedup on DeepSpeech.
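A scalar C++ sketch of the dense layout (a 64-bit word stands in for a vector register here; FullPack itself targets vector ISAs and TFLite kernels): sixteen unsigned 4-bit weights are packed back-to-back with no padding bits and unpacked by shift-and-mask inside a dot product.

#include <cstdint>
#include <cstdio>

uint64_t pack4(const uint8_t w[16]) {
    uint64_t v = 0;
    for (int i = 0; i < 16; ++i)
        v |= (uint64_t)(w[i] & 0xF) << (4 * i);  // no empty bits between elements
    return v;
}

int dot4(uint64_t packed, const int8_t x[16]) {
    int acc = 0;
    for (int i = 0; i < 16; ++i)
        acc += (int)((packed >> (4 * i)) & 0xF) * x[i];
    return acc;
}

int main() {
    uint8_t w[16];
    int8_t x[16];
    for (int i = 0; i < 16; ++i) { w[i] = (uint8_t)i; x[i] = 1; }
    printf("%d\n", dot4(pack4(w), x));  // 0 + 1 + ... + 15 = 120
}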
Citations: 0
JANM-IK: Jacobian Argumented Nelder-Mead Algorithm for Inverse Kinematics and its Hardware Acceleration
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-26 · DOI: 10.1109/LCA.2024.3369940 · Vol. 23, No. 1, pp. 45-48
Yuxin Yang;Xiaoming Chen;Yinhe Han
Inverse kinematics is one of the core calculations in robotic applications and has strong performance requirements. Previous hardware-acceleration work paid little attention to joint constraints, which can lead to computational failures. We propose a new inverse kinematics algorithm, JANM-IK. It uses a hardware-friendly design, combines an optimized Jacobian-based method with the Nelder-Mead method, handles joint constraints, and converges quickly. We further design an acceleration architecture that achieves high performance through ample parallelism and hardware optimization. Experimental verification shows that JANM-IK achieves a very high success rate and delivers performance improvements.
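For flavor, a toy C++ solver for a 2-link planar arm showing the Jacobian side of such a hybrid, with joint limits enforced by clamping; this is our own illustration of constraint-aware IK, not JANM-IK itself, which also interleaves Nelder-Mead steps.

#include <algorithm>
#include <cmath>
#include <cstdio>

struct Arm { double l1, l2, q1, q2; };  // link lengths, joint angles

void fk(const Arm& a, double& x, double& y) {  // forward kinematics
    x = a.l1 * std::cos(a.q1) + a.l2 * std::cos(a.q1 + a.q2);
    y = a.l1 * std::sin(a.q1) + a.l2 * std::sin(a.q1 + a.q2);
}

// Jacobian-transpose descent; joint limits handled by clamping each update.
bool solve(Arm& a, double tx, double ty, double lo, double hi) {
    const double alpha = 0.1, tol = 1e-4;
    for (int i = 0; i < 5000; ++i) {
        double x, y;
        fk(a, x, y);
        double ex = tx - x, ey = ty - y;
        if (std::hypot(ex, ey) < tol) return true;
        double s1 = std::sin(a.q1), c1 = std::cos(a.q1);
        double s12 = std::sin(a.q1 + a.q2), c12 = std::cos(a.q1 + a.q2);
        // dq = alpha * J^T * e, where J is the 2x2 manipulator Jacobian
        a.q1 = std::clamp(a.q1 + alpha * ((-a.l1 * s1 - a.l2 * s12) * ex
                                        + ( a.l1 * c1 + a.l2 * c12) * ey), lo, hi);
        a.q2 = std::clamp(a.q2 + alpha * ((-a.l2 * s12) * ex
                                        + ( a.l2 * c12) * ey), lo, hi);
    }
    return false;
}

int main() {
    Arm a{1.0, 1.0, 0.3, 0.3};
    bool ok = solve(a, 1.0, 1.0, -3.0, 3.0);
    printf("%s q1=%.3f q2=%.3f\n", ok ? "hit" : "miss", a.q1, a.q2);
}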
Citations: 0
Improving Energy-Efficiency of Capsule Networks on Modern GPUs
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-23 · DOI: 10.1109/LCA.2024.3365149 · Vol. 23, No. 1, pp. 49-52
Mohammad Hafezan;Ehsan Atoofian
Convolutional neural networks (CNNs) have become a compelling solution in machine learning applications, surpassing human-level accuracy on a certain set of tasks. Despite their success, CNNs classify images by identifying specific features, ignoring the spatial relationships between features because of the pooling layer. The capsule network (CapsNet) architecture proposed by Google Brain's team addresses this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship with other features. However, this success comes at the cost of underutilized resources when the CapsNet runs on a modern GPU equipped with tensor cores (TCs): due to the structure of capsules, functional units in a TC are often underutilized, which prolongs the execution of capsule layers and increases energy consumption. In this work, we propose an architecture that eliminates ineffectual operations and improves the energy efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy efficiency by 15% while maintaining the accuracy of CapsNets.
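A back-of-envelope illustration of the underutilization (the shapes are our own example, not measurements from the letter): when a capsule sub-matrix must be padded up to a tensor-core tile, the padded lanes do no useful work.

#include <cstdio>

// Useful fraction of a t x t tile processing an r x c operand padded
// up to tile boundaries.
double tcUtilization(int r, int c, int t) {
    auto roundUp = [t](int v) { return ((v + t - 1) / t) * t; };
    return (double)r * c / ((double)roundUp(r) * roundUp(c));
}

int main() {
    // e.g., 8-dimensional capsule poses on a 16x16 tile: half the MACs idle.
    printf("utilization = %.0f%%\n", 100.0 * tcUtilization(8, 16, 16));
}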
Citations: 0
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-07 · DOI: 10.1109/LCA.2024.3363492 · Vol. 23, No. 1, pp. 37-40
Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal
Since large language models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing them to mobile devices for faster responses and better privacy protection. However, the size of LLMs (billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet its training overhead is prohibitive for LLM fine-tuning. In particular, Differentiable KMeans Clustering (DKM) has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a tensor to be saved on the CPU for DKM's backward pass, we first check whether an identical tensor was previously copied to the CPU, then compress the tensor by applying uniquification and sharding. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset, reducing the train-time memory footprint of a decoder layer by 130×, while delivering good accuracy on broader LLM benchmarks (i.e., 77.7% for PIQA, 66.1% for Winograde, and so on).
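A conceptual C++ sketch of the uniquification step (the hash and cache structure are our assumptions, and the sharding step is elided): before a tensor is copied to the CPU for the backward pass, an identical previously-copied tensor is reused instead.

#include <cstdint>
#include <cstring>
#include <memory>
#include <unordered_map>
#include <vector>

using Tensor = std::vector<float>;

// FNV-1a over the raw float bits; a stand-in for a real content hash
// (hash collisions are ignored for brevity).
uint64_t contentHash(const Tensor& t) {
    uint64_t h = 1469598103934665603ull;
    for (float f : t) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        h = (h ^ bits) * 1099511628211ull;
    }
    return h;
}

struct CpuSaveCache {
    std::unordered_map<uint64_t, std::shared_ptr<const Tensor>> seen;

    std::shared_ptr<const Tensor> save(const Tensor& t) {
        uint64_t h = contentHash(t);
        auto it = seen.find(h);
        if (it != seen.end()) return it->second;        // duplicate: no copy
        auto copy = std::make_shared<const Tensor>(t);  // one real transfer
        seen.emplace(h, copy);
        return copy;
    }
};

int main() {
    CpuSaveCache cache;
    Tensor a = {1.f, 2.f}, b = {1.f, 2.f};
    return cache.save(a) == cache.save(b) ? 0 : 1;      // identical -> shared
}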
Citations: 0
R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-05 · DOI: 10.1109/LCA.2024.3361925 · Vol. 23, No. 1, pp. 78-82
Lieven Eeckhout
Accurately summarizing average performance is challenging. While geometric-mean speedup is prevalently used, it is meaningless. Instead, this paper argues for harmonic-mean speedup, which accurately summarizes how much faster a workload executes on a target system relative to a baseline. We propose the equal-work and equal-time harmonic-mean speedup metrics to explicitly expose the different assumptions they make, and we further suggest that equal-work speedup is most relevant to computer architecture research. The paper demonstrates that which average speedup is used matters in practice, as inappropriate averages may lead to incorrect conclusions.
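The two averages are easy to contrast in code. A small C++ sketch, assuming each benchmark's speedup s_i covers an equal amount of work: the equal-work harmonic mean is n / sum(1/s_i), and the geometric mean can overstate the benefit.

#include <cmath>
#include <cstdio>
#include <vector>

double harmonicMean(const std::vector<double>& s) {  // equal-work speedup
    double inv = 0;
    for (double v : s) inv += 1.0 / v;
    return s.size() / inv;
}

double geoMean(const std::vector<double>& s) {       // the average the letter critiques
    double lg = 0;
    for (double v : s) lg += std::log(v);
    return std::exp(lg / s.size());
}

int main() {
    std::vector<double> s = {4.0, 1.0};  // one big win, one unchanged workload
    printf("harmonic %.2fx vs geometric %.2fx\n", harmonicMean(s), geoMean(s));
    // prints: harmonic 1.60x vs geometric 2.00x
}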
Citations: 0
Baobab Merkle Tree for Efficient Secure Memory
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-31 · DOI: 10.1109/LCA.2024.3360709 · Vol. 23, No. 1, pp. 33-36
Samuel Thomas;Kidus Workneh;Ange-Thierry Ishimwe;Zack McKevitt;Phaedra Curlin;R. Iris Bahar;Joseph Izraelevitz;Tamara Lehman
Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less has gone into optimizing its memory overhead. In this work, we propose the Baobab Merkle Tree, in which counters are memoized in an on-chip table. The Baobab Merkle Tree reduces the spatial overhead of a Bonsai Merkle Tree by 2-4× without incurring performance overhead.
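A loose C++ sketch of the memoization idea as we read it from the abstract (the actual Baobab layout, table sizing, and eviction policy are not described here): distinct counter blocks are stored once in a table, and tree positions keep short indices instead of full blocks.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using CounterBlock = std::vector<uint8_t>;  // e.g., per-line counters for one region

struct CounterTable {
    std::vector<CounterBlock> entries;                // on-chip table
    std::unordered_map<std::string, uint16_t> index;  // block contents -> entry id

    uint16_t memoize(const CounterBlock& b) {
        std::string key(b.begin(), b.end());
        auto it = index.find(key);
        if (it != index.end()) return it->second;     // shared entry, no growth
        entries.push_back(b);
        uint16_t id = (uint16_t)(entries.size() - 1);
        index.emplace(std::move(key), id);
        return id;
    }
};

int main() {
    CounterTable t;
    CounterBlock zeros(64, 0);  // freshly written regions often look identical
    uint16_t a = t.memoize(zeros), b = t.memoize(zeros);
    return (a == b && t.entries.size() == 1) ? 0 : 1;
}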
Citations: 0
Primate: A Framework to Automatically Generate Soft Processors for Network Applications
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-26 · DOI: 10.1109/LCA.2024.3358839 · Vol. 23, No. 1, pp. 57-60
Rui Ma;Jia-Ching Hsu;Ali Mansoorshahi;Joseph Garvey;Michael Kinsner;Deshanand Singh;Derek Chiou
Overlay processors on FPGAs enable i) software programmability through sequential code calling library functions, ii) high performance by converting the library calls into invocations of corresponding accelerators, and iii) faster deployment than reprogramming the FPGA. Traditionally, overlays have been hand-written in RTL and programmed through handwritten assembly. We present the Primate framework, which automatically generates overlays from applications written in annotated C++. We evaluated Primate on Whippersnapper (Dang et al. 2017) P4 benchmarks. Primate overlay latencies are 0.06×-0.15× those of PISCES (Shahbaz et al. 2016), a high-performance CPU solution, and 0.25×-2.3× those of solutions generated by P4FPGA (Wang et al. 2017), a P4 HLS compiler for FPGAs.
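For flavor, the kind of sequential C++ the letter describes as overlay input; every name below is invented for illustration and is not Primate's actual API or annotation syntax. Each library call is the unit an overlay generator could turn into an accelerator invocation.

#include <cstdint>
#include <cstdio>

struct Packet { const uint8_t* data; uint32_t len; };

uint32_t parseIpv4Dst(const Packet& p) {        // header-parse stage
    const uint8_t* d = p.data + 16;             // IPv4 destination offset
    return (uint32_t)d[0] << 24 | (uint32_t)d[1] << 16
         | (uint32_t)d[2] << 8  | (uint32_t)d[3];
}

uint16_t routeLookup(uint32_t dstIp) {          // table-lookup stage
    return (uint16_t)(dstIp & 0xF);             // toy routing function
}

void forward(const Packet& p, uint16_t port) {  // egress stage
    printf("len=%u -> port %u\n", p.len, port);
}

void process(const Packet& p) {                 // plain sequential control flow
    uint32_t dst = parseIpv4Dst(p);
    uint16_t port = routeLookup(dst);
    forward(p, port);
}

int main() {
    uint8_t raw[20] = {};
    raw[16] = 10; raw[19] = 7;                  // destination 10.0.0.7
    process({raw, sizeof raw});
}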
Citations: 0