
Latest Publications from IEEE Computer Architecture Letters

Improving Energy-Efficiency of Capsule Networks on Modern GPUs
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-23 · DOI: 10.1109/LCA.2024.3365149
Mohammad Hafezan;Ehsan Atoofian
Convolutional neural networks (CNNs) have become a compelling solution in machine learning applications, as they surpass human-level accuracy on a certain set of tasks. Despite the success of CNNs, they classify images based on the identification of specific features, ignoring the spatial relationships between different features due to the pooling layer. The capsule network (CapsNet) architecture proposed by Google Brain's team is an attempt to address this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship with other features. However, the success of the CapsNet comes at the cost of underutilization of resources when it is run on a modern GPU equipped with tensor cores (TCs). Due to the structure of capsules in the CapsNet, functional units in a TC are quite often underutilized, which prolongs the execution of capsule layers and increases energy consumption. In this work, we propose an architecture to eliminate ineffectual operations and improve the energy efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy efficiency by 15% while maintaining the accuracy of CapsNets.
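A back-of-the-envelope sketch of the underutilization the letter targets (our illustration only: the 16×16×16 tile shape and the standard 4×4 CapsNet pose matrices are assumptions, not the paper's measurements):

```python
# Sketch (not the paper's design): estimate tensor-core utilization when
# CapsNet's small 4x4 pose-matrix products are zero-padded into a TC tile.

TC_M, TC_N, TC_K = 16, 16, 16       # assumed tensor-core tile dimensions
POSE = 4                            # CapsNet pose matrices are 4x4

def tc_utilization(m: int, n: int, k: int) -> float:
    """Fraction of the tile's multiply-add slots doing useful work when
    an (m x k) x (k x n) product is padded into one tile."""
    return (m * n * k) / (TC_M * TC_N * TC_K)

# A single 4x4 pose-matrix product fills only a corner of the tile.
print(f"one 4x4x4 product per tile: {tc_utilization(POSE, POSE, POSE):.1%}")
# Even packing four independent pose products block-diagonally leaves most
# of the tile idle -- the ineffectual operations the letter eliminates.
print(f"four products packed:       {4 * POSE**3 / (TC_M*TC_N*TC_K):.1%}")
```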
Citations: 0
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-07 · DOI: 10.1109/LCA.2024.3363492
Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal
Since large language models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet, its training overhead is prohibitively large for LLM fine-tuning. In particular, Differentiable KMeans Clustering (DKM) has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on the CPU for the backward pass of DKM, we compress the tensor by applying uniquification and sharding after checking that no duplicate tensor has previously been copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130×, while delivering good accuracy on broader LLM benchmarks (i.e., 77.7% for PIQA, 66.1% for WinoGrande, and so on).
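A minimal sketch of the duplicate-check, uniquification, and sharding idea described in the abstract, with NumPy standing in for the tensor library; all names and sizes below are ours, not the authors' implementation:

```python
import numpy as np

# Sketch of eDKM's save-for-backward path (our reading, not the paper's code):
# (1) skip the CPU copy if an identical tensor was already saved,
# (2) otherwise store unique values plus a narrow index tensor
#     ("uniquification"), split into shards.

_saved = {}  # fingerprint -> (unique values, index shards, original shape)

def save_for_backward(t: np.ndarray, n_shards: int = 4):
    key = hash(t.tobytes())                 # duplicate check before copying
    if key in _saved:
        return key                          # reuse the earlier CPU copy
    uniq, inv = np.unique(t.ravel(), return_inverse=True)
    inv = inv.astype(np.uint16)             # clustered weights -> few uniques
    shards = np.array_split(inv, n_shards)  # shard the narrow index tensor
    _saved[key] = (uniq, shards, t.shape)
    return key

def restore(key) -> np.ndarray:
    uniq, shards, shape = _saved[key]
    return uniq[np.concatenate(shards)].reshape(shape)

w = np.random.choice(np.linspace(-1.0, 1.0, 8), size=(64, 64))  # 8 clusters
k = save_for_backward(w)
assert np.allclose(restore(k), w)
```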
Citations: 0
R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-02-05 · DOI: 10.1109/LCA.2024.3361925
Lieven Eeckhout
How to accurately summarize average performance is challenging. While geometric mean speedup is prevalently used, it is meaningless. Instead, this paper argues for harmonic mean speedup, which accurately summarizes how much faster a workload executes on a target system relative to a baseline. We propose the equal-work and equal-time harmonic mean speedup metrics to explicitly expose the different assumptions they make, and we further suggest that equal-work speedup is most relevant to computer architecture research. The paper demonstrates that which average speedup is used matters in practice, as inappropriate averages may lead to incorrect conclusions.
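A small runnable illustration of the argument (our generic reading of the metrics, not code from the paper; the letter defines the exact equal-work and equal-time weightings):

```python
from math import prod

def geomean_speedup(s):
    return prod(s) ** (1 / len(s))

def harmonic_speedup(s, weights=None):
    # Weighted total-time ratio; the letter's equal-work and equal-time
    # variants differ in which per-workload weights w_i they assume.
    # With uniform weights this is the plain harmonic mean of speedups.
    w = weights or [1.0] * len(s)
    return sum(w) / sum(wi / si for wi, si in zip(w, s))

s = [4.0, 0.25]                  # one workload 4x faster, one 4x slower
print(geomean_speedup(s))        # 1.0   -> suggests "no overall change"
print(harmonic_speedup(s))       # ~0.47 -> the suite as a whole runs slower
```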
Citations: 0
Baobab Merkle Tree for Efficient Secure Memory
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-31 · DOI: 10.1109/LCA.2024.3360709
Samuel Thomas;Kidus Workneh;Ange-Thierry Ishimwe;Zack McKevitt;Phaedra Curlin;R. Iris Bahar;Joseph Izraelevitz;Tamara Lehman
Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental challenges of performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less work has gone into optimizing its memory overhead. In this work, we propose the Baobab Merkle Tree, in which counters are memoized in an on-chip table. The Baobab Merkle Tree reduces the spatial overhead of a Bonsai Merkle Tree by 2-4× without incurring performance overhead.
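A toy model of the memoization idea as we read it (not the paper's hardware): many integrity-tree counter blocks hold identical values, so nodes can store a short index into a shared on-chip table instead of a full counter block. Structure, names, and sizes below are ours:

```python
# Toy sketch of counter memoization; block width is illustrative.

class CounterTable:
    def __init__(self):
        self.table = []   # on-chip table of distinct counter blocks
        self.index = {}   # counter block -> table slot

    def intern(self, block: tuple) -> int:
        """Return a small index for this counter block, memoizing it."""
        if block not in self.index:
            self.index[block] = len(self.table)
            self.table.append(block)
        return self.index[block]

tab = CounterTable()
# Freshly initialized regions share one all-zero block -> one table entry.
node_ids = [tab.intern((0,) * 8) for _ in range(1000)]
print(len(tab.table))     # 1 distinct block backing 1000 tree nodes
```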
Citations: 0
Primate: A Framework to Automatically Generate Soft Processors for Network Applications
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-26 · DOI: 10.1109/LCA.2024.3358839
Rui Ma;Jia-Ching Hsu;Ali Mansoorshahi;Joseph Garvey;Michael Kinsner;Deshanand Singh;Derek Chiou
Overlay processors on FPGAs enable i) software programmability through sequential code calling library functions, ii) high performance by converting the library calls to invocations of corresponding accelerators, and iii) faster deployment than reprogramming the FPGA. Traditionally, overlays have been hand-written in RTL and programmed through handwritten assembly. We present the Primate framework, which automatically generates overlays from applications written in annotated C++. We evaluated Primate on Whippersnapper (Dang et al. 2017) P4 benchmarks. Primate Overlay latencies are 0.06x - 0.15x compared to PISCES (Shahbaz et al. 2016), a high-performance CPU solution, and 0.25x - 2.3x compared to solutions generated by P4FPGA (Wang et al. 2017), a P4 HLS compiler on FPGA.
Citations: 0
Efficient Memory Layout for Pre-Alignment Filtering of Long DNA Reads Using Racetrack Memory
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-19 · DOI: 10.1109/LCA.2024.3350701
Asif Ali Khan;Fazal Hameed;Taha Shahroodi;Alex K. Jones;Jeronimo Castrillon
DNA sequence alignment is a fundamental and computationally expensive operation in bioinformatics. Researchers have developed pre-alignment filters that effectively reduce the amount of data consumed by the alignment process by discarding locations that result in a poor match. However, the filtering operation itself is memory-intensive, and conventional von Neumann architectures perform poorly on it. Therefore, recent designs advocate compute near memory (CNM) accelerators based on stacked DRAM and more exotic memory technologies such as racetrack memories (RTMs). However, these designs only support small DNA reads of circa 100 nucleotides, referred to as short reads. This letter proposes a CNM system for handling both long and short reads. It introduces a novel data-placement solution that significantly increases parallelism and reduces overhead. Evaluation results show substantial reductions in execution time (1.32×) and energy consumption (50%) compared to the state-of-the-art.
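For context, a generic pre-alignment filter sketch (the letter's contribution is the CNM/racetrack data layout, not this particular algorithm; k-mer size, threshold, and inputs below are illustrative):

```python
# Discard a candidate mapping location if too few of the read's k-mers
# occur in the reference window -- the alignment step then only runs on
# locations that pass this cheap filter.

def kmers(seq: str, k: int):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def passes_filter(read: str, ref_window: str, k: int = 8,
                  min_shared_frac: float = 0.4) -> bool:
    shared = kmers(read, k) & kmers(ref_window, k)
    return len(shared) >= min_shared_frac * (len(read) - k + 1)

read = "ACGTTGCATCGGATCCAGTACGGATTAC"
good = read[:10] + "A" + read[11:]       # window with a single mismatch
bad = "TTTTTTTTTTTTTTTTTTTTTTTTTTTT"     # unrelated window
print(passes_filter(read, good), passes_filter(read, bad))  # True False
```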
Citations: 0
DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-17 · DOI: 10.1109/LCA.2024.3355178
Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos
Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of N:128, or N:256, for small values of N, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises one write port and N read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.
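A software sketch of the relaxed N:M storage format implied above, using the common "at most N non-zeros per group of M" convention; this is our illustration, not the DeMM hardware, and the sizes are arbitrary:

```python
import numpy as np

def check_n_m(w: np.ndarray, n: int, m: int) -> bool:
    """Every group of m contiguous values holds at most n non-zeros."""
    groups = w.reshape(-1, m)
    return bool(((groups != 0).sum(axis=1) <= n).all())

def compress_row(row: np.ndarray, n: int, m: int):
    """Store n values plus column indices per group -- one value per
    read port, matching DeMM's N read ports per row."""
    vals, cols = [], []
    for g in range(0, row.size, m):
        nz = np.flatnonzero(row[g:g + m])[:n]
        vals.extend(row[g + nz]); cols.extend(g + nz)
    return np.array(vals), np.array(cols)

def spmv_row(vals, cols, x):
    # Row-wise product: only stored non-zeros reach the multipliers.
    return float((vals * x[cols]).sum())

rng = np.random.default_rng(0)
m, n = 128, 8                         # relaxed pattern, e.g. 8:128
row = np.zeros(512)
for g in range(0, 512, m):            # place up to n non-zeros per group
    row[g + rng.choice(m, size=n, replace=False)] = rng.standard_normal(n)
x = rng.standard_normal(512)
assert check_n_m(row, n, m)
vals, cols = compress_row(row, n, m)
assert np.isclose(spmv_row(vals, cols, x), row @ x)
```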
Citations: 0
Direct-Coding DNA With Multilevel Parallelism
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-01-17 · DOI: 10.1109/LCA.2024.3355109
Caden Corontzos;Eitan Frachtenberg
The cost and time to sequence entire genomes have been on a steady and rapid decline since the early 2000s, leading to an explosion of genomic data. In contrast, the growth rates for digital storage device capacity, CPU clock speed, and networking bandwidth have been much more moderate. This gap means that the need for storing, transmitting, and processing sequenced genomic data is outpacing the capacities of the underlying technologies. Compounding the problem is the fact that traditional data compression techniques used for natural language or images are not optimal for genomic data. To address this challenge, many data-compression techniques have been developed, offering a range of tradeoffs between compression ratio, computation time, memory requirements, and complexity. This paper focuses on a specific technique at one extreme of this tradeoff, namely two-bit coding, wherein every base in a genomic sequence is compressed from its original 8-bit ASCII representation to a unique two-bit binary representation. Even for this simple direct-coding scheme, current implementations leave room for significant performance improvements. Here, we show that this encoding can exploit multiple levels of parallelism in modern computer architectures to maximize encoding and decoding efficiency. Our open-source implementation achieves encoding and decoding rates of billions of bases per second, which are much higher than previously reported results. In fact, our measured throughput is typically limited only by the speed of the underlying storage media.
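The two-bit coding itself is simple; a scalar sketch follows (the paper's contribution is exploiting SIMD and multicore parallelism around this, which the sketch omits):

```python
# Each base maps to 2 bits, so 4 bases pack into one byte.
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DEC = "ACGT"

def encode(seq: str) -> bytes:
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENC[base] << (2 * (i % 4))  # low bits hold earlier bases
    return bytes(out)

def decode(buf: bytes, n: int) -> str:
    return "".join(DEC[(buf[i // 4] >> (2 * (i % 4))) & 0b11]
                   for i in range(n))

s = "GATTACA"
packed = encode(s)
assert decode(packed, len(s)) == s
print(len(s), "bases ->", len(packed), "bytes")  # 7 bases -> 2 bytes
```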
Citations: 0
UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-12-13 · DOI: 10.1109/LCA.2023.3342130
Nikhil Agarwal;Mitchell Fream;Souradip Ghosh;Brian C. Schwedock;Nathan Beckmann
Specialized hardware accelerators have gained traction as a means to improve energy efficiency over inefficient von Neumann cores. However, as specialized hardware is limited to a few applications, there is increasing interest in programmable, non-von Neumann architectures to improve efficiency on a wider range of programs. Reconfigurable dataflow architectures (RDAs) are a promising design, but the design space is fragmented and, in particular, existing compiler and software stacks are ad hoc and hard to use. Without a robust, mature software ecosystem, RDAs lose much of their advantage over specialized hardware. This letter proposes a unifying dataflow intermediate representation (UDIR) for RDA compilers. Popular von Neumann compiler representations are inadequate for dataflow architectures because they do not represent the dataflow control paradigm, which is the target of many common compiler analyses and optimizations. UDIR introduces contexts to break regions of instruction reuse in programs. Contexts generalize prior dataflow control paradigms, representing where in the program tokens must be synchronized. We evaluate UDIR on four prior dataflow architectures, providing simple rewrite rules to lower UDIR to their respective machine-specific representations, and demonstrate a case study of using UDIR to optimize memory ordering.
Citations: 0
DRAMA: Commodity DRAM Based Content Addressable Memory
IF 2.3 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-12-12 · DOI: 10.1109/LCA.2023.3341830
L. Yavits
Fast parallel search capabilities on large datasets, as provided by content addressable memories (CAMs), are required across multiple application domains. However, compared to RAM, CAMs feature high area overhead and power consumption, and as a result, they scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM), and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs the compare operation in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single-bit compare (XNOR) in DRAMA is identical to a regular DRAM read. The AND and OR operations required for NAND CAM and NOR CAM, respectively, are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that DRAMA can achieve 3.6× higher performance and 19.6× lower power consumption compared to a state-of-the-art CMOS CAM based genome classification accelerator.
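A software model of the bit-serial search described above (our illustration of the idea only; in DRAMA the per-bit XNOR is realized as an ordinary DRAM read addressed by the query bit, and the AND/OR combining uses nonstandard DRAM timing):

```python
import numpy as np

def cam_search(stored: np.ndarray, query: np.ndarray) -> np.ndarray:
    """stored: (rows, bits) 0/1 matrix; query: (bits,) 0/1 vector.
    Returns a 0/1 match vector over rows, computed one bit at a time."""
    match = np.ones(stored.shape[0], dtype=bool)
    for b in range(stored.shape[1]):      # bit-serial loop over query bits
        xnor = stored[:, b] == query[b]   # one "column read" per bit
        match &= xnor                     # AND across bit positions
    return match.astype(np.uint8)

stored = np.array([[0, 1, 1, 0],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1]], dtype=np.uint8)
query = np.array([0, 1, 1, 0], dtype=np.uint8)
print(cam_search(stored, query))          # [1 0 0] -> only row 0 matches
```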
Citations: 0