Improving Energy-Efficiency of Capsule Networks on Modern GPUs
Pub Date: 2024-02-23 | DOI: 10.1109/LCA.2024.3365149
Mohammad Hafezan;Ehsan Atoofian
Convolutional neural networks (CNNs) have become a compelling solution in machine learning applications, as they surpass human-level accuracy on certain tasks. Despite their success, CNNs classify images by identifying specific features while ignoring the spatial relationships between those features, a consequence of the pooling layer. The capsule network (CapsNet) architecture proposed by Google Brain's team addresses this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship to other features. However, this comes at the cost of resource underutilization when the CapsNet runs on a modern GPU equipped with tensor cores (TCs). Due to the structure of capsules, functional units in a TC are often underutilized, which prolongs the execution of capsule layers and increases energy consumption. In this work, we propose an architecture that eliminates ineffectual operations and improves the energy efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy efficiency by 15% while maintaining the accuracy of CapsNets.
{"title":"Improving Energy-Efficiency of Capsule Networks on Modern GPUs","authors":"Mohammad Hafezan;Ehsan Atoofian","doi":"10.1109/LCA.2024.3365149","DOIUrl":"10.1109/LCA.2024.3365149","url":null,"abstract":"Convolutional neural networks (CNNs) have become the compelling solution in machine learning applications as they surpass human-level accuracy in a certain set of tasks. Despite the success of CNNs, they classify images based on the identification of specific features, ignoring the spatial relationships between different features due to the pooling layer. The capsule network (CapsNet) architecture proposed by Google Brain's team is an attempt to address this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship with other features. However, the success of the CapsNet comes at the cost of underutilization of resources when it is run on a modern GPU equipped with tensor cores (TCs). Due to the structure of capsules in the CapsNet, quite often, functional units in a TC are underutilized which prolong the execution of capsule layers and increase energy consumption. In this work, we propose an architecture to eliminate ineffectual operations and improve energy-efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy-efficiency by 15% while maintaining the accuracy of CapsNets.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"49-52"},"PeriodicalIF":2.3,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models
Pub Date: 2024-02-07 | DOI: 10.1109/LCA.2024.3363492
Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal
Since large language models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing them to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet its training overhead is prohibitive for LLM fine-tuning. In particular, differentiable k-means clustering (DKM) has shown the state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on the CPU for DKM's backward pass, we compress the tensor by applying uniquification and sharding, after first checking that no duplicate of it has already been copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset, reducing the train-time memory footprint of a decoder layer by 130× while delivering good accuracy on broader LLM benchmarks (e.g., 77.7% for PIQA and 66.1% for WinoGrande).
{"title":"eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models","authors":"Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal","doi":"10.1109/LCA.2024.3363492","DOIUrl":"10.1109/LCA.2024.3363492","url":null,"abstract":"Since Large Language Models or LLMs have demonstrated high-quality performance on many complex language tasks, there is a great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight-clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. Especially, Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose a memory-efficient DKM implementation, eDKM powered by novel techniques to reduce the memory footprint of DKM by orders of magnitudes. For a given tensor to be saved on CPU for the backward pass of DKM, we compressed the tensor by applying uniquification and sharding after checking if there is no duplicated tensor previously copied to CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 b/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130×, while delivering good accuracy on broader LLM benchmarks (i.e., 77.7% for PIQA, 66.1% for Winograde, and so on).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"37-40"},"PeriodicalIF":2.3,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R.I.P. Geomean Speedup: Use Equal-Work (or Equal-Time) Harmonic Mean Speedup Instead
Pub Date: 2024-02-05 | DOI: 10.1109/LCA.2024.3361925
Lieven Eeckhout
Accurately summarizing average performance is challenging. While the geometric mean speedup is prevalently used, it is meaningless. Instead, this paper argues for the harmonic mean speedup, which accurately summarizes how much faster a workload executes on a target system relative to a baseline. We propose the equal-work and equal-time harmonic mean speedup metrics to explicitly expose the different assumptions they make, and we further suggest that equal-work speedup is most relevant to computer architecture research. The paper demonstrates that the choice of average speedup matters in practice, as inappropriate averages may lead to incorrect conclusions.
{"title":"R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead","authors":"Lieven Eeckhout","doi":"10.1109/LCA.2024.3361925","DOIUrl":"10.1109/LCA.2024.3361925","url":null,"abstract":"How to accurately summarize average performance is challenging. While geometric mean speedup is prevalently used, it is meaningless. Instead, this paper argues for harmonic mean speedup which accurately summarizes how much faster a workload executes on a target system relative to a baseline. We propose the equal-work and equal-time harmonic mean speedup metrics to explicitly expose the different assumptions they make, and we further suggest that equal-work speedup is most relevant to computer architecture research. The paper demonstrates that which average speedup is used matters in practice as inappropriate averages may lead to incorrect conclusions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"78-82"},"PeriodicalIF":2.3,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baobab Merkle Tree for Efficient Secure Memory
Pub Date: 2024-01-31 | DOI: 10.1109/LCA.2024.3360709
Samuel Thomas;Kidus Workneh;Ange-Thierry Ishimwe;Zack McKevitt;Phaedra Curlin;R. Iris Bahar;Joseph Izraelevitz;Tamara Lehman
Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less has gone into optimizing its memory overhead. In this work, we propose the Baobab Merkle Tree, in which counters are memoized in an on-chip table. The Baobab Merkle Tree reduces the spatial overhead of a Bonsai Merkle Tree by 2-4× without incurring performance overhead.
{"title":"Baobab Merkle Tree for Efficient Secure Memory","authors":"Samuel Thomas;Kidus Workneh;Ange-Thierry Ishimwe;Zack McKevitt;Phaedra Curlin;R. Iris Bahar;Joseph Izraelevitz;Tamara Lehman","doi":"10.1109/LCA.2024.3360709","DOIUrl":"10.1109/LCA.2024.3360709","url":null,"abstract":"Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental challenges of performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less work has gone into optimizing its memory overhead. In this work, we propose the \u0000<italic>Baobab Merkle Tree</i>\u0000, in which counters are memoized in an on-chip table. The Baobab Merkle Tree reduces spatial overhead of a Bonsai Merkle Tree by 2-4X without incurring performance overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"33-36"},"PeriodicalIF":2.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Primate: A Framework to Automatically Generate Soft Processors for Network Applications
Rui Ma;Jia-Ching Hsu;Ali Mansoorshahi;Joseph Garvey;Michael Kinsner;Deshanand Singh;Derek Chiou
Pub Date: 2024-01-26 | DOI: 10.1109/LCA.2024.3358839
Overlay processors on FPGAs enable i) software programmability through sequential code calling library functions, ii) high performance by converting the library calls into invocations of corresponding accelerators, and iii) faster deployment than reprogramming the FPGA. Traditionally, overlays have been hand-written in RTL and programmed through handwritten assembly. We present the Primate framework, which automatically generates overlays from applications written in annotated C++. We evaluated Primate on Whippersnapper (Dang et al., 2017) P4 benchmarks. Primate overlay latencies are 0.06×-0.15× those of PISCES (Shahbaz et al., 2016), a high-performance CPU solution, and 0.25×-2.3× those of solutions generated by P4FPGA (Wang et al., 2017), a P4 HLS compiler for FPGAs.
{"title":"Primate: A Framework to Automatically Generate Soft Processors for Network Applications","authors":"Rui Ma;Jia-Ching Hsu;Ali Mansoorshahi;Joseph Garvey;Michael Kinsner;Deshanand Singh;Derek Chiou","doi":"10.1109/LCA.2024.3358839","DOIUrl":"10.1109/LCA.2024.3358839","url":null,"abstract":"Overlay processors on FPGAs enable i) software programmability through sequential code calling library functions, ii) high performance by converting the library calls to invocations of corresponding accelerators, and iii) faster deployment than reprogramming the FPGA. Traditionally, overlays have been hand-written in RTL and programmed through handwritten assembly. We present the Primate framework, which automatically generates overlays from applications written in annotated C++. We evaluated Primate on Whippersnapper (Dang et al. 2017) P4 benchmarks. Primate Overlay latencies are 0.06x - 0.15x compared to PISCES (Shahbaz et al. 2016), a high-performance CPU solution, and 0.25x - 2.3x compared to solutions generated by P4FPGA (Wang et al. 2017), a P4 HLS compiler on FPGA.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"57-60"},"PeriodicalIF":2.3,"publicationDate":"2024-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Memory Layout for Pre-Alignment Filtering of Long DNA Reads Using Racetrack Memory
Pub Date: 2024-01-19 | DOI: 10.1109/LCA.2024.3350701
Asif Ali Khan;Fazal Hameed;Taha Shahroodi;Alex K. Jones;Jeronimo Castrillon
DNA sequence alignment is a fundamental and computationally expensive operation in bioinformatics. Researchers have developed pre-alignment filters that effectively reduce the amount of data consumed by the alignment process by discarding locations that result in a poor match. However, the filtering operation itself is memory-intensive, and conventional von Neumann architectures perform poorly on it. Therefore, recent designs advocate compute-near-memory (CNM) accelerators based on stacked DRAM and more exotic memory technologies such as racetrack memories (RTMs). However, these designs support only small DNA reads of circa 100 nucleotides, referred to as short reads. This letter proposes a CNM system that handles both long and short reads. It introduces a novel data-placement solution that significantly increases parallelism and reduces overhead. Evaluation results show substantial reductions in execution time (1.32×) and energy consumption (50%) compared to the state-of-the-art.
{"title":"Efficient Memory Layout for Pre-Alignment Filtering of Long DNA Reads Using Racetrack Memory","authors":"Asif Ali Khan;Fazal Hameed;Taha Shahroodi;Alex K. Jones;Jeronimo Castrillon","doi":"10.1109/LCA.2024.3350701","DOIUrl":"10.1109/LCA.2024.3350701","url":null,"abstract":"DNA sequence alignment is a fundamental and computationally expensive operation in bioinformatics. Researchers have developed \u0000<i>pre-alignment</i>\u0000 filters that effectively reduce the amount of data consumed by the alignment process by discarding locations that result in a poor match. However, the filtering operation itself is memory-intensive for which the conventional Von-Neumann architectures perform poorly. Therefore, recent designs advocate compute near memory (CNM) accelerators based on stacked DRAM and more exotic memory technologies such as \u0000<i>racetrack memories</i>\u0000 (RTM). However, these designs only support small DNA reads of circa 100 nucleotides, referred to as \u0000<i>short reads</i>\u0000. This letter proposes a CNM system for handling both long and short reads. It introduces a novel data-placement solution that significantly increases parallelism and reduces overhead. Evaluation results show substantial reductions in execution time (\u0000<inline-formula><tex-math>$1.32times$</tex-math></inline-formula>\u0000) and energy consumption (50%), compared to the state-of-the-art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"129-132"},"PeriodicalIF":2.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity
Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos
Pub Date: 2024-01-17 | DOI: 10.1109/LCA.2024.3355178
Deep learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution for reducing the footprint of DL models in mobile applications without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4 or 2:4, whereby in each group of four contiguous values, at least one or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of N:128 or N:256, for small values of N, targeting a wider range of sparsity (10%-90%) for DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises one write port and N read ports, with the number of read ports equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise, product-first order. More importantly, simple reconfiguration facilitates denser patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.
{"title":"DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity","authors":"Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos","doi":"10.1109/LCA.2024.3355178","DOIUrl":"10.1109/LCA.2024.3355178","url":null,"abstract":"Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000:128, or \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000:256, for small values of \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000 read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"17-20"},"PeriodicalIF":2.3,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139528673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Direct-Coding DNA With Multilevel Parallelism
Pub Date: 2024-01-17 | DOI: 10.1109/LCA.2024.3355109
Caden Corontzos;Eitan Frachtenberg
The cost and time to sequence entire genomes have been on a steady and rapid decline since the early 2000s, leading to an explosion of genomic data. In contrast, the growth rates of digital storage capacity, CPU clock speed, and networking bandwidth have been much more moderate. This gap means that the need to store, transmit, and process sequenced genomic data is outpacing the capacities of the underlying technologies. Compounding the problem, traditional data compression techniques used for natural language or images are not optimal for genomic data. To address this challenge, many data-compression techniques have been developed, offering a range of tradeoffs between compression ratio, computation time, memory requirements, and complexity. This paper focuses on a technique at one extreme of this tradeoff space, namely two-bit coding, wherein every base in a genomic sequence is compressed from its original 8-bit ASCII representation to a unique two-bit binary representation. Even for this simple direct-coding scheme, current implementations leave room for significant performance improvements. Here, we show that this encoding can exploit multiple levels of parallelism in modern computer architectures to maximize encoding and decoding efficiency. Our open-source implementation achieves encoding and decoding rates of billions of bases per second, much higher than previously reported results. In fact, our measured throughput is typically limited only by the speed of the underlying storage media.
{"title":"Direct-Coding DNA With Multilevel Parallelism","authors":"Caden Corontzos;Eitan Frachtenberg","doi":"10.1109/LCA.2024.3355109","DOIUrl":"10.1109/LCA.2024.3355109","url":null,"abstract":"The cost and time to sequence entire genomes have been on a steady and rapid decline since the early 2000s, leading to an explosion of genomic data. In contrast, the growth rates for digital storage device capacity, CPU clock speed, and networking bandwidth have been much more moderate. This gap means that the need for storing, transmitting, and processing sequenced genomic data is outpacing the capacities of the underlying technologies. Compounding the problem is the fact that traditional data compression techniques used for natural language or images are not optimal for genomic data. To address this challenge, many data-compression techniques have been developed, offering a range of tradeoffs between compression ratio, computation time, memory requirements, and complexity. This paper focuses on a specific technique on one extreme of this tradeoff, namely two-bit coding, wherein every base in a genomic sequence is compressed from its original 8-bit ASCII representation to a unique two-bit binary representation. Even for this simple direct-coding scheme, current implementations leave room for significant performance improvements. Here, we show that this encoding can exploit multiple levels of parallelism in modern computer architectures to maximize encoding and decoding efficiency. Our open-source implementation achieves encoding and decoding rates of billions of bases per second, which are much higher than previously reported results. In fact, our measured throughput is typically limited only by the speed of the underlying storage media.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"21-24"},"PeriodicalIF":2.3,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures
Pub Date: 2023-12-13 | DOI: 10.1109/LCA.2023.3342130
Nikhil Agarwal;Mitchell Fream;Souradip Ghosh;Brian C. Schwedock;Nathan Beckmann
Specialized hardware accelerators have gained traction as a means to improve energy efficiency over inefficient von Neumann cores. However, as specialized hardware is limited to a few applications, there is increasing interest in programmable, non-von Neumann architectures to improve efficiency on a wider range of programs. Reconfigurable dataflow architectures (RDAs) are a promising design, but the design space is fragmented and, in particular, existing compiler and software stacks are ad hoc and hard to use. Without a robust, mature software ecosystem, RDAs lose much of their advantage over specialized hardware. This letter proposes a unifying dataflow intermediate representation (UDIR) for RDA compilers. Popular von Neumann compiler representations are inadequate for dataflow architectures because they do not represent the dataflow control paradigm, which is the target of many common compiler analyses and optimizations. UDIR introduces contexts to break regions of instruction reuse in programs. Contexts generalize prior dataflow control paradigms, representing where in the program tokens must be synchronized. We evaluate UDIR on four prior dataflow architectures, providing simple rewrite rules to lower UDIR to their respective machine-specific representations, and demonstrate a case study of using UDIR to optimize memory ordering.
{"title":"UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures","authors":"Nikhil Agarwal;Mitchell Fream;Souradip Ghosh;Brian C. Schwedock;Nathan Beckmann","doi":"10.1109/LCA.2023.3342130","DOIUrl":"https://doi.org/10.1109/LCA.2023.3342130","url":null,"abstract":"Specialized hardware accelerators have gained traction as a means to improve energy efficiency over inefficient von Neumann cores. However, as specialized hardware is limited to a few applications, there is increasing interest in programmable, non-von Neumann architectures to improve efficiency on a wider range of programs. Reconfigurable dataflow architectures (RDAs) are a promising design, but the design space is fragmented and, in particular, existing compiler and software stacks are ad hoc and hard to use. Without a robust, mature software ecosystem, RDAs lose much of their advantage over specialized hardware. This letter proposes a unifying dataflow intermediate representation (UDIR) for RDA compilers. Popular von Neumann compiler representations are inadequate for dataflow architectures because they do not represent the dataflow control paradigm, which is the target of many common compiler analyses and optimizations. UDIR introduces \u0000<italic>contexts</i>\u0000 to break regions of instruction reuse in programs. Contexts generalize prior dataflow control paradigms, representing where in the program tokens must be synchronized. We evaluate UDIR on four prior dataflow architectures, providing simple rewrite rules to lower UDIR to their respective machine-specific representations, and demonstrate a case study of using UDIR to optimize memory ordering.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"99-103"},"PeriodicalIF":2.3,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140818795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRAMA: Commodity DRAM Based Content Addressable Memory
Pub Date: 2023-12-12 | DOI: 10.1109/LCA.2023.3341830
L. Yavits
Fast parallel search over large datasets, as provided by content-addressable memories (CAMs), is required across multiple application domains. However, compared to RAM, CAMs incur high area overhead and power consumption and, as a result, scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM), and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs compare operations in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single-bit compare (XNOR) in DRAMA is identical to a regular DRAM read. The AND and OR operations required for NAND CAM and NOR CAM, respectively, are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that it achieves 3.6× higher performance and 19.6× lower power consumption than a state-of-the-art CMOS CAM-based genome classification accelerator.
{"title":"DRAMA: Commodity DRAM Based Content Addressable Memory","authors":"L. Yavits","doi":"10.1109/LCA.2023.3341830","DOIUrl":"10.1109/LCA.2023.3341830","url":null,"abstract":"Fast parallel search capabilities on large datasets provided by content addressable memories (CAM) are required across multiple application domains. However compared to RAM, CAMs feature high area overhead and power consumption, and as a result, they scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM) and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs compare operation in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single bit compare (XNOR) in DRAMA is identical to a regular DRAM read. AND and OR operations required for NAND CAM and NOR CAM respectively are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that DRAMA can achieve 3.6\u0000<inline-formula><tex-math>$ times $</tex-math></inline-formula>\u0000 higher performance and 19.6\u0000<inline-formula><tex-math>$ times $</tex-math></inline-formula>\u0000 lower power consumption compared to state-of-the-art CMOS CAM based genome classification accelerator.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"65-68"},"PeriodicalIF":2.3,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139160798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}