Pub Date: 2024-02-23 DOI: 10.1109/LCA.2024.3365149
Mohammad Hafezan;Ehsan Atoofian
Convolutional neural networks (CNNs) have become a compelling solution in machine learning applications, as they surpass human-level accuracy on a certain set of tasks. Despite their success, CNNs classify images by identifying specific features and, because of the pooling layer, ignore the spatial relationships between those features. The capsule network (CapsNet) architecture proposed by Google Brain's team attempts to address this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship with other features. However, the success of the CapsNet comes at the cost of underutilized resources when it runs on a modern GPU equipped with tensor cores (TCs). Because of the structure of capsules in the CapsNet, functional units in a TC are quite often underutilized, which prolongs the execution of capsule layers and increases energy consumption. In this work, we propose an architecture to eliminate ineffectual operations and improve the energy efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy efficiency by 15% while maintaining the accuracy of CapsNets.
Title: Improving Energy-Efficiency of Capsule Networks on Modern GPUs. IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 49-52.
Pub Date: 2024-02-07 DOI: 10.1109/LCA.2024.3363492
Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal
Since large language models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing them to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet its training overhead is prohibitive for LLM fine-tuning. In particular, Differentiable KMeans Clustering (DKM) has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose eDKM, a memory-efficient DKM implementation powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on the CPU for the backward pass of DKM, we compress the tensor by applying uniquification and sharding, after checking that no duplicate tensor has already been copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 b/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130×, while delivering good accuracy on broader LLM benchmarks (e.g., 77.7% for PIQA and 66.1% for WinoGrande).
Title: eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models. IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 37-40.
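The uniquification step named in the eDKM abstract can be illustrated with a small sketch. It assumes that uniquification means storing each distinct value once and replacing the original data with indices into that table; the function names `uniquify` and `reconstruct` are hypothetical, and sharding and the duplicate-tensor check are not shown. This is not the authors' implementation.

```python
def uniquify(values):
    """Split a sequence into a table of distinct values and an index list.

    Assumption: this mirrors the general idea of uniquification only
    (store each distinct value once, refer to it by index).
    """
    table = []      # distinct values, in order of first appearance
    index_of = {}   # value -> position in table
    indices = []    # one table position per original element
    for v in values:
        if v not in index_of:
            index_of[v] = len(table)
            table.append(v)
        indices.append(index_of[v])
    return table, indices


def reconstruct(table, indices):
    """Rebuild the original sequence from the table and index list."""
    return [table[i] for i in indices]


# Hypothetical weights with many repeated values.
weights = [0.5, -1.0, 0.5, 0.5, -1.0, 0.25]
table, indices = uniquify(weights)
restored = reconstruct(table, indices)
```

The saving comes from the table being much smaller than the data when values repeat often, which is plausible for low-precision weights because the number of distinct representable values is bounded.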
Pub Date: 2024-02-05 DOI: 10.1109/LCA.2024.3361925
Lieven Eeckhout
Accurately summarizing average performance is challenging. While geometric mean speedup is widely used, it is meaningless. Instead, this paper argues for harmonic mean speedup, which accurately summarizes how much faster a workload executes on a target system relative to a baseline. We propose the equal-work and equal-time harmonic mean speedup metrics to explicitly expose the different assumptions they make, and we further suggest that equal-work speedup is most relevant to computer architecture research. The paper demonstrates that the choice of average speedup matters in practice, as inappropriate averages may lead to incorrect conclusions.
Title: R.I.P. Geomean Speedup: Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead. IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 78-82.
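A small worked example may help show why the choice of average matters. The sketch below assumes that "equal-time" means every benchmark runs for the same time on the baseline, and that "equal-work" means every benchmark performs the same amount of work (here, one unit of work at a given rate such as IPC); the paper's exact definitions may differ. All function names and numbers are hypothetical.

```python
import math


def total_time_speedup(baseline_times, target_times):
    """Speedup of the whole workload: total baseline time / total target time."""
    return sum(baseline_times) / sum(target_times)


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def equal_work_speedup(baseline_rates, target_rates):
    """Assumes each benchmark does one unit of work, so time = 1 / rate."""
    baseline_time = sum(1.0 / r for r in baseline_rates)
    target_time = sum(1.0 / r for r in target_rates)
    return baseline_time / target_time


# Equal-time case: both benchmarks take 10 s on the baseline.
baseline = [10.0, 10.0]
target = [5.0, 20.0]  # one benchmark gets 2x faster, the other 2x slower
speedups = [b / t for b, t in zip(baseline, target)]

workload = total_time_speedup(baseline, target)  # 20 s / 25 s = 0.8
hmean = harmonic_mean(speedups)                  # also 0.8
gmean = geometric_mean(speedups)                 # 1.0, hides the slowdown

# Equal-work case with hypothetical per-benchmark rates (e.g., IPC).
ew = equal_work_speedup([1.0, 2.0], [2.0, 2.0])  # 1.5 / 1.0 = 1.5
```

In this example the workload takes 25 s instead of 20 s on the target, so it is slower overall. The harmonic mean of the per-benchmark speedups reports 0.8, matching the total-time ratio, whereas the geometric mean reports 1.0 and suggests no change.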
Pub Date: 2024-01-31 DOI: 10.1109/LCA.2024.3360709
Samuel Thomas;Kidus Workneh;Ange-Thierry Ishimwe;Zack McKevitt;Phaedra Curlin;R. Iris Bahar;Joseph Izraelevitz;Tamara Lehman
Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental challenges of performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less work has gone into optimizing its memory overhead. In this work, we propose the Baobab Merkle Tree