Modern computer memories have been shown to have reliability issues. Main memory is the target of a security threat called Rowhammer, in which repeated activation of aggressor rows causes bit flips in adjacent victim cells. Numerous countermeasures have been proposed; some of the most efficient rely on row access counters, with different techniques to reduce the impact on performance, energy consumption, and silicon area. In these proposals, the number of counters is calculated from the maximum number of row activations that can be issued to the protected bank. As reducing the number of counters lowers the silicon area and energy overheads, it has a direct impact on production and usage costs. In this work, we demonstrate that two of the most efficient countermeasures can have their silicon area overhead reduced by approximately 50%, without impacting the protection level, by changing their counting granularity.
"Reducing the Silicon Area Overhead of Counter-Based Rowhammer Mitigations," Loïc France; Florent Bruguier; David Novo; Maria Mushtaq; Pascal Benoit. IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 61-64. DOI: 10.1109/LCA.2023.3328824. Pub Date: 2023-10-31.
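A rough sketch of the sizing argument behind counter-based trackers: the table only needs enough counters for the rows that could plausibly reach the Rowhammer threshold within one refresh window, so counting at a coarser granularity (which halves the effective number of trackable entities) roughly halves the table. The timing parameters and threshold below are illustrative, not the letter's exact figures.

```python
# Illustrative sizing of a counter-based Rowhammer tracker (hypothetical
# parameters; not the exact scheme evaluated in the letter).
def num_counters(trefw_ns=64_000_000, trc_ns=45, threshold=50_000):
    """Upper bound on counters for a Misra-Gries-style tracker: at most
    max_activations / threshold rows can reach the threshold within one
    refresh window (tREFW), since each ACT takes at least tRC."""
    max_activations = trefw_ns // trc_ns  # ACT commands possible per window
    return max_activations // threshold

base = num_counters()                          # fine-grained counting
coarse = num_counters(threshold=50_000 * 2)    # coarser granularity: ~half
```

With these illustrative numbers, doubling the per-counter coverage halves the table size, which is where an area reduction on the order of 50% would come from.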
Pub Date: 2023-10-31. DOI: 10.1109/LCA.2023.3327952
Adam Hastings;Ryan Piersma;Simha Sethumadhavan
Across the world, governments are instituting regulations aimed at improving the state of computer security. In this paper, we propose how security regulation can be formulated and implemented at the architectural level. Our proposal, called FAIRSHARE, requires architects to spend a pre-determined fraction of system resources (e.g., execution cycles) on security but leaves the decision of how and where to spend this budget to the architects of these systems. We discuss how this can elevate security and outline the key architectural support necessary to implement such a solution. Ours is the first work at the intersection of architecture and regulation.
"Architectural Security Regulation." IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 173-176. DOI: 10.1109/LCA.2023.3327952.
We present the first architecture for a trusted execution environment for quantum computers. To protect the user's circuits, the architecture obfuscates them with decoy control pulses added during circuit transpilation by the user. The decoy pulses are removed, i.e., attenuated, by trusted hardware inside the superconducting quantum computer's fridge before they reach the qubits. This preliminary work demonstrates that protection from possibly malicious cloud providers is feasible at minimal hardware cost.
"A Quantum Computer Trusted Execution Environment," Theodoros Trochatos; Chuanqi Xu; Sanjay Deshpande; Yao Lu; Yongshan Ding; Jakub Szefer. IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 177-180. DOI: 10.1109/LCA.2023.3325852. Pub Date: 2023-10-19.
Pub Date: 2023-10-19. DOI: 10.1109/LCA.2023.3326170
Yingjie Qi;Jianlei Yang;Ao Zhou;Tong Qiao;Chunming Hu
Graph neural networks (GNNs) have gained significant popularity due to their powerful capability to extract useful representations from graph data. As the need for efficient GNN computation intensifies, a variety of programming abstractions designed to optimize GNN aggregation have emerged to facilitate acceleration. However, there is no comprehensive evaluation and analysis of existing abstractions, and thus no clear consensus on which approach is better. In this letter, we classify existing programming abstractions for GNN aggregation by the dimension of data organization and the propagation method. By constructing these abstractions on a state-of-the-art GNN library, we perform a thorough and detailed characterization study comparing their performance and efficiency, and we provide several insights for future GNN acceleration based on our analysis.
"Architectural Implications of GNN Aggregation Programming Abstractions." IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 125-128. DOI: 10.1109/LCA.2023.3326170.
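To make the idea of competing aggregation abstractions concrete, here is a minimal NumPy sketch of two common formulations of neighbor-sum aggregation: a matrix-multiplication view and a message-passing (gather/scatter) view. The classification is ours for illustration, not the letter's taxonomy, and real GNN libraries use sparse kernels rather than dense matmul.

```python
import numpy as np

# Toy graph: directed edges as (src, dst) pairs; 3 nodes with 4 features each.
edges = np.array([[0, 1], [1, 2], [2, 0], [2, 1]])
feats = np.arange(12, dtype=float).reshape(3, 4)

# 1) Matrix abstraction: build an adjacency matrix and aggregate with a
#    single matmul (an SpMM in practice).
adj = np.zeros((3, 3))
adj[edges[:, 1], edges[:, 0]] = 1.0          # row = dst, column = src
agg_matmul = adj @ feats

# 2) Message-passing abstraction: gather each edge's source features and
#    scatter-add them into the destination row.
agg_scatter = np.zeros_like(feats)
np.add.at(agg_scatter, edges[:, 1], feats[edges[:, 0]])
```

Both abstractions compute the same aggregation; they differ in data organization and propagation order, which is exactly the axis along which their hardware behavior (locality, atomics, load balance) diverges.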
Pub Date: 2023-10-13. DOI: 10.1109/LCA.2023.3323482
Hailong Li;Jaewan Choi;Yongsuk Kwon;Jung Ho Ahn
Transformer-based models have become the backbone of numerous state-of-the-art natural language processing (NLP) tasks, including large language models. Matrix multiplication, a fundamental operation in Transformer-based models, accounts for most of the execution time. While singular value decomposition (SVD) can accelerate this operation by reducing the amount of computation and the memory footprint through rank reduction, it degrades model quality because important information is difficult to preserve. Moreover, this method does not effectively utilize the resources of modern GPUs. In this paper, we propose a hardware-friendly approach: matrix multiplication based on tiled singular value decomposition (TSVD). TSVD divides a matrix into multiple tiles and performs matrix factorization on each tile using SVD. By breaking the process down into smaller regions, TSVD mitigates the loss of important data. We apply the matrices decomposed by TSVD to matrix multiplication, and our TSVD-based matrix multiplication (TSVD-matmul) exploits GPU resources more efficiently than the SVD approach. As a result, TSVD-matmul achieves a speedup of 1.03× to 3.24× over the SVD approach at compression ratios ranging from 2 to 8. When deployed to GPT-2, TSVD not only performs competitively with full fine-tuning on the E2E NLG task but also achieves a speedup of 1.06× to 1.24× at compression ratios of 2 to 8 while increasing accuracy by up to 1.5 BLEU score.
"A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models." IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 169-172. DOI: 10.1109/LCA.2023.3323482.
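A minimal NumPy sketch of the tile-wise idea: factor each tile of W with a truncated SVD and multiply in factored form. Square tiles and a single fixed rank per tile are our simplifying assumptions; the letter's GPU kernels and rank selection are more involved.

```python
import numpy as np

def tsvd_matmul(W, x, tile, rank):
    """Approximate W @ x by truncating the SVD of each (tile x tile) block
    of W to `rank` and multiplying in factored form, which costs
    O(rank * (tile + n)) per tile instead of O(tile^2 * n)."""
    m, k = W.shape
    out = np.zeros((m, x.shape[1]))
    for i in range(0, m, tile):
        for j in range(0, k, tile):
            U, s, Vt = np.linalg.svd(W[i:i+tile, j:j+tile],
                                     full_matrices=False)
            # Keep only the top-`rank` singular triplets of this tile.
            out[i:i+tile] += (U[:, :rank] * s[:rank]) @ (Vt[:rank] @ x[j:j+tile])
    return out

rng = np.random.default_rng(0)
W, x = rng.standard_normal((8, 8)), rng.standard_normal((8, 4))
approx = tsvd_matmul(W, x, tile=4, rank=2)  # rank-2 tiles: lossy but cheaper
exact = tsvd_matmul(W, x, tile=4, rank=4)   # full-rank tiles reproduce W @ x
```

Truncating per tile rather than globally is what lets locally important directions survive at the same overall rank budget, which is the quality argument the abstract makes.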
The bandwidth limit between cryogenic and room-temperature environments is a critical bottleneck in superconducting noisy intermediate-scale quantum computers. This paper presents the first trial of algorithm-aware system-level optimization to address this issue, targeting the quantum approximate optimization algorithm (QAOA). Our counter-based cryogenic architecture using single-flux-quantum logic achieves exponential bandwidth reduction and decreases the heat inflow and peripheral power consumption of inter-temperature cables, which contributes to the scalability of superconducting quantum computers.
"Inter-Temperature Bandwidth Reduction in Cryogenic QAOA Machines," Yosuke Ueno; Yuna Tomida; Teruo Tanimoto; Masamitsu Tanaka; Yutaka Tabuchi; Koji Inoue; Hiroshi Nakamura. IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 6-9. DOI: 10.1109/LCA.2023.3322700. Pub Date: 2023-10-09.
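A back-of-envelope sketch of why counter-based readout shrinks the inter-temperature link: instead of streaming one n-qubit bitstring per shot across the cryo boundary, counters inside the fridge accumulate tallies and only the tallies cross. All numbers and the counter budget below are illustrative assumptions, not the letter's SFQ design.

```python
import math

def naive_bits(n_qubits, shots):
    # Every shot's full measurement record crosses the boundary.
    return n_qubits * shots

def counter_bits(n_qubits, shots, n_counters):
    # Only per-counter tallies cross; each tally needs
    # ceil(log2(shots + 1)) bits to hold up to `shots` counts.
    return n_counters * math.ceil(math.log2(shots + 1))

shots = 10_000
naive = naive_bits(8, shots)                      # grows linearly in shots
counted = counter_bits(8, shots, n_counters=16)   # grows only as log(shots)
```

The per-shot cost grows linearly while the counter cost grows logarithmically in the number of shots, which is the intuition behind the large bandwidth reduction the abstract reports.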
Pub Date: 2023-09-29. DOI: 10.1109/LCA.2023.3320670
Seunghak Lee;Ki-Dong Kang;Gyeongseo Park;Nam Sung Kim;Daehoon Kim
Row Hammer (RH) is a circuit-level phenomenon in which repetitive activation of a DRAM row causes bit-flips in adjacent rows. Prior studies that rely on extra refreshes to mitigate the RH vulnerability demonstrate that bit-flips can be prevented effectively. However, implementing them is challenging due to the significant performance degradation and energy overhead caused by the extra refreshes. To overcome these challenges, some studies propose techniques that mitigate RH attacks without relying on extra refreshes, either by delaying the activation of an aggressor row for a certain amount of time or by swapping an aggressor row with another row to isolate it from victim rows. Although such techniques do not require extra refreshes to mitigate RH, the activation-delaying technique can cause severe performance degradation in false-positive cases, and the swapping technique requires high storage overheads to track swap information. We propose NoHammer
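A toy model of the activation-delaying technique the abstract contrasts with swapping: track per-row activation counts within a window and hold back an activation once a row looks like an aggressor. This is our simplification for illustration; the threshold and trace are made up, and the false-positive cost the abstract mentions is exactly a legitimate row tripping this check.

```python
from collections import defaultdict

THRESHOLD = 4          # illustrative; real RH thresholds are in the thousands
counts = defaultdict(int)

def activate(row):
    """Return True if the activation may proceed now, False if it must be
    delayed because `row` has hit the threshold in this window."""
    counts[row] += 1
    return counts[row] < THRESHOLD

trace = [7, 7, 7, 7, 3]
decisions = [activate(r) for r in trace]   # the 4th hit on row 7 is delayed
```

Delaying needs only these small counters but stalls real workloads on false positives; swapping avoids the stall but must remember every (aggressor, reserve) mapping, which is the storage overhead the abstract points to.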