
Latest publications in IEEE Computer Architecture Letters

Reducing the Silicon Area Overhead of Counter-Based Rowhammer Mitigations
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-10-31 · DOI: 10.1109/LCA.2023.3328824
Loïc France;Florent Bruguier;David Novo;Maria Mushtaq;Pascal Benoit
Modern computer memories have been shown to have reliability issues. Main memory is the target of a security threat called Rowhammer, in which repeated activation of aggressor rows causes bit flips in adjacent victim rows. Numerous countermeasures have been proposed; some of the most efficient rely on row-access counters, with different techniques to reduce the impact on performance, energy consumption, and silicon area. In these proposals, the number of counters is calculated from the maximum number of row activations that can be issued to the protected bank. Since reducing the number of counters lowers the silicon-area and energy overheads, it has a direct impact on production and usage costs. In this work, we demonstrate that two of the most efficient countermeasures can have their silicon-area overhead reduced by approximately 50%, without impacting the protection level, by changing their counting granularity.
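The counting-granularity trade-off can be sketched with a toy model (the tracker, threshold, and grouping below are illustrative assumptions, not the actual mitigations evaluated in the letter): counting activations per group of rows instead of per row halves the number of counters, while a shared counter still reaches the threshold no later than a per-row counter would, so worst-case protection is preserved.

```python
# Toy model of counter-based Rowhammer tracking at two counting
# granularities. Illustrative only: not the schemes from the letter.

def make_tracker(num_rows, rows_per_counter, threshold):
    """Return (counters, activate): activate(row) is True when the row's
    counter group reaches the threshold and a mitigation must fire."""
    counters = [0] * (num_rows // rows_per_counter)

    def activate(row):
        group = row // rows_per_counter
        counters[group] += 1
        return counters[group] >= threshold

    return counters, activate

NUM_ROWS, THRESHOLD = 8, 4

# Fine granularity: one counter per row.
fine_counters, fine_activate = make_tracker(NUM_ROWS, 1, THRESHOLD)
# Coarse granularity: one counter per pair of rows -> half the counters
# (and roughly half the counter silicon area). Protection is preserved
# because a shared counter fires no later than a per-row counter.
coarse_counters, coarse_activate = make_tracker(NUM_ROWS, 2, THRESHOLD)

assert len(coarse_counters) == len(fine_counters) // 2
assert [fine_activate(3) for _ in range(4)] == [False, False, False, True]
```

The cost of the coarser granularity is more false positives: benign activations to either row in a pair advance the shared counter.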
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 61–64.
Citations: 0
Architectural Security Regulation
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-10-31 · DOI: 10.1109/LCA.2023.3327952
Adam Hastings;Ryan Piersma;Simha Sethumadhavan
Across the world, governments are instituting regulations with the goal of improving the state of computer security. In this paper, we propose how security regulation can be formulated and implemented at the architectural level. Our proposal, called FAIRSHARE, requires architects to spend a predetermined fraction of system resources (e.g., execution cycles) on security, but leaves the decision of how and where to spend this budget to the architects of these systems. We discuss how this can elevate security and outline the key architectural support necessary to implement such a solution. Our work is the first at the intersection of architecture and regulation.
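The resource-fraction idea reduces to a simple compliance check; a minimal sketch, assuming a hypothetical 5% mandate and cycle accounting (neither figure is from the paper):

```python
# Minimal sketch of a FAIRSHARE-style compliance check: does a system
# spend at least a mandated fraction of its execution cycles on security?
# The 5% figure and the cycle accounting are illustrative assumptions.

MANDATED_FRACTION = 0.05   # hypothetical regulatory requirement

def is_compliant(total_cycles, security_cycles,
                 fraction=MANDATED_FRACTION):
    """True iff the security spend meets the mandated share."""
    return security_cycles >= fraction * total_cycles

# A run of 1e9 cycles, of which 60M went to checks/tagging: compliant (6%).
assert is_compliant(1_000_000_000, 60_000_000)
# Only 30M security cycles (3%) falls short of the 5% mandate.
assert not is_compliant(1_000_000_000, 30_000_000)
```

How the budget is spent (tagging, checks, scrubbing, etc.) is left open, which is the point of the proposal: the regulation fixes the fraction, not the mechanism.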
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 173–176.
Citations: 0
A Quantum Computer Trusted Execution Environment
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-10-19 · DOI: 10.1109/LCA.2023.3325852
Theodoros Trochatos;Chuanqi Xu;Sanjay Deshpande;Yao Lu;Yongshan Ding;Jakub Szefer
We present the first architecture for a trusted execution environment for quantum computers. In this architecture, the user's circuits are protected by obfuscating them with decoy control pulses that the user adds during circuit transpilation. The decoy pulses are removed, i.e., attenuated, by trusted hardware inside the superconducting quantum computer's fridge before they reach the qubits. This preliminary work demonstrates that protection from possibly malicious cloud providers is feasible at minimal hardware cost.
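The obfuscate-then-attenuate flow can be illustrated with a toy model. Note a loud assumption: here the decoys carry a plaintext flag for readability, whereas in the real design the decoy positions are a secret shared with the trusted hardware so the cloud provider cannot strip them.

```python
# Toy model of the decoy-pulse obfuscation: the user inserts decoy control
# pulses during transpilation; the trusted hardware inside the fridge
# attenuates (drops) them before the pulses reach the qubits.
# Illustrative sketch only; structures are assumptions.

def obfuscate(pulses, decoy_positions):
    """User side: interleave decoy pulses before the given positions."""
    out = []
    for i, pulse in enumerate(pulses):
        if i in decoy_positions:
            out.append({"pulse": "DECOY", "decoy": True})
        out.append({"pulse": pulse, "decoy": False})
    return out

def attenuate(stream):
    """Trusted-hardware side: drop pulses marked as decoys."""
    return [p["pulse"] for p in stream if not p["decoy"]]

sent = obfuscate(["X", "H", "CZ"], decoy_positions={1})
assert len(sent) == 4                       # one decoy hides the circuit shape
assert attenuate(sent) == ["X", "H", "CZ"]  # the qubits see the real circuit
```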
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 177–180.
Citations: 0
Architectural Implications of GNN Aggregation Programming Abstractions
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-10-19 · DOI: 10.1109/LCA.2023.3326170
Yingjie Qi;Jianlei Yang;Ao Zhou;Tong Qiao;Chunming Hu
Graph neural networks (GNNs) have gained significant popularity due to their powerful capability to extract useful representations from graph data. As the need for efficient GNN computation intensifies, a variety of programming abstractions designed for optimizing GNN aggregation have emerged to facilitate acceleration. However, there is no comprehensive evaluation and analysis of existing abstractions, and thus no clear consensus on which approach is better. In this letter, we classify existing programming abstractions for GNN aggregation along the dimensions of data organization and propagation method. By constructing these abstractions on a state-of-the-art GNN library, we perform a thorough and detailed characterization study to compare their performance and efficiency, and based on our analysis we provide several insights into future GNN acceleration.
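Two of the abstraction families the letter classifies can be sketched on a tiny graph (dense adjacency for clarity; real libraries use sparse formats): an edge-centric gather-scatter loop and a matrix-centric formulation compute the same aggregate but organize data and propagation very differently, which is what drives their architectural behavior.

```python
import numpy as np

# Two programming abstractions for the same GNN aggregation
# (sum over in-neighbors), sketched on a 3-node graph.

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]          # (src, dst) pairs
feats = np.arange(6, dtype=float).reshape(3, 2)   # 3 nodes, 2 features each

# Abstraction 1: edge-centric gather-scatter (message passing).
agg_scatter = np.zeros_like(feats)
for src, dst in edges:
    agg_scatter[dst] += feats[src]

# Abstraction 2: matrix-centric, aggregation as an (Sp)MM: A^T @ X,
# with A[src, dst] = 1 for each edge.
A = np.zeros((3, 3))
for src, dst in edges:
    A[src, dst] = 1.0
agg_spmm = A.T @ feats

# Identical results; different data organization and propagation order.
assert np.allclose(agg_scatter, agg_spmm)
```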
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 125–128.
Citations: 0
A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-10-13 · DOI: 10.1109/LCA.2023.3323482
Hailong Li;Jaewan Choi;Yongsuk Kwon;Jung Ho Ahn
Transformer-based models have become the backbone of numerous state-of-the-art natural language processing (NLP) tasks, including large language models. Matrix multiplication, a fundamental operation in Transformer-based models, accounts for most of the execution time. While singular value decomposition (SVD) can accelerate this operation by reducing the amount of computation and the memory footprint through rank reduction, it degrades model quality because important information is hard to preserve. Moreover, this method does not effectively utilize the resources of modern GPUs. In this paper, we propose a hardware-friendly approach: matrix multiplication based on tiled singular value decomposition (TSVD). TSVD divides a matrix into multiple tiles and performs matrix factorization on each tile using SVD. By breaking the process down into smaller regions, TSVD mitigates the loss of important data. We apply the matrices decomposed by TSVD to matrix multiplication, and our TSVD-based matrix multiplication (TSVD-matmul) leverages GPU resources more efficiently than the SVD approach. As a result, TSVD-matmul achieves a speedup of 1.03× to 3.24× over the SVD approach at compression ratios ranging from 2 to 8. When deployed to GPT-2, TSVD not only performs competitively with full fine-tuning on the E2E NLG task but also achieves a speedup of 1.06× to 1.24× at compression ratios of 2 to 8 while improving accuracy by up to 1.5 BLEU score.
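The tile-then-factorize idea can be sketched in a few lines of numpy (tile size, rank, and function names are illustrative assumptions, not the paper's kernels): each tile gets its own truncated SVD, and the product is assembled tile by tile from the factors.

```python
import numpy as np

# Sketch of tiled-SVD (TSVD) matrix multiplication: factorize each tile
# of a weight matrix with a rank-r truncated SVD, then multiply tile by
# tile using the factors. Illustrative sketch, not the paper's kernels.

def tsvd_factors(W, tile, rank):
    """Per-tile truncated SVD factors: tile T ~= (U_r * S_r) @ Vt_r."""
    factors = {}
    for i in range(0, W.shape[0], tile):
        for j in range(0, W.shape[1], tile):
            U, S, Vt = np.linalg.svd(W[i:i + tile, j:j + tile],
                                     full_matrices=False)
            factors[(i, j)] = (U[:, :rank] * S[:rank], Vt[:rank])
    return factors

def tsvd_matmul(factors, x, m, tile):
    """Compute y ~= W @ x from the per-tile factors."""
    y = np.zeros(m)
    for (i, j), (US, Vt) in factors.items():
        y[i:i + tile] += US @ (Vt @ x[j:j + tile])
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
x = rng.standard_normal(8)

# With rank == tile the factorization is exact, so the product matches W @ x.
exact = tsvd_factors(W, tile=4, rank=4)
assert np.allclose(tsvd_matmul(exact, x, 8, 4), W @ x)
```

Per-tile storage is 2·tile·rank floats versus tile² for the dense tile, so compression requires rank < tile/2; the benefit of tiling is that each small region keeps its own most important singular directions instead of competing across the whole matrix.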
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 169–172.
Citations: 0
Inter-Temperature Bandwidth Reduction in Cryogenic QAOA Machines
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-10-09 · DOI: 10.1109/LCA.2023.3322700
Yosuke Ueno;Yuna Tomida;Teruo Tanimoto;Masamitsu Tanaka;Yutaka Tabuchi;Koji Inoue;Hiroshi Nakamura
The bandwidth limit between cryogenic and room-temperature environments is a critical bottleneck in superconducting noisy intermediate-scale quantum computers. This paper presents the first trial of algorithm-aware system-level optimization to address this issue, targeting the quantum approximate optimization algorithm (QAOA). Our counter-based cryogenic architecture using single-flux-quantum logic shows an exponential bandwidth reduction and decreases the heat inflow and peripheral power consumption of inter-temperature cables, which contributes to the scalability of superconducting quantum computers.
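The counter-based idea can be illustrated with a toy bandwidth comparison (all numbers and the readout format are illustrative assumptions, not the paper's figures): instead of streaming every measured bitstring out of the fridge, accumulate per-outcome counters in cryogenic logic and read out only the histogram.

```python
from collections import Counter

# Toy illustration of the inter-temperature bandwidth argument: per-shot
# readout costs shots * n_qubits bits, while a cryogenic histogram costs
# at most 2**n_qubits counters. Illustrative sketch only.

n_qubits, shots = 4, 1000
# Fake shot data standing in for QAOA measurement outcomes.
outcomes = [format(i % 7, f"0{n_qubits}b") for i in range(shots)]

raw_bits = shots * n_qubits                 # naive per-shot readout
hist = Counter(outcomes)                    # accumulated inside the fridge
# Histogram readout: one counter per possible outcome, each wide enough
# to hold the total shot count.
hist_bits = (2 ** n_qubits) * shots.bit_length()

assert hist_bits < raw_bits                 # 160 vs 4000 bits here
assert sum(hist.values()) == shots          # counts are fully preserved
```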
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 6–9.
Citations: 0
NoHammer: Preventing Row Hammer With Last-Level Cache Management
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-09-29 · DOI: 10.1109/LCA.2023.3320670
Seunghak Lee;Ki-Dong Kang;Gyeongseo Park;Nam Sung Kim;Daehoon Kim
Row Hammer (RH) is a circuit-level phenomenon in which repeated activation of a DRAM row causes bit flips in adjacent rows. Prior studies that rely on extra refreshes to mitigate RH vulnerability demonstrate that bit flips can be prevented effectively. However, implementation is challenging due to the significant performance degradation and energy overhead caused by the extra refreshes. To overcome these challenges, some studies propose techniques that mitigate RH attacks without relying on extra refreshes. These techniques include delaying the activation of an aggressor row for a certain amount of time, or swapping an aggressor row with another row to isolate it from victim rows. Although such techniques do not require extra refreshes, the activation-delaying technique can cause severe performance degradation in false-positive cases, and the swapping technique requires high storage overheads to track swap information. We propose NoHammer, an efficient RH mitigation technique that prevents the bit flips caused by RH attacks through Last-Level Cache (LLC) management. NoHammer temporarily extends the associativity of the targeted cache set by utilizing another cache set as an extended set, and keeps the cache lines of aggressor rows in the extended set under an eviction-based RH attack. Along with a modification of the LLC replacement policy, NoHammer ensures that the aggressor row's cache lines are not evicted from the LLC under an RH attack. In our evaluation, we demonstrate that NoHammer delivers 6% higher performance than a baseline without any RH mitigation technique, by replacing the excessive cache misses caused by RH attacks with LLC hits through sophisticated LLC management, while requiring 45% less storage than prior proposals.
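The core mechanism can be sketched in a toy cache model (the structures and pinning policy are illustrative simplifications of the letter's design): under an eviction-based attack, the aggressor row's line is held in a borrowed "extended" set, so repeated accesses hit in the LLC instead of re-activating the DRAM row.

```python
# Toy sketch of the NoHammer idea: pin suspected-aggressor cache lines in
# an extended set exempt from eviction, so eviction pressure on the
# targeted set cannot force repeated DRAM row activations.

class TinyLLC:
    def __init__(self, ways=2):
        self.ways = ways
        self.lines = []        # the targeted set, in LRU order
        self.extended = []     # borrowed set holding suspected aggressors
        self.dram_activations = 0

    def access(self, line, suspected_aggressor=False):
        if line in self.lines or line in self.extended:
            return "hit"
        self.dram_activations += 1        # LLC miss -> DRAM row activation
        if suspected_aggressor:
            self.extended.append(line)    # pinned: exempt from eviction
        else:
            if len(self.lines) == self.ways:
                self.lines.pop(0)         # evict LRU line
            self.lines.append(line)
        return "miss"

llc = TinyLLC()
llc.access("AGGRESSOR", suspected_aggressor=True)  # one DRAM activation
for _ in range(100):                               # eviction-based attack
    for other in ("A", "B", "C"):                  # thrash the targeted set
        llc.access(other)
    # the pinned aggressor line keeps hitting, so no new activations:
    assert llc.access("AGGRESSOR", suspected_aggressor=True) == "hit"

assert llc.dram_activations == 301  # 1 aggressor activation + 300 thrash misses
```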
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 157–160.
Citations: 0
Hungarian Qubit Assignment for Optimized Mapping of Quantum Circuits on Multi-Core Architectures
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-09-25 · DOI: 10.1109/LCA.2023.3318857
Pau Escofet;Anabel Ovide;Carmen G. Almudever;Eduard Alarcón;Sergi Abadal
Modular quantum computing architectures offer a promising alternative to monolithic designs for overcoming the scaling limitations of current quantum computers. To achieve scalability beyond small prototypes, quantum architectures are expected to adopt a modular approach, featuring clusters of tightly connected quantum bits with sparser connections between these clusters. Efficiently distributing qubits across multiple processing cores is critical to improving the performance and scalability of quantum computing systems. To address this challenge, we propose the Hungarian Qubit Assignment (HQA) algorithm, which leverages the Hungarian algorithm to improve qubit-to-core assignment. HQA considers the interactions between qubits over the entire circuit, enabling fine-grained partitioning and enhanced qubit utilization. We compare HQA with state-of-the-art alternatives through comprehensive experiments using both real-world quantum algorithms and random quantum circuits. The results demonstrate the superiority of our proposed approach, which outperforms existing methods with an average improvement of 1.28×.
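The underlying optimization is a linear assignment problem; a toy version, with a made-up cost matrix, is shown below. HQA solves it with the Hungarian algorithm in polynomial time; the brute-force search over permutations here returns the same optimum for a matrix this small.

```python
from itertools import permutations

# Toy qubit-to-core assignment: choose the placement of qubits into core
# slots that minimizes total communication cost. Cost values are made up.

cost = [            # cost[q][s]: cost of placing qubit q in core slot s
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]

def best_assignment(cost):
    """Exhaustive stand-in for the Hungarian algorithm (O(n!) vs O(n^3))."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[q][p[q]] for q in range(n)))
    return list(best), sum(cost[q][s] for q, s in enumerate(best))

assignment, total = best_assignment(cost)
assert assignment == [1, 0, 2] and total == 5   # qubit 0 -> slot 1, etc.
```

For real instances, a polynomial-time solver (e.g., a proper Hungarian implementation) replaces the permutation search; the objective is unchanged.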
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 161–164.
Citations: 1
Hardware-Assisted Code-Pointer Tagging for Forward-Edge Control-Flow Integrity
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-09-22 · DOI: 10.1109/LCA.2023.3306326
Yonghae Kim;Anurag Kar;Jaewon Lee;Jaekyu Lee;Hyesoon Kim
Software attacks typically operate by overwriting control data, such as return addresses and function pointers, and hijacking the control flow of a program. To prevent such attacks, a number of control-flow integrity (CFI) solutions have been proposed. Nevertheless, most prior work struggles to serve two ends at once: performance and security. In particular, protecting forward edges, i.e., indirect calls, remains challenging without trading one off for the other. In this work, we propose Code-Pointer Tagging (CPT), a novel dynamic CFI solution combined with cryptographic protection. Our key observation is that a pointer's message authentication code (MAC) can be associated with the pointer's CFI label used for CFI checks. We find that this approach not only enables space-efficient control-flow graph (CFG) storage but also achieves highly efficient CFI checks performed alongside implicit pointer authentication. To enable CPT, we implement lightweight compiler and hardware support. We prototype our design on an FPGA-accelerated RISC-V hardware simulation platform and conduct full-system-level evaluations. Our results show that CPT incurs a 1.2% average slowdown on the SPEC CPU C/C++ benchmarks while providing effective layered hardening of forward-edge CFI.
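The MAC-plus-label idea can be sketched in software (key, 16-bit tag width, bit layout, and HMAC-SHA256 are illustrative stand-ins; the hardware design uses a PAC-style MAC, not a software HMAC): a short MAC over the pointer value and its CFI label is packed into the unused upper bits, and both are checked at the indirect call site.

```python
import hashlib
import hmac

# Toy model of Code-Pointer Tagging (CPT): tag a 48-bit pointer with a
# 16-bit MAC over (pointer, CFI label); verify before the indirect call.
# Illustrative sketch only; parameters are assumptions.

KEY = b"per-process secret key"
MASK48 = (1 << 48) - 1          # lower 48 bits hold the virtual address

def ptr_mac(ptr, cfi_label):
    msg = ptr.to_bytes(6, "little") + cfi_label.to_bytes(2, "little")
    tag = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(tag[:2], "little")   # truncate MAC to 16 bits

def sign(ptr, cfi_label):
    """Creation time: pack the MAC into the unused upper pointer bits."""
    return (ptr_mac(ptr, cfi_label) << 48) | (ptr & MASK48)

def authenticate(tagged, cfi_label):
    """Call site: strip the tag; raise on a pointer/label mismatch."""
    ptr, tag = tagged & MASK48, tagged >> 48
    if tag != ptr_mac(ptr, cfi_label):
        raise RuntimeError("CFI violation: bad code-pointer tag")
    return ptr

LABEL = 7                        # hypothetical CFI label for one call site
fp = sign(0x42424242, LABEL)
assert authenticate(fp, LABEL) == 0x42424242
```

Authenticating with the wrong CFI label, or with a corrupted pointer, makes the recomputed MAC mismatch the stored tag (up to the 2^-16 collision probability of the truncated tag), so the hijacked indirect call is caught before it transfers control.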
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 117–120.
Citations: 0
Fast Performance Prediction for Efficient Distributed DNN Training
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-09-18 · DOI: 10.1109/LCA.2023.3316452
Yugyoung Yun;Eunhyeok Park
Training large-scale DNN models requires parallel distributed training using hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the possible combinations of schemes become enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically unfeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, promoting efficient and fast searches for the ideal parallel plan, even when resources are limited. Significantly, this work is pioneering in explaining the expensive nature of searching for an optimal plan and addressing it using intuitive performance estimations based on real device evaluations. Our experiments demonstrate the effectiveness of the MPE, revealing that it accelerates the optimization process up to 126x faster (36.4 on average) than the existing state-of-the-art baseline, Alpa.
{"title":"Fast Performance Prediction for Efficient Distributed DNN Training","authors":"Yugyoung Yun;Eunhyeok Park","doi":"10.1109/LCA.2023.3316452","DOIUrl":"https://doi.org/10.1109/LCA.2023.3316452","url":null,"abstract":"Training large-scale DNN models requires parallel distributed training using hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the possible combinations of schemes become enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically unfeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, promoting efficient and fast searches for the ideal parallel plan, even when resources are limited. Significantly, this work is pioneering in explaining the expensive nature of searching for an optimal plan and addressing it using intuitive performance estimations based on real device evaluations. Our experiments demonstrate the effectiveness of the MPE, revealing that it accelerates the optimization process up to 126x faster (36.4 on average) than the existing state-of-the-art baseline, Alpa.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"133-136"},"PeriodicalIF":2.3,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
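The MPE abstract above describes a cost model that makes estimating the throughput of a parallel plan cheap enough to search many plans. A common way such models stay tractable — and a plausible reading of "Markovian" — is to assume the cost of a plan decomposes into per-layer terms plus transition terms that depend only on the schemes of adjacent layers. The sketch below is a hypothetical toy, not the MPE or Alpa implementation; the layer names, schemes ("DP"/"TP"), and latencies are invented numbers.

```python
from itertools import product

# Measured per-layer latency (ms) under each parallelization scheme.
# These values are illustrative, not real device measurements.
compute_cost = {
    ("layer0", "DP"): 4.0, ("layer0", "TP"): 3.0,
    ("layer1", "DP"): 5.0, ("layer1", "TP"): 2.5,
    ("layer2", "DP"): 4.5, ("layer2", "TP"): 3.5,
}

# Resharding/communication cost between consecutive layers depends only on
# the pair of adjacent schemes -- the Markovian assumption.
transition_cost = {("DP", "DP"): 0.0, ("DP", "TP"): 1.5,
                   ("TP", "DP"): 1.5, ("TP", "TP"): 0.0}

LAYERS = ["layer0", "layer1", "layer2"]

def estimate(plan):
    """Estimated per-iteration time of a plan (one scheme per layer)."""
    total = sum(compute_cost[(layer, s)] for layer, s in zip(LAYERS, plan))
    total += sum(transition_cost[(a, b)] for a, b in zip(plan, plan[1:]))
    return total

# With a decomposable cost, every candidate plan is scored without running
# it on hardware, so exhaustive (or dynamic-programming) search is cheap.
best = min(product(["DP", "TP"], repeat=len(LAYERS)), key=estimate)
print(best, estimate(best))  # ('TP', 'TP', 'TP') 9.0
```

Because each term touches at most two adjacent layers, the optimal plan can also be found by dynamic programming in time linear in the number of layers rather than by the exhaustive enumeration shown here, which is what makes this family of models fast enough to drive a plan search.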