
Latest Publications in IEEE Computer Architecture Letters

Balancing Performance Against Cost and Sustainability in Multi-Chip-Module GPUs
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-09-08 · DOI: 10.1109/LCA.2023.3313203 · Vol. 22, No. 2, pp. 145-148
Shiqing Zhang;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout
MCM-GPUs scale performance by integrating multiple chiplets within the same package. How to partition the aggregate compute resources across chiplets poses a fundamental trade-off between performance on one hand and cost and sustainability on the other. We propose the Performance Per Wafer (PPW) metric to explore this trade-off, and we find that while performance is maximized with a few large chiplets and cost and environmental footprint are minimized with many small chiplets, the optimum balance is achieved with a moderate number of medium-sized chiplets. The optimum number of chiplets depends on the workload and increases with increased inter-chiplet bandwidth.
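A minimal sketch of how a wafer-level trade-off of this kind can be reasoned about. The wafer area, defect density, yield model, and bandwidth penalty below are illustrative assumptions, not the parameters or the exact PPW definition used in the letter.

```python
WAFER_AREA = 70000.0    # mm^2, roughly a 300 mm wafer (illustrative assumption)
DEFECT_DENSITY = 0.001  # defects per mm^2 (illustrative assumption)
GPU_SILICON = 800.0     # mm^2 of compute silicon per GPU package (assumption)

def chiplet_yield(area_mm2, d0=DEFECT_DENSITY, alpha=3.0):
    """Negative-binomial yield model: smaller dies yield better."""
    return (1.0 + area_mm2 * d0 / alpha) ** -alpha

def relative_performance(n_chiplets, penalty=0.05):
    """Toy scaling model: each extra inter-chiplet boundary costs some
    performance due to limited inter-chiplet bandwidth (assumption)."""
    return 1.0 / (1.0 + penalty * (n_chiplets - 1))

def performance_per_wafer(n_chiplets):
    area = GPU_SILICON / n_chiplets
    good_chiplets = (WAFER_AREA / area) * chiplet_yield(area)
    gpus_per_wafer = good_chiplets / n_chiplets
    return gpus_per_wafer * relative_performance(n_chiplets)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} chiplets: PPW = {performance_per_wafer(n):6.1f}")
```

With these assumed numbers the toy model peaks at a moderate chiplet count, mirroring the qualitative conclusion of the abstract.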
Citations: 0
LV: Latency-Versatile Floating-Point Engine for High-Performance Deep Neural Networks
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-25 · DOI: 10.1109/LCA.2023.3287096 · Vol. 22, No. 2, pp. 125-128
Yun-Chen Lo;Yu-Chih Tsai;Ren-Shuo Liu
Computing latency is an important system metric for Deep Neural Network (DNN) accelerators. To reduce latency, this work proposes LV, a latency-versatile floating-point engine (FP-PE) with two key contributions: 1) an approximate bit-versatile multiplier-and-accumulate (BV-MAC) unit with an early shifter, and 2) an on-demand fixed-point-to-floating-point conversion (FXP2FP) unit. Extensive experimental results show that LV achieves up to 2.12× and 1.3× speedups over a baseline FP-PE and a redundancy-aware FP-PE, respectively, in TSMC 40-nm technology, while achieving comparable accuracy on ImageNet classification tasks.
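For context, a sketch of the kind of on-demand fixed-point-to-floating-point conversion an FXP2FP unit performs: locate the leading one of the accumulator, derive the exponent, and normalize the mantissa. The word widths and truncation-based rounding are illustrative assumptions, not the paper's hardware design.

```python
import struct

def fxp_to_fp32_fields(acc: int, frac_bits: int = 16):
    """Convert a signed fixed-point accumulator (value = acc / 2**frac_bits)
    into IEEE-754 single-precision fields (sign, biased exponent, mantissa).
    Rounding is plain truncation here -- an assumption made for brevity."""
    if acc == 0:
        return 0, 0, 0
    sign = 1 if acc < 0 else 0
    mag = abs(acc)
    msb = mag.bit_length() - 1          # position of the leading one
    exponent = msb - frac_bits          # unbiased exponent of the value
    if msb >= 23:                       # keep 23 bits below the implicit one
        mantissa = (mag >> (msb - 23)) & ((1 << 23) - 1)
    else:
        mantissa = (mag << (23 - msb)) & ((1 << 23) - 1)
    return sign, exponent + 127, mantissa

# Example: 3.5 in Q16 fixed point is 3.5 * 2**16 = 229376.
s, e, m = fxp_to_fp32_fields(229376)
bits = (s << 31) | (e << 23) | m
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])  # -> 3.5
```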
Citations: 0
A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-16 · DOI: 10.1109/LCA.2023.3305668 · Vol. 22, No. 2, pp. 165-168
Lingfei Lu;Yudi Qiu;Shiyan Yi;Yibo Fan
Personalized recommendation systems (RS) are widely used in industry and occupy much of the compute time in AI computing centers. A critical component of RS is the embedding layer, which consists of sparse embedding lookups and is memory-bound. Recent works have proposed near-memory processing (NMP) architectures that exploit high internal memory bandwidth to speed up embedding lookups. These NMP works divide embedding vectors either horizontally or vertically. However, the effectiveness of horizontal or vertical partitioning is hard to guarantee under different memory configurations or embedding vector sizes. To address this issue, we propose FeaNMP, a flexible embedding-aware NMP architecture that accelerates the inference phase of RS. We explore different partitioning strategies in detail and design a flexible way to select the optimal one depending on the embedding dimension and DDR configuration. As a result, compared to the state-of-the-art rank-level NMP work RecNMP, our work achieves up to 11.1× speedup for embedding layers under mixed-dimension workloads.
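A brief sketch of the two partitioning schemes being contrasted: splitting an embedding table across memory banks by rows (horizontal) or by vector dimensions (vertical). The table size and bank count are illustrative assumptions.

```python
import numpy as np

NUM_BANKS = 4
table = np.arange(1024 * 64, dtype=np.float32).reshape(1024, 64)  # 1024 vectors, dim 64

# Horizontal partitioning: each bank holds a contiguous block of whole vectors,
# so one lookup touches exactly one bank.
row_shards = np.array_split(table, NUM_BANKS, axis=0)

# Vertical partitioning: each bank holds a slice of every vector's dimensions,
# so one lookup touches all banks, each returning a fragment of the vector.
col_shards = np.array_split(table, NUM_BANKS, axis=1)

def lookup_horizontal(idx):
    rows_per_bank = table.shape[0] // NUM_BANKS
    bank, offset = divmod(idx, rows_per_bank)
    return row_shards[bank][offset]

def lookup_vertical(idx):
    return np.concatenate([shard[idx] for shard in col_shards])

assert np.array_equal(lookup_horizontal(123), table[123])
assert np.array_equal(lookup_vertical(123), table[123])
```

Which layout wins depends on how the vector width maps onto DRAM burst size and how evenly the lookups spread across banks, which is why a per-dimension, per-DDR-configuration choice can pay off.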
Citations: 0
Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-15 · DOI: 10.1109/LCA.2023.3305386 · Vol. 22, No. 2, pp. 113-116
Jaewan Choi;Jaehyun Park;Kwanhee Kyung;Nam Sung Kim;Jung Ho Ahn
Transformer-based generative models, such as GPT, summarize an input sequence by generating key/value (KV) matrices through attention and generate the corresponding output sequence by utilizing these matrices once per token of the sequence. Both input and output sequences tend to get longer, which improves the understanding of contexts and conversation quality. These models are also typically batched for inference to improve the serving throughput. All these trends enable the models' weights to be reused effectively, increasing the relative importance of sequence generation, especially in processing KV matrices through attention. We identify that conventional computing platforms (e.g., GPUs) are not efficient at handling this attention part of inference: each request generates different KV matrices, the computation has a low operations-per-byte ratio regardless of the batch size, and the aggregate size of the KV matrices can even surpass that of the entire model weights. This motivates us to propose AttAcc, which exploits the fact that the KV matrices are written once during summarization but used many times (proportional to the output sequence length), each time multiplied by the embedding vector corresponding to an output token. The volume of data entering/leaving AttAcc can be orders of magnitude smaller than what must be read internally for attention. We design AttAcc with multiple processing-in-memory devices, each multiplying the embedding vector with the portion of the KV matrices held within the device, saving external (inter-device) bandwidth and energy consumption.
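A back-of-the-envelope sketch of why the generation-phase attention has a low operations-per-byte ratio regardless of batch size: the KV matrices are private to each request, so the bytes streamed grow with the batch exactly as fast as the FLOPs do. The model dimensions and data type below are illustrative assumptions, not a specific model.

```python
# Rough operations-per-byte estimate for one layer's attention during the
# generation phase (one new token per request).
D_MODEL = 4096       # hidden size (assumption)
SEQ_LEN = 2048       # tokens already summarized into the KV matrices (assumption)
ELEM_BYTES = 2       # fp16

def attention_flops_per_byte(batch_size):
    # Per request: q @ K^T and softmax(scores) @ V, each about 2 * SEQ_LEN * D_MODEL FLOPs.
    flops = batch_size * 2 * (2 * SEQ_LEN * D_MODEL)
    # Per request: all of K and V must be streamed in, because the KV matrices
    # are private to the request and each element is reused only once.
    bytes_moved = batch_size * 2 * (SEQ_LEN * D_MODEL * ELEM_BYTES)
    return flops / bytes_moved

for b in (1, 8, 64):
    print(f"batch {b:3d}: {attention_flops_per_byte(b):.1f} FLOPs/byte")
# Stays at ~1 FLOP/byte no matter how large the batch is, because batching does
# not let requests share KV data the way they share model weights.
```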
Citations: 0
Characterizing and Understanding Defense Methods for GNNs on GPUs
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-15 · DOI: 10.1109/LCA.2023.3304638 · Vol. 22, No. 2, pp. 137-140
Meng Wu;Mingyu Yan;Xiaocheng Yang;Wenming Li;Zhimin Zhang;Xiaochun Ye;Dongrui Fan
Graph neural networks (GNNs) are widely deployed in many vital fields but suffer from adversarial attacks, which seriously compromise security in these fields. Many defense methods have been proposed to mitigate the impact of these attacks; however, they introduce extra, time-consuming stages into the execution of GNNs. These extra stages need to be accelerated because end-to-end acceleration is essential for fast development and deployment of GNNs. To disclose the performance bottlenecks, execution patterns, execution semantics, and overheads of the defense methods for GNNs, we characterize and explore these extra stages on GPUs. Based on this characterization and exploration, we provide several useful guidelines for both software and hardware optimizations to accelerate the defense methods for GNNs.
Citations: 0
By-Software Branch Prediction in Loops
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-11 · DOI: 10.1109/LCA.2023.3304613 · Vol. 22, No. 2, pp. 129-132
Maziar Goudarzi;Reza Azimi;Julian Humecki;Faizaan Rehman;Richard Zhang;Chirag Sethi;Tanishq Bomman;Yuqi Yang
Load-Dependent Branches (LDBs) often do not exhibit regular patterns in their local or global history and are thus inherently hard for conventional branch predictors to predict correctly. We propose a software-to-hardware branch pre-resolution mechanism that allows software to pass branch outcomes to the processor frontend ahead of fetching the branch instruction. A compiler pass identifies the instruction chain leading to the branch (the branch backslice) and generates pre-execute code that produces the branch outcomes before the frontend observes them. The loop structure helps to unambiguously map the branch outcomes to their corresponding dynamic instances of the branch instruction. Our approach also allows the loop iteration space to be covered selectively, with arbitrarily complex patterns. Our pre-execution method enables important optimizations such as unrolling and vectorization, which substantially reduce the pre-execution overhead. Experimental results on select workloads from SPEC CPU 2017 and graph analytics workloads show up to a 95% reduction in MPKI (21% on average), up to 39% speedup (7% on average), and a 23% IPC gain on average, compared to a core with a TAGE-SC-L 64 KB branch predictor.
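A minimal sketch of the software-side transformation being described: the backslice of a load-dependent branch is pre-executed over a window of loop iterations, and the resulting outcome bits are made available before the loop body runs. The Python below only emulates the restructuring; in the proposed mechanism the outcomes are passed to the processor frontend rather than stored in a list, and the workloads in the paths are placeholders.

```python
# Original loop: the branch on data[idx[i]] depends on a load and is
# essentially unpredictable from branch history alone.
def original(data, idx, threshold):
    total = 0
    for i in range(len(idx)):
        if data[idx[i]] > threshold:          # load-dependent branch
            total += 2 * data[idx[i]]         # "taken" path (placeholder work)
        else:
            total += 1                        # "not taken" path (placeholder work)
    return total

# Pre-resolution: the extracted backslice (the loads and the compare) is run
# ahead of a window of iterations, producing one outcome bit per iteration.
def with_preresolution(data, idx, threshold, window=64):
    total = 0
    for start in range(0, len(idx), window):
        chunk = idx[start:start + window]
        outcomes = [data[j] > threshold for j in chunk]   # pre-executed backslice
        for j, taken in zip(chunk, outcomes):
            total += 2 * data[j] if taken else 1
    return total

data = [3, 7, 1, 9, 4, 6]
idx = [5, 0, 2, 1, 3, 3, 4]
assert original(data, idx, 4) == with_preresolution(data, idx, 4)
```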
Citations: 0
Simulating Our Way to Safer Software: A Tale of Integrating Microarchitecture Simulation and Leakage Estimation Modeling
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-10 · DOI: 10.1109/LCA.2023.3303913 · Vol. 22, No. 2, pp. 109-112
Justin Feng;Fatemeh Arkannezhad;Christopher Ryu;Enoch Huang;Siddhant Gupta;Nader Sehatbakhsh
An important step in protecting software against side-channel vulnerabilities is to rigorously evaluate it on the target hardware using standard leakage tests. Recently, leakage estimation tools have received considerable attention as a way to improve this time-consuming process. Despite their advancements, existing tools often neglect the impact of the microarchitecture and its underlying events in their leakage models, which leads to inaccuracies. This paper takes the first step toward addressing these issues by integrating a physical side-channel leakage estimation tool into a microarchitectural simulator. To achieve this, we first systematically explore the impact of various architectural and microarchitectural activities, and their underlying interactions, on the produced physical side-channel signals, and integrate that into the microarchitecture model. Second, to create a comprehensive leakage estimation report, we leverage taint tracking and symbolic execution to accurately analyze different paths and inputs. The final outcome of this work is a tool that takes a binary and generates a leakage report covering architecture- and microarchitecture-related leakage for both data-dependent and path-dependent information leakage scenarios.
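For context, a common building block in leakage estimation is a simple power model applied to simulated state transitions, such as the Hamming distance between successive register or bus values. The sketch below shows that idea in isolation; it is a generic illustration under stated assumptions, not the integrated tool described in the paper.

```python
def hamming_weight(x: int) -> int:
    return bin(x).count("1")

def leakage_trace(register_writes):
    """Given the sequence of values written to one register (as reported by a
    microarchitecture simulator), emit a per-write leakage estimate using a
    Hamming-distance power model -- a simplifying assumption."""
    trace, prev = [], 0
    for value in register_writes:
        trace.append(hamming_weight(prev ^ value))
        prev = value
    return trace

# Two runs of the same code with different secret inputs: any point where the
# traces differ is flagged as data-dependent leakage.
run_a = leakage_trace([0x00, 0x3A, 0xFF, 0x10])
run_b = leakage_trace([0x00, 0x3A, 0x0F, 0x10])
leaky_points = [i for i, (x, y) in enumerate(zip(run_a, run_b)) if x != y]
print(leaky_points)  # -> [3]
```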
Citations: 0
SoCurity: A Design Approach for Enhancing SoC Security
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-03 · DOI: 10.1109/LCA.2023.3301448 · Vol. 22, No. 2, pp. 105-108
Naorin Hossain;Alper Buyuktosunoglu;John-David Wellman;Pradip Bose;Margaret Martonosi
We propose SoCurity, the first NoC counter-based hardware monitoring approach for enhancing heterogeneous SoC security. With SoCurity, we develop a fast, lightweight anomalous activity detection system leveraging semi-supervised machine learning models that require no prior attack knowledge for detecting anomalies. We demonstrate our techniques with a case study on a real SoC for a connected autonomous vehicle system and find up to 96% detection accuracy.
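As a rough illustration of the class of detector described (semi-supervised, trained only on benign behavior), the sketch below fits an Isolation Forest to vectors of NoC traffic counters and flags outliers at run time. The counter features, window scheme, and model choice are assumptions for illustration, not SoCurity's actual feature set or classifier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Each row: NoC counters sampled over one window, e.g. [flits injected,
# average hop count, buffer occupancy] -- the feature choice is an assumption.
benign = rng.normal(loc=[1000.0, 2.0, 0.30], scale=[50.0, 0.10, 0.05], size=(500, 3))

# Semi-supervised: the detector is fit on benign traffic only, so no prior
# knowledge of attacks is required.
detector = IsolationForest(contamination=0.01, random_state=0).fit(benign)

suspicious = np.array([[5000.0, 6.0, 0.95]])   # e.g. flooding-style behavior
normal = np.array([[1010.0, 2.05, 0.31]])
print(detector.predict(suspicious))  # [-1] -> flagged as anomalous
print(detector.predict(normal))      # [ 1] -> considered benign
```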
Citations: 0
Smart Memory: Deep Learning Acceleration in 3D-Stacked Memories
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-08-01 · DOI: 10.1109/LCA.2023.3287976 · Vol. 23, No. 1, pp. 137-141
Seyyed Hossein SeyyedAghaei Rezaei;Parham Zilouchian Moghaddam;Mehdi Modarressi
Processing-in-memory (PIM) is the most promising paradigm to address the bandwidth bottleneck in deep neural network (DNN) accelerators. However, the algorithmic and dataflow structure of DNNs still necessitates moving a large amount of data across banks inside the memory device to bring input data and their corresponding model parameters together, negatively shifting part of the bandwidth bottleneck to the in-memory data communication infrastructure. To alleviate this bottleneck, we present Smart Memory, a highly parallel in-memory DNN accelerator for 3D memories that benefits from a scalable high-bandwidth in-memory network. Whereas the existing PIM designs implement the compute units and network-on-chip on the logic die of the underlying 3D memory, in Smart Memory the computation and data transmission tasks are distributed across the memory banks. To this end, each memory bank is equipped with (1) a very simple processing unit to run neural networks, and (2) a circuit-switched router to interconnect memory banks by a 3D network-on-memory. Our evaluation shows 44% average performance improvement over state-of-the-art in-memory DNN accelerators.
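A schematic sketch of the bank-level distribution being described: each bank multiplies its slice of a layer's weight matrix by the activation vector locally, and only small partial results travel over the inter-bank network. The matrix sizes, slicing, and reduction scheme are illustrative assumptions, not the paper's dataflow.

```python
import numpy as np

NUM_BANKS = 8
IN_DIM, OUT_DIM = 512, 256

rng = np.random.default_rng(1)
weights = rng.standard_normal((OUT_DIM, IN_DIM)).astype(np.float32)
activations = rng.standard_normal(IN_DIM).astype(np.float32)

# Each bank stores a horizontal slice of the layer's weight matrix and computes
# its share of the output next to the data, instead of shipping the weights to
# a central compute die.
bank_weights = np.array_split(weights, NUM_BANKS, axis=0)

def bank_compute(bank_id):
    return bank_weights[bank_id] @ activations   # runs inside bank `bank_id`

# Only the small per-bank partial outputs cross the inter-bank network to be
# concatenated into the full layer output.
output = np.concatenate([bank_compute(b) for b in range(NUM_BANKS)])
assert np.allclose(output, weights @ activations, atol=1e-4)
```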
Citations: 0
The Mirage of Breaking MIRAGE: Analyzing the Modeling Pitfalls in Emerging “Attacks” on MIRAGE
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2023-07-21 · DOI: 10.1109/LCA.2023.3297875 · Vol. 22, No. 2, pp. 121-124
Gururaj Saileshwar;Moinuddin Qureshi
This letter studies common modeling pitfalls in security analyses of hardware defenses to highlight the importance of accurately reproducing the defense under study. We provide a case study of MIRAGE (Saileshwar and Qureshi 2021), a defense against cache side-channel attacks, and analyze its incorrect modeling in a recent work (Chakraborty et al., 2023) that claimed to break its security. We highlight several modeling pitfalls that can invalidate the security properties of any defense, including a) incomplete modeling of components critical for security, b) use of random number generators that are insufficiently random, and c) initialization of the system to improbable states, each of which can lead to an incorrect conclusion of a vulnerability. We show how these modeling bugs cause set conflicts to be incorrectly observed in the recent work's (Chakraborty et al., 2023) model of MIRAGE. We also provide an implementation that addresses these bugs and does not incur set conflicts, highlighting that MIRAGE remains unbroken.
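To illustrate pitfall (b) in isolation, the sketch below fills load-balanced (power-of-two-choices) sets once using a proper PRNG and once using a highly correlated "random" source; only the latter produces overflowing sets. This is a generic balls-into-bins illustration under assumed parameters, not a faithful model of MIRAGE or of the disputed analysis.

```python
import random

NUM_SETS = 512                 # sets per skew (assumption)
TAGS_PER_SET = 12              # tag entries per set (assumption)
DATA_CAPACITY = NUM_SETS * 8   # global data-store entries, i.e. 8 lines/set on average

def run(pick_two_sets, installs=100_000, seed=0):
    evict_rng = random.Random(seed)   # global random eviction of data-store entries
    load = [0] * NUM_SETS
    resident = []                     # set index of every cached line
    conflicts = 0
    for i in range(installs):
        if len(resident) == DATA_CAPACITY:            # data store full: evict globally
            s = resident.pop(evict_rng.randrange(len(resident)))
            load[s] -= 1
        a, b = pick_two_sets(i)
        victim = a if load[a] <= load[b] else b       # install into the less-loaded set
        if load[victim] == TAGS_PER_SET:              # no free tag left: a set conflict
            conflicts += 1
            continue
        load[victim] += 1
        resident.append(victim)
    return conflicts

good = random.Random(42)
print("proper PRNG   :", run(lambda i: (good.randrange(NUM_SETS), good.randrange(NUM_SETS))))
print("correlated RNG:", run(lambda i: (i % 8, (i + 1) % 8)))   # low-entropy stand-in
# The first line reports (essentially) zero conflicts; the second overflows
# constantly because installs keep landing in the same handful of sets.
```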
Citations: 0