
IEEE Computer Architecture Letters: Latest Publications

A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-28 | DOI: 10.1109/LCA.2025.3546811
Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu
Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state of the art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, research efforts have applied the Transformer architecture to recommendation systems, and the Hierarchical Sequential Transduction Unit (HSTU), an encoder architecture, has been proposed to address these scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate that the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.
{"title":"A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit","authors":"Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu","doi":"10.1109/LCA.2025.3546811","DOIUrl":"https://doi.org/10.1109/LCA.2025.3546811","url":null,"abstract":"Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state-of-the-art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, there have been research efforts applying Transformer architecture to recommendation systems, and Hierarchical Sequential Transaction Unit (HSTU), an encoder architecture, has been proposed to address scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"85-88"},"PeriodicalIF":1.4,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
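The characterization's headline finding is that attention dominates HSTU inference. As a rough back-of-the-envelope illustration (not taken from the paper; the hidden size and sequence lengths below are assumptions), the sketch compares the quadratic attention term against the pointwise feed-forward term and shows how attention overtakes the rest of the block as user-interaction sequences grow.

```python
# Illustrative FLOP comparison: attention vs. pointwise feed-forward in one
# encoder block. All sizes are assumed for illustration, not from the paper.

def attention_flops(seq_len, d_model):
    # Q/K/V/output projections: 4 * seq_len * d_model^2 MACs -> 8 * ... FLOPs
    proj = 8 * seq_len * d_model * d_model
    # QK^T and attention*V: 2 * seq_len^2 * d_model MACs -> 4 * ... FLOPs
    attn = 4 * seq_len * seq_len * d_model
    return proj + attn

def pointwise_flops(seq_len, d_model, expansion=4):
    # Two dense layers of the feed-forward block.
    return 4 * seq_len * d_model * expansion * d_model

if __name__ == "__main__":
    d_model = 512  # assumed hidden size
    for seq_len in (512, 2048, 8192):
        a = attention_flops(seq_len, d_model)
        p = pointwise_flops(seq_len, d_model)
        print(f"seq_len={seq_len:5d}  attention={a/1e9:8.2f} GFLOP  "
              f"pointwise={p/1e9:8.2f} GFLOP  ratio={a/p:.2f}")
```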
Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-27 | DOI: 10.1109/LCA.2025.3544989
Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz
The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for the large matrix multiplications common in machine learning (ML) models, image processing, and other computation-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring knowledge of confidence scores, privileged access, or physical proximity. Our attack method can fully recover the sparsity of the weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.
{"title":"Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX","authors":"Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3544989","DOIUrl":"https://doi.org/10.1109/LCA.2025.3544989","url":null,"abstract":"The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computational-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
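As a loosely related illustration of the measurement side of such attacks, the sketch below shows a simple median-based distinguisher that labels a secret operand as sparse or dense from repeated latency samples. This is not the authors' AMX probe, and the latency distributions are entirely made up; it only illustrates the statistical step that value-dependent timing attacks rely on.

```python
# Toy distinguisher for a value-dependent timing channel. Synthetic timings;
# a real attack would collect them from the hardware probe, not reproduced here.
import random
import statistics

def classify(samples, threshold_ns):
    """Label a batch of latency samples as 'sparse' or 'dense'."""
    return "sparse" if statistics.median(samples) < threshold_ns else "dense"

if __name__ == "__main__":
    random.seed(0)
    # Assumed (made-up) latency distributions, in nanoseconds.
    sparse_probe = [random.gauss(500, 15) for _ in range(1000)]
    dense_probe = [random.gauss(540, 15) for _ in range(1000)]
    threshold = (statistics.median(sparse_probe) + statistics.median(dense_probe)) / 2
    print(classify(sparse_probe, threshold))  # expected: sparse
    print(classify(dense_probe, threshold))   # expected: dense
```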
A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-26 | DOI: 10.1109/LCA.2025.3545799
Omer Khan
Graph-based neural networks have seen tremendous adoption for complex predictive analytics on massive real-world graphs. Hardware-acceleration efforts have identified significant challenges in harnessing graph locality and in managing the workload imbalance caused by ultra-sparse, irregular matrix computations at massively parallel scale. State-of-the-art hardware accelerators rely on massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy-efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in graph-based matrix operators, enabling a low-latency data-access paradigm in hardware and robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher, delivering at the 1000-core scale an average 20× performance-per-watt advantage over a state-of-the-art NVIDIA GPU.
{"title":"A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks","authors":"Omer Khan","doi":"10.1109/LCA.2025.3545799","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545799","url":null,"abstract":"Graphs-based neural networks have seen tremendous adoption to perform complex predictive analytics on massive real-world graphs. The trend in hardware acceleration has identified significant challenges with harnessing graph locality and workload imbalance due to ultra-sparse and irregular matrix computations at a massively parallel scale. State-of-the-art hardware accelerators utilize massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in the graphs-based matrix operators to achieve a low-latency data access paradigm in hardware to achieve robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher at the 1000 cores scale to deliver an average 20× performance per watt advantage over state-of-the-art NVIDIA GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"73-76"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143698183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
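The letter's key idea is a graph-aware data prefetcher per core. The sketch below is a purely software model of the basumed basic behavior (an assumption for illustration, not the paper's hardware design): while vertex v is being aggregated, the prefetcher walks a few rows ahead in the CSR row-pointer array and issues the neighbor-list ranges that will be needed next.

```python
# Software model of a graph-aware prefetcher over a CSR graph (illustrative).

def csr_prefetch_trace(row_ptr, col_idx, lookahead=4):
    """Yield (vertex, neighbors, prefetch targets) as the core walks the graph."""
    n = len(row_ptr) - 1
    for v in range(n):
        neighbors = col_idx[row_ptr[v]:row_ptr[v + 1]]
        # Neighbor-list ranges the prefetcher would issue while v is processed.
        pf = [(u, row_ptr[u], row_ptr[u + 1])
              for u in range(v + 1, min(v + 1 + lookahead, n))]
        yield v, neighbors, pf

if __name__ == "__main__":
    # Tiny example graph in CSR form (assumed for illustration).
    row_ptr = [0, 2, 4, 5, 7]
    col_idx = [1, 3, 0, 2, 1, 0, 2]
    for v, nbrs, pf in csr_prefetch_trace(row_ptr, col_idx):
        print(f"vertex {v}: neighbors {nbrs}, prefetch rows for {[p[0] for p in pf]}")
```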
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-26 | DOI: 10.1109/LCA.2025.3541961
Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi
We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. Models struggle most noticeably on questions about memory systems and interconnection networks. Fine-tuning with QuArch improves small-model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are accessible at https://quarch.ai/.
{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture","authors":"Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi","doi":"10.1109/LCA.2025.3541961","DOIUrl":"https://doi.org/10.1109/LCA.2025.3541961","url":null,"abstract":"We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at <uri>https://quarch.ai/</uri>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"105-108"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
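For readers unfamiliar with this style of benchmark, the sketch below shows a minimal exact-match evaluation loop over question-answer pairs. The data format and the `ask_model` stub are assumptions for illustration, not the QuArch schema; see https://quarch.ai/ for the actual dataset and leaderboard.

```python
# Minimal exact-match accuracy loop over (question, answer) pairs (illustrative).

def ask_model(question: str) -> str:
    # Placeholder "model": a real evaluation would call an LLM here.
    return "B"

def exact_match_accuracy(qa_pairs):
    correct = sum(1 for q, gold in qa_pairs
                  if ask_model(q).strip().lower() == gold.strip().lower())
    return correct / len(qa_pairs)

if __name__ == "__main__":
    qa_pairs = [  # made-up examples in the spirit of the dataset
        ("Which cache write policy updates memory only on eviction? (A) write-through (B) write-back", "B"),
        ("Does a TLB cache virtual-to-physical translations? (A) yes (B) no", "A"),
    ]
    print(f"accuracy = {exact_match_accuracy(qa_pairs):.2f}")
```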
A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-24 | DOI: 10.1109/LCA.2025.3545145
Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong
Precision-scalable convolutional neural networks (CNNs) offer a promising way to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the small, fine-grained multiplications required by precision-scalable (PS) networks have seen limited exploration on FPGA platforms. Deploying PS accelerators faces two challenges: LUT-based multiply-accumulate (MAC) units fail to make full use of DSPs, while DSP-based MACs support only limited precision combinations and cannot utilize DSPs efficiently. This brief therefore proposes a DSP-based precision-scalable MAC with a hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluated on a mixed 4b/8b VGG16, the proposed accelerator achieves a 3.97× performance improvement over the 8b baseline with only a 0.37% accuracy degradation. Compared with state-of-the-art accelerators, it achieves 1.20×–2.69× higher DSP efficiency and 1.63×–6.34× higher LUT efficiency.
{"title":"A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator","authors":"Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3545145","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545145","url":null,"abstract":"Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSP, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSP. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluating on mixed 4 b/8b VGG16, compared with 8b baseline, the proposed accelerator achieves 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves 1.20 × −2.69× improvement in DSP efficiency and 1.63 × −6.34× improvement in LUT efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"65-68"},"PeriodicalIF":1.4,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
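The core idea behind a precision-scalable MAC is that wider products can be assembled from narrower partial products, so one datapath can serve several precision combinations. The sketch below illustrates that decomposition in plain Python; it is an illustration of the general principle only, not the paper's DSP mapping or hybrid dataflow.

```python
# An 8b x 8b product assembled from four 4b x 4b partial products, the bit-level
# identity that lets the same 4b multipliers serve either four 4b MACs or one 8b MAC.

def mul8_from_4bit_parts(a: int, b: int) -> int:
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # (16*a_hi + a_lo) * (16*b_hi + b_lo)
    return ((a_hi * b_hi) << 8) + ((a_hi * b_lo + a_lo * b_hi) << 4) + (a_lo * b_lo)

if __name__ == "__main__":
    for a, b in [(0x7F, 0x3C), (255, 255), (18, 201)]:
        assert mul8_from_4bit_parts(a, b) == a * b
    print("8b products reconstructed correctly from 4b partial products")
```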
Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-18 | DOI: 10.1109/LCA.2025.3540058
Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten
Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.
{"title":"Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference","authors":"Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten","doi":"10.1109/LCA.2025.3540058","DOIUrl":"https://doi.org/10.1109/LCA.2025.3540058","url":null,"abstract":"Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"49-52"},"PeriodicalIF":1.4,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
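To see why raising the stack count matters, the sketch below does the rough capacity and bandwidth arithmetic for an HBM memory system as stacks are added. The per-stack figures are generic HBM3-class assumptions, not numbers from the letter, and optical-link overheads are ignored.

```python
# Back-of-the-envelope scaling of an HBM memory system with stack count.
# Per-stack values are assumed HBM3-class figures, not from the paper.

STACK_CAPACITY_GB = 24   # assumed capacity per HBM stack
STACK_BW_GBPS = 819      # assumed peak bandwidth per HBM stack (GB/s)

def memory_system(num_stacks):
    return num_stacks * STACK_CAPACITY_GB, num_stacks * STACK_BW_GBPS

if __name__ == "__main__":
    for stacks in (6, 16, 32):  # e.g., on-package limit vs. optically attached modules
        cap, bw = memory_system(stacks)
        print(f"{stacks:2d} stacks -> {cap:5d} GB capacity, {bw/1000:5.1f} TB/s aggregate bandwidth")
```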
Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-13 | DOI: 10.1109/LCA.2025.3532682
Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park
Processing-in-Memory (PIM) is emerging as a promising next-generation hardware platform for addressing memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs' large footprint makes accelerating them on PIM challenging, as the available area is limited. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures suffer performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations, 1) scale cascading and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and incurs only a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.
{"title":"Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization","authors":"Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park","doi":"10.1109/LCA.2025.3532682","DOIUrl":"https://doi.org/10.1109/LCA.2025.3532682","url":null,"abstract":"Processing-in-Memory (PIM) is emerging as a promising next-generation hardware to address memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs’ large footprint poses significant challenges for accelerating them on PIM due to limited available space. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures experience performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations: 1) scale cascading, and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and achieves a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"53-56"},"PeriodicalIF":1.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10886951","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
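Group-wise weight quantization itself is simple to state: each group of (say) 128 weights shares one scale, weights are stored as 4-bit integers, and dequantization multiplies each group by its scale before the GEMV. The NumPy sketch below shows that per-group bookkeeping; a plain symmetric round-to-nearest scheme is assumed, and this is a functional illustration rather than the letter's PIM datapath.

```python
# Minimal group-wise 4-bit quantize/dequantize reference (illustrative only).
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1                       # symmetric signed range, e.g. [-7, 7]
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard against all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)               # one FP16 scale per group

def dequantize_groupwise(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1024).astype(np.float32)
    q, s = quantize_groupwise(w)
    w_hat = dequantize_groupwise(q, s)
    print("max abs error:", np.abs(w - w_hat).max())
```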
Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-06 | DOI: 10.1109/LCA.2025.3539371
Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller
Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. A comprehensive understanding of the improvements such modifications bring is, however, imperative for devising GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance-improvement potential of a series of architectural modifications. We find that aggregation's low locality degrades performance as thread-level parallelism increases, and that memory-access optimizations deliver significant gains which remain effective even when combined with software optimizations. Our analysis provides insights for hardware optimizations that can significantly improve GNN aggregation on GPUs.
{"title":"Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs","authors":"Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller","doi":"10.1109/LCA.2025.3539371","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539371","url":null,"abstract":"Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
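For context, the aggregation phase being characterized is essentially a sparse gather-and-reduce over neighbor feature rows. The sketch below shows that access pattern on a tiny CSR graph (graph and feature sizes are illustrative assumptions); the data-dependent indexing into the feature matrix is what makes the phase memory-bound and locality-poor.

```python
# Reference sum-aggregation over a CSR graph: gather neighbor feature rows, reduce.
import numpy as np

def aggregate_sum(row_ptr, col_idx, features):
    out = np.zeros_like(features)
    for v in range(len(row_ptr) - 1):
        nbrs = col_idx[row_ptr[v]:row_ptr[v + 1]]
        if len(nbrs):
            out[v] = features[nbrs].sum(axis=0)  # irregular gather + reduce
    return out

if __name__ == "__main__":
    row_ptr = np.array([0, 2, 4, 5, 7])
    col_idx = np.array([1, 3, 0, 2, 1, 0, 2])
    feats = np.arange(4 * 8, dtype=np.float32).reshape(4, 8)
    print(aggregate_sum(row_ptr, col_idx, feats))
```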
Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-02-06 | DOI: 10.1109/LCA.2025.3539282
Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh
Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (HWSFs) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called security helper chiplets. We study the cost implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.
{"title":"Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring","authors":"Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh","doi":"10.1109/LCA.2025.3539282","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539282","url":null,"abstract":"Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (<italic>HWSFs</i>) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called <italic>security helper chiplets</i>. We study the <italic>cost</i> implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"61-64"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3535470
Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.
{"title":"RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models","authors":"Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim","doi":"10.1109/LCA.2025.3535470","DOIUrl":"https://doi.org/10.1109/LCA.2025.3535470","url":null,"abstract":"The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce <monospace>RoPIM</monospace>, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. <monospace>RoPIM</monospace> achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, <monospace>RoPIM</monospace> proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that <monospace>RoPIM</monospace> achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
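The RoPE operation that RoPIM targets is itself compact: each even/odd pair of query or key features is rotated by a position-dependent angle. The NumPy sketch below is a functional reference of the standard formulation (base 10000, interleaved pairing), not the paper's PIM mapping or data layout.

```python
# Reference rotary positional embedding (RoPE) for a single vector.
import numpy as np

def rope(x, position, base=10000.0):
    """Apply RoPE to a 1-D vector `x` of even length at token index `position`."""
    d = x.shape[-1]
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)   # theta_i = base^(-2i/d)
    angles = position * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin         # rotate each (even, odd) pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

if __name__ == "__main__":
    q = np.arange(8, dtype=np.float32)
    print(rope(q, position=5))
```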