
Latest Publications from the 2021 IEEE 39th International Conference on Computer Design (ICCD)

An Ultra-efficient Look-up Table based Programmable Processing in Memory Architecture for Data Encryption
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00049
Purab Ranjan Sutradhar, K. Basu, Sai Manoj Pudukotai Dinakarrao, A. Ganguly
Processing in Memory (PIM), a non-von Neumann computing paradigm, has emerged as a faster and more efficient alternative to traditional computing devices for data-centric applications such as data encryption. In this work, we present a novel PIM architecture implemented using programmable Lookup Tables (LUTs) inside a DRAM chip to facilitate massively parallel and ultra-efficient data encryption with the Advanced Encryption Standard (AES) algorithm. Its LUT-based architecture replaces logic-based computations with LUT ‘look-ups’ to minimize power consumption and operational latency. The proposed PIM architecture is organized as clusters of homogeneous, interconnected LUTs that can be dynamically programmed to execute the operations required for AES encryption. Our simulations show that the proposed PIM architecture offers up to 14.6× and 1.8× higher performance than CUDA-based implementations of AES encryption on a high-end commodity GPU and a state-of-the-art GPU computing processor, respectively. At the same time, it also achieves 217× and 31.2× higher energy efficiency, respectively, than the aforementioned devices while performing AES encryption.
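To make the computation-to-lookup substitution concrete, the sketch below builds the standard AES S-box once and then performs the SubBytes step purely by table look-up, which is the same trade the architecture makes in hardware. This is an ordinary software illustration of the principle, not the authors' DRAM LUT-cluster design; all function names are ours.

```python
# Build the AES S-box (GF(2^8) inverse + affine transform) once, then
# replace per-byte computation with pure table look-ups.

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """Multiplicative inverse in GF(2^8); AES maps 0 to 0."""
    if a == 0:
        return 0
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

SBOX = []
for byte in range(256):
    inv = gf_inv(byte)
    rot, acc = inv, inv
    for _ in range(4):                          # affine transform
        rot = ((rot << 1) | (rot >> 7)) & 0xFF  # rotate left by 1
        acc ^= rot
    SBOX.append(acc ^ 0x63)

def sub_bytes(state: bytes) -> bytes:
    """AES SubBytes as nothing but 16 table look-ups."""
    return bytes(SBOX[s] for s in state)

assert SBOX[0x00] == 0x63 and SBOX[0x01] == 0x7C  # known S-box values
```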
Citations: 3
SimiEncode: A Similarity-based Encoding Scheme to Improve Performance and Lifetime of Non-Volatile Main Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00044
Suzhen Wu, Jiapeng Wu, Zhirong Shen, Zhihao Zhang, Zuocheng Wang, Bo Mao
Non-Volatile Memories (NVMs) have shown tremendous potential as the next generation of main memory, yet they are still seriously hampered by high write latency and limited endurance. In this paper, we first unveil, via real-world benchmark analysis, that the words within the same cache line exhibit a high degree of similarity. We therefore present SimiEncode, a low-overhead and effective Similarity-based Encoding approach. SimiEncode relieves writes to NVMs by (1) generating a mask word that minimizes the differences to the words within a cache line, (2) encoding each word against the mask word with simple XOR operations, and (3) writing a single tag bit to indicate a resulting all-zero word after encoding. Our prototype implementation of SimiEncode and extensive evaluations driven by 15 state-of-the-art benchmarks demonstrate that, compared with existing approaches, SimiEncode significantly prolongs lifetime and improves performance. Importantly, SimiEncode is orthogonal to, and can be easily incorporated into, existing bit-flipping optimizations.
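A minimal software sketch of the three encoding steps follows. The mask-selection heuristic here (picking the best candidate from among the line's own words) is our simplifying assumption; the paper's exact mask-generation logic is hardware-specific.

```python
def encode_line(words):
    """SimiEncode-style line encoding: (1) pick a mask word close to the
    line's words, (2) XOR-encode each word, (3) tag all-zero results,
    which then need not be written to the NVM cells at all."""
    def cost(mask):  # total differing bits across the whole line
        return sum(bin(w ^ mask).count("1") for w in words)
    mask = min(words, key=cost)                  # step 1 (assumed heuristic)
    encoded = [w ^ mask for w in words]          # step 2
    zero_tags = [int(e == 0) for e in encoded]   # step 3
    return mask, encoded, zero_tags

def decode_line(mask, encoded):
    """Decoding is the same XOR against the stored mask word."""
    return [e ^ mask for e in encoded]

# Similar 64-bit words in one cache line yield mostly-zero encodings.
line = [0xDEADBEEF00000001, 0xDEADBEEF00000002, 0xDEADBEEF00000001]
print(encode_line(line))
```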
Citations: 3
Reconfigurable Array for Analog Applications
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00064
Ziyi Chen, I. Savidis
In this paper, a novel field-programmable analog array (FPAA) fabric consisting of a 6×6 matrix of configurable analog blocks (CABs) is proposed. The implementation of programmable CABs eliminates the use of fixed analog sub-circuits. A unique routing strategy is developed within the CAB units that supports both differential and single-ended circuit configurations. The bandwidth limitation due to the routing switches of each individual CAB unit is compensated for through a switch-less routing network between CABs. Algorithms and methodologies are developed to facilitate rapid implementation of analog circuits on the FPAA. The proposed FPAA fabric provides higher operating speeds than existing FPAA topologies, while providing greater configurability in the CAB units than switch-less FPAAs. The FPAA core includes 498 programming switches and 14 global switch-less interconnects, while occupying an area of 0.1 mm² in a 65 nm CMOS process. The characteristic power consumption is approximately 24.6 mW at a supply voltage of 1.2 V. Circuits implemented on the proposed FPAA fabric include operational amplifiers (op amps), filters, oscillators, and frequency dividers. The reconfigured bandpass filter provides a center frequency of approximately 1.5 GHz, while the synthesized ring oscillator and frequency divider support operating frequencies of up to 500 MHz.
Citations: 0
RDP3: Rapid Domain Platform Performance Prediction for Design Space Exploration
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00086
Jinghan Zhang, Mehrshad Zandigohar, G. Schirner
Heterogeneous Accelerator-rich (ACC-rich) platforms combining general-purpose cores and specialized HW Accelerators (ACCs) promise high-performance and low-power deployment of streaming applications, e.g. for video analytics, software-defined radio, and radar. In order to recover Non-Recurring Engineering (NRE) cost, a unified domain platform for a set of applications can be exploited, especially when the applications have functional and structural similarities that can benefit from common ACCs. However, identifying the most beneficial set of common ACCs is challenging, and current Design Space Exploration (DSE) methods for domain platform allocation suffer from a long exploration-time bottleneck. In particular, compared to a traditional DSE, evaluating the performance of a platform for a domain of applications is much more time-consuming, as binding exploration and evaluation are required for each application in the domain. Thus, rapid domain performance evaluation is needed to speed up the exploration of the platform allocation. This paper introduces Rapid Domain Platform Performance Prediction (RDP3) methods to speed up exploration in domain DSE. Key contributions are: (1) analyzing the current domain DSE flow and its exploration-time bottleneck; (2) introducing four RDP3 methods to speed up the evaluation of different platform allocations: Heuristic Processing (HP) estimation, Linear Regression (LR), Decision Tree Regression (DTR), and Multi-Layer Perceptron (MLP) predictions; (3) comparing the performance of these predictors and integrating the prediction into the current domain DSE. To evaluate the efficacy of RDP3, we explore 10K platforms capable of processing OpenVX domain applications. We demonstrate that RDP3-MLP, the most promising method, achieves a 17.5K× speedup with only 0.001 mean squared error compared to the current platform evaluation using the analytical model. Integrating RDP3-MLP into the existing domain DSE method GIDE [1] saves 80.8% of the exploration time while still producing the same output platform design.
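The three learned predictors can be sketched with off-the-shelf scikit-learn regressors, as below. The feature encoding of a platform allocation, the hyperparameters, and the synthetic stand-in data are all our assumptions; they are not the paper's OpenVX measurements or its HP heuristic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 8))   # hypothetical platform features (e.g. ACC/core counts)
y = X @ rng.random(8) + 0.05 * rng.standard_normal(500)  # stand-in "measured" perf

models = {
    "LR": LinearRegression(),
    "DTR": DecisionTreeRegressor(max_depth=8),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000),
}
for name, model in models.items():
    model.fit(X[:400], y[:400])     # train on the slow-evaluated subset
    mse = mean_squared_error(y[400:], model.predict(X[400:]))
    print(f"{name}: held-out MSE = {mse:.4f}")
```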
Citations: 0
Dynamic File Cache Optimization for Hybrid SSDs with High-Density and Low-Cost Flash Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00036
Ben Gu, Longfei Luo, Yina Lv, Changlong Li, Liang Shi
Over the last few years, hybrid solid-state drives (SSDs) have been widely adopted due to their high performance and high capacity. Devices equipped with hybrid SSDs can cache files from the network for performance improvement. However, this paper makes an interesting observation: the efficiency of hybrid SSDs is significantly degraded, rather than improved, when too much data is cached. This is because the internal mode switching between different types of flash memory is affected by device utilization. This paper proposes DFCache, a dynamic file cache optimization scheme for hybrid SSDs that optimizes the device’s efficiency and limits unreasonable space consumption. DFCache includes two key ideas: dynamic cache space management and intelligent cache file sifting. DFCache is implemented in the Linux kernel and tested on real hybrid SSDs. Experimental results show that its I/O performance outperforms the state of the art by up to 3.7×.
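The utilization-dependent behavior suggests a simple admission test of the kind sketched below. The watermark value and function name are illustrative assumptions, not DFCache's actual kernel logic.

```python
def should_cache(file_size, device_used, device_capacity,
                 high_watermark=0.7):
    """Admit a file into the hybrid-SSD cache only while utilization
    stays low, reflecting the paper's observation that high utilization
    blocks internal flash mode switching and degrades efficiency.
    The 0.7 watermark is an illustrative assumption."""
    projected = (device_used + file_size) / device_capacity
    return projected < high_watermark

# Example: a 4 GiB file on a 256 GiB device that is 200 GiB full.
print(should_cache(4 << 30, 200 << 30, 256 << 30))  # False: too full
```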
Citations: 2
WidePipe: High-Throughput Deep Learning Inference System on a Cluster of Neural Processing Units
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00091
Lixian Ma, En Shao, Yueyuan Zhou, Guangming Tan
The wide application of machine learning technology promotes the growth of ML-as-a-Service (MLaaS), a serverless computing paradigm for rapidly deploying a trained model as a service. However, it is challenging to design an inference system capable of coping with heavy traffic at low latency across heterogeneous neural networks. It is difficult to adaptively configure multilevel parallelism in existing cloud inference systems for machine learning services, particularly when the cluster has accelerators such as GPUs, NPUs, and FPGAs. These issues lead to poor resource utilization and limit system throughput. In this paper, we propose and implement a high-throughput inference system called WidePipe, which leverages reinforcement learning to co-adapt resource allocation and request batch size according to device status. We evaluated the performance of WidePipe on a large cluster with 1000 neural processing units across 250 nodes. Our experimental results show that WidePipe achieves 2.11× higher throughput than current inference systems when deploying heterogeneous machine learning services, while meeting the service-level objectives for response time.
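As a toy stand-in for the RL agent, the sketch below adapts batch size with an epsilon-greedy bandit that rewards throughput only when the latency SLO is met. WidePipe's actual state/action spaces and learning algorithm are richer; this only illustrates the co-adaptation loop.

```python
import random

class BatchSizeController:
    """Epsilon-greedy bandit: pick a batch size, observe throughput
    under a latency SLO, and reinforce choices that meet it."""
    def __init__(self, sizes=(1, 2, 4, 8, 16, 32), eps=0.1):
        self.sizes = sizes
        self.eps = eps
        self.value = {s: 0.0 for s in sizes}   # running mean reward
        self.count = {s: 0 for s in sizes}

    def choose(self):
        if random.random() < self.eps:          # explore
            return random.choice(self.sizes)
        return max(self.sizes, key=lambda s: self.value[s])  # exploit

    def update(self, size, throughput, latency, slo):
        reward = throughput if latency <= slo else -1.0  # SLO penalty
        self.count[size] += 1
        self.value[size] += (reward - self.value[size]) / self.count[size]
```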
Citations: 1
Flipping Bits to Share Crossbars in ReRAM-Based DNN Accelerator
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00016
Lei Zhao, Youtao Zhang, Jun Yang
Future deep neural networks (DNNs) tend to grow deeper and contain more trainable weights. Although methods such as pruning and quantization are widely adopted to reduce a DNN’s model size and computation, they are less applicable to ReRAM-based DNN accelerators. On the one hand, because the cells in a crossbar are accessed uniformly, it is difficult to exploit fine-grained pruning in ReRAM-based DNN accelerators. On the other hand, aggressive quantization results in poor accuracy, compounded by the low precision with which ReRAM cells represent weight values. In this paper, we propose BFlip, a novel model-size and computation reduction technique, to share crossbars among multiple bit matrices. BFlip clusters similar bit matrices together and finds a combination of row and column flips for each bit matrix that minimizes its distance to the centroid of the cluster. Therefore, only the centroid bit matrix is stored in the crossbar and is shared by all other bit matrices in that cluster. We also propose a calibration method to improve accuracy, as well as a ReRAM-based DNN accelerator that fully reaps the storage and computation benefits of BFlip. Our experiments show that BFlip effectively reduces model size and computation with negligible accuracy impact. The proposed accelerator achieves a 2.45× speedup and 85% energy reduction over the ISAAC baseline.
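The flip search can be sketched as below: invert any row or column whose bits disagree with the centroid in more than half of its positions, and record the flips so each matrix is represented by the shared centroid plus its flip vectors. The greedy one-pass search is our simplification; the paper's exact procedure may differ.

```python
import numpy as np

def flips_toward_centroid(bit_matrix, centroid):
    """Greedy sketch of BFlip's per-matrix step over 0/1 integer numpy
    arrays: flip rows, then columns, whenever flipping reduces the
    Hamming distance to the cluster centroid."""
    m = bit_matrix.copy()
    row_flips = np.zeros(m.shape[0], dtype=bool)
    col_flips = np.zeros(m.shape[1], dtype=bool)
    for i in range(m.shape[0]):
        if np.sum(m[i] != centroid[i]) > m.shape[1] / 2:
            m[i] ^= 1                   # inverting the row helps
            row_flips[i] = True
    for j in range(m.shape[1]):
        if np.sum(m[:, j] != centroid[:, j]) > m.shape[0] / 2:
            m[:, j] ^= 1                # inverting the column helps
            col_flips[j] = True
    return m, row_flips, col_flips
```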
Citations: 1
APT: Efficient Side-Channel Analysis Framework against Inner Product Masking Scheme
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00093
Jingdian Ming, Wei Cheng, Yongbin Zhou, Huizhong Li
Due to its provable security and remarkable device-independence, masking has been widely accepted as a good algorithmic-level countermeasure against side-channel attacks. Several code-based masking schemes have subsequently been proposed to strengthen the original Boolean masking (BM) scheme, Inner Product Masking (IPM) being a typical example. In this paper, we provide a framework, named analysis with predicted template (APT), for side-channel analysis of the IPM scheme. Following this framework, we propose two attacks based on maximum likelihood and Euclidean distance, respectively. To evaluate their efficiency, we perform simulated experiments on first-order BM and an optimal IPM scheme. The results show that our proposals are equivalent to a second-order CPA against the BM scheme, but significantly more efficient against an optimal IPM. In practical experiments on an ARM Cortex-M4 architecture, our proposals initially do not perform well because of a few outliers in the collected leakages. After filtering out these outliers, our proposals perform as efficiently as expected. Finally, we argue that the side-channel security of IPM can be improved by randomly selecting the vector L from an elaborated small set.
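The Euclidean-distance variant reduces to a nearest-template search, sketched below. The trace and template shapes, the use of the mean trace, and the function name are our assumptions for illustration; the paper's template prediction for IPM is more involved.

```python
import numpy as np

def euclidean_distinguisher(traces, templates):
    """Score each key hypothesis by the Euclidean distance between the
    mean measured leakage trace and that hypothesis's predicted
    template, then return the closest hypothesis.

    traces:    array of shape (n_traces, n_samples)
    templates: dict mapping key hypothesis -> predicted mean trace
    """
    mean_trace = traces.mean(axis=0)
    scores = {k: np.linalg.norm(mean_trace - t) for k, t in templates.items()}
    return min(scores, key=scores.get)
```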
Citations: 0
Rectification of Integer Arithmetic Circuits using Computer Algebra Techniques
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00039
V. Rao, Haden Ondricek, P. Kalla, Florian Enescu
This paper proposes a symbolic algebra approach for multi-target rectification of integer arithmetic circuits. The circuit is represented as a system of polynomials and rectified against a polynomial specification, with computations modeled over the field of rationals. Given a set of nets as potential rectification targets, we formulate a check to ascertain the existence of rectification functions at these targets. Upon confirmation, we compute the patch functions collectively for the targets. In this regard, we show how to synthesize a logic sub-circuit from polynomial artifacts generated over the field of rationals, and we present new mathematical contributions and results to substantiate this synthesis process. We present two approaches for patch-function computation: a greedy approach that resolves the rectification functions for the targets, and an approach that explores a subset of don’t-care conditions for the targets. Our approach is implemented as custom software and utilizes existing open-source symbolic algebra libraries for computation. We present experimental results on several integer multiplier benchmarks and discuss the quality of the generated patch sub-circuits.
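The underlying polynomial machinery can be sketched with SymPy over the rationals: model each gate as a polynomial, adjoin Boolean constraints, and test ideal membership of the specification by Gröbner-basis reduction. This toy single-gate example (our construction, not the paper's rectification-existence check) illustrates only the reduction step.

```python
from sympy import QQ, groebner, symbols

a, b, z = symbols("a b z")
circuit = [z - (a + b - 2*a*b)]      # XOR gate modeled as a polynomial
boolean = [a**2 - a, b**2 - b]       # signals take values in {0, 1}
spec = z - (a + b - 2*a*b)           # specification: z = a XOR b

# Groebner basis of the circuit ideal over the rationals; the spec
# polynomial reduces to zero iff the circuit implements the spec.
G = groebner(circuit + boolean, a, b, z, order="lex", domain=QQ)
_, remainder = G.reduce(spec)
print("circuit matches spec:", remainder == 0)
```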
Citations: 1
Premier: A Concurrency-Aware Pseudo-Partitioning Framework for Shared Last-Level Cache
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00068
Xiaoyang Lu, Rujia Wang, Xian-He Sun
As the number of on-chip cores and application demands increase, efficient management of shared cache resources becomes imperative. Cache partitioning techniques have been studied for decades to reduce interference between applications in a shared cache and to provide performance and fairness guarantees. However, there are few studies on how concurrent memory accesses affect the effectiveness of partitioning: when concurrent memory requests exist, the raw miss count does not reflect concurrency overlap well. In this work, we first introduce pure misses per kilo instructions (PMPKI), a metric that quantifies cache efficiency while accounting for concurrent access activity. We then propose Premier, a dynamically adaptive, concurrency-aware cache pseudo-partitioning framework. Premier provides insertion and promotion policies based on PMPKI curves to achieve the benefits of cache partitioning. Finally, our evaluation on various workloads shows that Premier outperforms state-of-the-art cache partitioning schemes in both performance and fairness. In an 8-core system, Premier achieves 15.45% higher system performance and 10.91% better fairness than the UCP scheme.
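The metric itself is straightforward once pure misses are counted, as in the sketch below. Identifying which misses are "pure" (i.e., not hidden by overlap with other outstanding requests) is the hardware-level part defined in the paper; this helper only computes the resulting ratio.

```python
def pmpki(pure_misses: int, instructions: int) -> float:
    """Pure Misses Per Kilo Instructions: like MPKI, but counting only
    misses whose latency is not overlapped by concurrent requests."""
    return pure_misses / (instructions / 1000.0)

# Example: 1,200 pure misses over 3 million retired instructions.
print(pmpki(1_200, 3_000_000))  # 0.4
```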
Citations: 2