QD-Compressor: a Quantization-based Delta Compression Framework for Deep Neural Networks
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00088
Shuyu Zhang, Donglei Wu, Haoyu Jin, Xiangyu Zou, Wen Xia, Xiaojia Huang
Deep neural networks (DNNs) have achieved remarkable success in many fields. However, large-scale DNNs bring storage challenges when snapshots are stored to guard against frequent cluster failures, and they generate massive internet traffic when DNNs are dispatched or updated on resource-constrained devices (e.g., IoT devices and mobile phones). Several approaches aim to compress DNNs. The recent work Delta-DNN observes the high similarity between DNNs and therefore computes the differences between them to improve the compression ratio. However, we observe that Delta-DNN, which applies a traditional global lossy quantization technique when computing the differences between two neighboring versions of a DNN, cannot fully exploit the data similarity between them for delta compression. This is because the parameters' value ranges (and hence the delta data in Delta-DNN) vary across the layers of a DNN, which inspires us to propose a local-sensitive quantization scheme: the quantizers adapt to the parameters' local value ranges in each layer. Moreover, instead of quantizing the differences between DNNs as Delta-DNN does, our approach quantizes the DNNs before computing the differences, which makes the differences more compressible. We also propose an error feedback mechanism to reduce the accuracy loss caused by the lossy quantization. Based on these techniques, we design a novel quantization-based delta compressor called QD-Compressor, which computes lossy differences between epochs of a DNN to save the storage cost of backing up DNN snapshots and the internet traffic of dispatching DNNs to resource-constrained devices. Experiments on several popular DNNs and datasets show that QD-Compressor achieves a compression ratio 2.4×~31.5× higher than state-of-the-art approaches while well maintaining the model's test accuracy.
ModelShield: A Generic and Portable Framework Extension for Defending Bit-Flip based Adversarial Weight Attacks
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00090
Yanan Guo, Liang Liu, Yueqiang Cheng, Youtao Zhang, Jun Yang
The bit-flip attack (BFA) has become one of the most serious threats to Deep Neural Network (DNN) security. By utilizing Rowhammer to flip the bits of DNN weights stored in memory, an attacker can turn a functional DNN into a random output generator. In this work, we propose ModelShield, a defense mechanism against BFA based on protecting the integrity of weights using hash verification. ModelShield performs real-time integrity verification on DNN weights. Since this can slow down DNN inference by up to 7×, we further propose two optimizations for ModelShield. We implement ModelShield as a lightweight software extension that can be easily installed into popular DNN frameworks. We test both the security and the performance of ModelShield, and the results show that it can effectively defend against BFA with less than 2% performance overhead.
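To illustrate the hash-verification idea in its simplest form (a sketch under our own assumptions; it reflects neither ModelShield's actual data layout nor its two optimizations), one can digest every weight tensor once and re-check the digests at inference time:

```python
import hashlib

def weight_digests(model_weights):
    """Record a SHA-256 digest of every weight tensor's raw bytes (model_weights: name -> ndarray)."""
    return {name: hashlib.sha256(w.tobytes()).hexdigest()
            for name, w in model_weights.items()}

def verify_weights(model_weights, digests):
    """Return the names of tensors whose in-memory bytes no longer match their digest."""
    return [name for name, w in model_weights.items()
            if hashlib.sha256(w.tobytes()).hexdigest() != digests[name]]
```

Any Rowhammer-induced bit flip changes the affected tensor's digest, so periodically calling verify_weights can catch corruption before the weights are used; the cost of rehashing is exactly the overhead the paper's optimizations target.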
{"title":"ModelShield: A Generic and Portable Framework Extension for Defending Bit-Flip based Adversarial Weight Attacks","authors":"Yanan Guo, Liang Liu, Yueqiang Cheng, Youtao Zhang, Jun Yang","doi":"10.1109/ICCD53106.2021.00090","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00090","url":null,"abstract":"Bit-flip attack (BFA) has become one of the most serious threats to Deep Neural Network (DNN) security. By utilizing Rowhammer to flip the bits of DNN weights stored in memory, the attacker can turn a functional DNN into a random output generator. In this work, we propose ModelShield, a defense mechanism against BFA, based on protecting the integrity of weights using hash verification. ModelShield performs real-time integrity verification on DNN weights. Since this can slow down a DNN inference by up to 7×, we further propose two optimizations for ModelShield. We implement ModelShield as a lightweight software extension that can be easily installed into popular DNN frameworks. We test both the security and performance of ModelShield, and the results show that it can effectively defend BFA with less than 2% performance overhead.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116103847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discreet-PARA: Rowhammer Defense with Low Cost and High Efficiency
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00074
Yichen Wang, Yang Liu, Peiyun Wu, Zhao Zhang
The DRAM rowhammer attack is a severe security concern for computer systems using DRAM memories. A number of defense mechanisms have been proposed, but all fall short in either performance overhead, storage requirements, or defense strength. In this paper, we present a novel design called discreet-PARA. It creatively integrates two new components, namely Disturbance Bin Counting (DBC) and PARA-cache, into the existing PARA (Probabilistic Adjacent Row Activation) defense. The two components require only small counter and cache storage but can eliminate or significantly reduce the performance overhead of PARA. Our evaluation using SPEC CPU2017 workloads confirms that discreet-PARA achieves very high defense strength with a performance overhead much lower than that of the original PARA.
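For reference, the baseline PARA policy that discreet-PARA augments can be stated in a few lines; the probability constant and the refresh callback below are placeholders, and the DBC and PARA-cache components are not modeled:

```python
import random

PARA_PROBABILITY = 0.001  # illustrative value; real designs tune it to the DRAM's disturbance threshold

def on_row_activate(bank, row, refresh_row):
    """On every activation, refresh the physically adjacent rows with a small probability."""
    if random.random() < PARA_PROBABILITY:
        if row > 0:
            refresh_row(bank, row - 1)
        refresh_row(bank, row + 1)
```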
{"title":"Discreet-PARA: Rowhammer Defense with Low Cost and High Efficiency","authors":"Yichen Wang, Yang Liu, Peiyun Wu, Zhao Zhang","doi":"10.1109/ICCD53106.2021.00074","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00074","url":null,"abstract":"DRAM rowhammer attack is a severe security concern on computer systems using DRAM memories. A number of defense mechanisms have been proposed, but all with short-coming in either performance overhead, storage requirement, or defense strength. In this paper, we present a novel design called discreet-PARA. It creatively integrates two new components, namely Disturbance Bin Counting (DBC) and PARA-cache, into the existing PARA (Probabilistic Adjacent Row Activation) defense. The two components only require small counter and cache storages but can eliminate or significantly reduce the performance overhead of PARA. Our evaluation using SPEC CPU2017 workloads confirms that discreet-PARA can achieve very high defense strength with a performance overhead much lower than the original PARA.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"24 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132671952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MasterMind: Many-Accelerator SoC Architecture for Real-Time Brain-Computer Interfaces
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00027
Guy Eichler, Luca Piccolboni, Davide Giri, L. Carloni
Hierarchical Wasserstein Alignment (HiWA) is one of the most promising Brain-Computer Interface algorithms. To enable its real-time communication with the brain and meet low-power requirements, we design and prototype a Linux-supporting, RISC-V based SoC that integrates multiple hardware accelerators. We conduct a thorough design-space exploration at the accelerator level and at the SoC level. With FPGA-based experiments, we show that one of our area-efficient SoCs provides 91x performance and 37x energy efficiency gains over software execution on an embedded processor. We further improve our gains (up to 3408x and 497x, respectively) by parallelizing the workload on multiple accelerator instances and by adopting point-to-point accelerator communication, which reduces memory accesses and software-synchronization overheads. The results include comparisons with multi-threaded software implementations of HiWA running on an Intel i7 and ARM A53 as well as a projection analysis showing that an ASIC implementation of our SoC would meet the needs of real-time Brain-Computer Interfaces.
{"title":"MasterMind: Many-Accelerator SoC Architecture for Real-Time Brain-Computer Interfaces","authors":"Guy Eichler, Luca Piccolboni, Davide Giri, L. Carloni","doi":"10.1109/ICCD53106.2021.00027","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00027","url":null,"abstract":"Hierarchical Wasserstein Alignment (HiWA) is one of the most promising Brain-Computer Interface algorithms. To enable its real-time communication with the brain and meet low-power requirements, we design and prototype a Linux-supporting, RISC-V based SoC that integrates multiple hardware accelerators. We conduct a thorough design-space exploration at the accelerator level and at the SoC level. With FPGA-based experiments, we show that one of our area-efficient SoCs provides 91x performance and 37x energy efficiency gains over software execution on an embedded processor. We further improve our gains (up to 3408x and 497x, respectively) by parallelizing the workload on multiple accelerator instances and by adopting point-to-point accelerator communication, which reduces memory accesses and software-synchronization overheads. The results include comparisons with multi-threaded software implementations of HiWA running on an Intel i7 and ARM A53 as well as a projection analysis showing that an ASIC implementation of our SoC would meet the needs of real-time Brain-Computer Interfaces.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132891279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CIDAN: Computing in DRAM with Artificial Neurons
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00062
G. Singh, Ankit Wagle, S. Vrudhula, S. Khatri
Numerous applications such as graph processing, cryptography, databases, and bioinformatics involve the repeated evaluation of Boolean functions on large bit vectors. In-memory architectures, which perform processing in memory (PIM), are tailored for such applications. This paper describes a different architecture for in-memory computation called CIDAN, which achieves a 3X improvement in performance and a 2X improvement in energy for a representative set of algorithms over state-of-the-art in-memory architectures. CIDAN uses a new basic processing element called a TLPE, which comprises a threshold logic gate (TLG), a.k.a. an artificial neuron or perceptron. The implementation of a TLG within a TLPE is equivalent to a multi-input, edge-triggered flip-flop that computes a subset of the threshold functions of its inputs. The specific threshold function is selected on each cycle by enabling/disabling a subset of the weights associated with the threshold function using logic signals. In addition to the TLG, a TLPE realizes some non-threshold functions by a sequence of TLG evaluations. An equivalent CMOS implementation of a TLPE requires substantially higher area and power. CIDAN has an array of TLPEs integrated with a DRAM, allowing fast evaluation of any one of its set of functions on large bit vectors. Results of running several common in-memory applications in graph processing and cryptography are presented.
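The function computed by a TLG with per-cycle weight enables can be captured concisely; the sketch below is a behavioral model with made-up weights, not CIDAN's circuit or its sequencing of non-threshold functions:

```python
def threshold_gate(inputs, weights, enables, threshold):
    """Fire iff the weighted sum of the enabled inputs reaches the threshold.
    inputs: 0/1 values; weights: per-input integer weights;
    enables: per-cycle mask selecting which weights participate."""
    total = sum(x * w for x, w, e in zip(inputs, weights, enables) if e)
    return int(total >= threshold)

# With unit weights all enabled, this evaluates a 2-of-3 majority function.
print(threshold_gate([1, 0, 1], [1, 1, 1], [1, 1, 1], threshold=2))  # -> 1
```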
{"title":"CIDAN: Computing in DRAM with Artificial Neurons","authors":"G. Singh, Ankit Wagle, S. Vrudhula, S. Khatri","doi":"10.1109/ICCD53106.2021.00062","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00062","url":null,"abstract":"Numerous applications such as graph processing, cryptography, databases, bioinformatics, etc., involve the repeated evaluation of Boolean functions on large bit vectors. In-memory architectures which perform processing in memory (PIM) are tailored for such applications. This paper describes a different architecture for in-memory computation called CIDAN, that achieves a 3X improvement in performance and a 2X improvement in energy for a representative set of algorithms over the state-of-the-art in-memory architectures. CIDAN uses a new basic processing element called a TLPE, which comprises a threshold logic gate (TLG) (a.k.a artificial neuron or perceptron). The implementation of a TLG within a TLPE is equivalent to a multi-input, edge-triggered flipflop that computes a subset of threshold functions of its inputs. The specific threshold function is selected on each cycle by enabling/disabling a subset of the weights associated with the threshold function, by using logic signals. In addition to the TLG, a TLPE realizes some non-threshold functions by a sequence of TLG evaluations. An equivalent CMOS implementation of a TLPE requires a substantially higher area and power. CIDAN has an array of TLPE(s) that is integrated with a DRAM, to allow fast evaluation of any one of its set of functions on large bit vectors. Results of running several common in-memory applications in graph processing and cryptography are presented.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129276677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00071
Nazareno Bruschi, Germain Haugou, Giuseppe Tagliavini, Francesco Conti, L. Benini, D. Rossi
The last few years have seen the emergence of IoT processors: ultra-low power systems-on-chip (SoCs) combining lightweight and flexible micro-controller units (MCUs), often based on open-ISA RISC-V cores, with application-specific accelerators to maximize performance and energy efficiency. Overall, this level of heterogeneity requires complex hardware and a full-fledged software stack to orchestrate the execution and exploit platform features. For this reason, enabling agile design-space exploration becomes a crucial asset for this new class of low-power SoCs. In this scenario, high-level simulators play an essential role in breaking the speed and design-effort bottlenecks of cycle-accurate simulators and FPGA prototypes, respectively, while preserving functional and timing accuracy. We present GVSoC, a highly configurable and timing-accurate event-driven simulator that combines the efficiency of C++ models with the flexibility of Python configuration scripts. GVSoC is fully open-sourced, with the intent to drive future research in the area of highly parallel and heterogeneous RISC-V based IoT processors, leveraging three foundational features: Python-based modular configuration of the hardware description, easy calibration of platform parameters for accurate performance estimation, and high-speed simulation. Experimental results show that GVSoC enables practical functional and performance analysis and design exploration at the full-platform level (processors, memory, peripherals, and IOs) with a speed-up of 2500× over cycle-accurate simulation, with errors typically below 10% for performance analysis.
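At its core, an event-driven simulator of this kind schedules timestamped callbacks; the sketch below shows only that generic engine loop, with an invented API that is not GVSoC's actual interface:

```python
import heapq

class EventQueue:
    """Minimal event-driven engine: components post (delay, action) pairs."""
    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker so equal-time events keep their posting order

    def post(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self._queue)
            action()

# Example: a timer component raising an interrupt 100 time units after being armed.
engine = EventQueue()
engine.post(100, lambda: print(f"[{engine.now}] timer interrupt"))
engine.run(until=1_000)
```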
{"title":"GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors","authors":"Nazareno Bruschi, Germain Haugou, Giuseppe Tagliavini, Francesco Conti, L. Benini, D. Rossi","doi":"10.1109/ICCD53106.2021.00071","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00071","url":null,"abstract":"The last few years have seen the emergence of IoT processors: ultra-low power systems-on-chips (SoCs) combining lightweight and flexible micro-controller units (MCUs), often based on open-ISA RISC-V cores, with application-specific accelerators to maximize performance and energy efficiency. Overall, this heterogeneity level requires complex hardware and a full-fledged software stack to orchestrate the execution and exploit platform features. For this reason, enabling agile design space exploration becomes a crucial asset for this new class of low-power SoCs. In this scenario, high-level simulators play an essential role in breaking the speed and design effort bottlenecks of cycle-accurate simulators and FPGA prototypes, respectively, while preserving functional and timing accuracy. We present GVSoC, a highly configurable and timing-accurate event-driven simulator that combines the efficiency of C++ models with the flexibility of Python configuration scripts. GVSoC is fully open-sourced, with the intent to drive future research in the area of highly parallel and heterogeneous RISC-V based IoT processors, leveraging three foundational features: Python-based modular configuration of the hardware description, easy calibration of platform parameters for accurate performance estimation, and high-speed simulation. Experimental results show that GVSoC enables practical functional and performance analysis and design exploration at the full-platform level (processors, memory, peripherals and IOs) with a speed-up of 2500× with respect to cycle-accurate simulation with errors typically below 10% for performance analysis.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115298591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WiNN: Wireless Interconnect based Neural Network Accelerator
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00052
Siqin Liu, S. Karmunchi, Avinash Karanth, S. Laha, S. Kaya
Deep Neural Networks (DNNs) have demonstrated promising accuracy for several applications such as image processing, speech recognition, and autonomous systems and vehicles. Spatial accelerators have been proposed to achieve high parallelism with arrays of processing elements (PEs) and energy-efficient data movement using traditional Network-on-Chip (NoC) architectures. However, larger DNN models impose high-bandwidth and low-latency communication demands between PEs, which is a fundamental challenge for metallic NoC architectures. In this paper, we propose WiNN, a wireless and wired interconnected neural network accelerator that employs on-chip wireless links to provide high network bandwidth and single-cycle multicast communication. We design separate wireless networks modulated on two different frequency bands, one each for the weights and the inputs; highly directional antennas are implemented to avoid noise and interference. We propose a multicast-for-wireless (MW) dataflow for our accelerator that efficiently exploits the wireless channels' multicast capabilities to reduce communication overheads. Our novel wireless transmitter integrates an on-off keying (OOK) modulator with a power amplifier, which results in significant energy savings. Our simulation results show that WiNN achieves a 74% latency reduction and 37.5% energy saving compared to state-of-the-art metallic-link-based accelerators, and a 38.1% latency reduction and 19.4% energy saving compared to prior wireless accelerators, across various neural networks (AlexNet, VGG16, and ResNet-50).
{"title":"WiNN: Wireless Interconnect based Neural Network Accelerator","authors":"Siqin Liu, S. Karmunchi, Avinash Karanth, S. Laha, S. Kaya","doi":"10.1109/ICCD53106.2021.00052","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00052","url":null,"abstract":"Deep Neural Networks (DNNs) have demonstrated promising performance in accuracy for several applications such as image processing, speech recognition, and autonomous systems and vehicles. Spatial accelerators have been proposed to achieve high parallelism with arrays of processing elements (PE) and energy efficient data movement using traditional Network-on-Chip (NoC) architectures. However, larger DNN models impose high bandwidth and low latency communication demands between PEs, which is a fundamental challenge for metallic NoC architectures. In this paper, we propose WiNN, a wireless and wired interconnected neural network accelerator that employs on-chip wireless links to provide high network bandwidth and single cycle multicast communication. We design separate wireless networks modulated with two different frequency bands one each for the weights and input Highly directional antennas are implemented to avoid noise and interference. We propose multicast-for-wireless (MW) dataflow for our proposed accelerator that efficiently exploits the wireless channels’ multicast capabilities to reduce the communication overheads. Our novel wireless transmitter integrates on-off keying (OOK) modulator with power amplifier that results in significant energy savings. Our simulation results show that WiNN achieves 74% latency reduction and 37.5% energy saving when compared to state-of-art metallic link-based accelerators, 38.1% latency reduction and 19.4% energy saving when compared to prior wireless accelerators for various neural networks (AlexNet, VGG16, and ResNet-50).","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"434 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115866279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PSACS: Highly-Parallel Shuffle Accelerator on Computational Storage
Pub Date: 2021-10-01, DOI: 10.1109/iccd53106.2021.00080
Chen Zou, Hui Zhang, A. Chien, Y. Ki
Shuffle is an indispensable process in distributed online analytical processing systems for exploiting task-level parallelism across multiple nodes. As a data-intensive data-reorganization process, shuffle implemented on general-purpose CPUs not only incurs data traffic back and forth between the computing and storage resources, but also pollutes the cache hierarchy with almost zero data reuse. As a result, shuffle can easily become the bottleneck of distributed analysis pipelines. Our PSACS approach attacks these bottlenecks with the rising computational storage paradigm. Shuffle is offloaded to the storage-side PSACS accelerator to avoid polluting the compute node's memory hierarchy and to enjoy the latency, bandwidth, and energy benefits of near-data computing. Further, the microarchitecture of PSACS exploits data-, subtask-, and task-level parallelism for high performance, along with a customized scratchpad for fast on-chip random access. PSACS achieves 4.6x-5.7x shuffle throughput at the kernel level and up to 1.3x overall shuffle throughput with only a twentieth of the CPU utilization of software baselines. These add up to a 23% end-to-end OLAP query speedup on average.
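As background on the operation being offloaded (a plain-Python sketch of a hash shuffle, not the PSACS microarchitecture), a shuffle repartitions rows by key so that each downstream node receives exactly one bucket:

```python
def hash_shuffle(rows, key, num_partitions):
    """Group rows into per-destination buckets by hashing the partition key.
    A real system would use a stable hash so all nodes agree on the mapping."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return buckets

rows = [{"user": "a", "v": 1}, {"user": "b", "v": 2}, {"user": "a", "v": 3}]
for node, bucket in enumerate(hash_shuffle(rows, "user", num_partitions=2)):
    print(node, bucket)
```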
{"title":"PSACS: Highly-Parallel Shuffle Accelerator on Computational Storage","authors":"Chen Zou, Hui Zhang, A. Chien, Y. Ki","doi":"10.1109/iccd53106.2021.00080","DOIUrl":"https://doi.org/10.1109/iccd53106.2021.00080","url":null,"abstract":"Shuffle is an indispensable process in distributed online analytical processing systems to enable task-level parallelism exploitation via multiple nodes. As a data-intensive data reorganization process, shuffle implemented on general-purpose CPUs not only incurs data traffic back and forth between the computing and storage resources, but also pollutes the cache hierarchy with almost zero data reuse. As a result, shuffle can easily become the bottleneck of distributed analysis pipelines.Our PSACS approach attacks these bottlenecks with the rising computational storage paradigm. Shuffle is offloaded to the storage-side PSACS accelerator to avoid polluting computing node memory hierarchy and enjoy the latency, bandwidth and energy benefits of near-data computing. Further, the microarchitecture of PSACS exploits data-, subtask-, and task-level parallelism for high performance and a customized scratchpad for fast on-chip random access.PSACS achieves 4.6x—5.7x shuffle throughput at kernel-level and up to 1.3x overall shuffle throughput with only a twentieth of CPU utilization comparing to software baselines. These mount up to 23% end-to-end OLAP query speedup on average.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123759442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resonance-Based Power-Efficient Pulse Generator Design with Corresponding Distribution Network
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00063
K. Jia, Liang Yang, Jian Wang, B. Lin, Hao Wang, Rui-xin Shi
Pulsed latches are regarded as sequential elements competing with flip-flops, mainly for their low-power and high-performance advantages. In a typical pulsed-latch system, an explicit or implicit pulse generator (PG) is used to generate the necessary clock pulse, contributing a significant amount of power consumption. To address this, a novel resonance-based power-efficient PG circuit called RPG is proposed. In 12nm FinFET simulations of typical multi-bit applications, RPG shows a power reduction of up to 60% and more stable performance across temperature and voltage variations compared with other PG circuits. Furthermore, a distribution method for integrating RPG into traditional designs is provided. The evaluation in a test core shows that it achieves up to a 21% clock power reduction, with less clock skew overhead and no device area loss.
{"title":"Resonance-Based Power-Efficient Pulse Generator Design with Corresponding Distribution Network","authors":"K. Jia, Liang Yang, Jian Wang, B. Lin, Hao Wang, Rui-xin Shi","doi":"10.1109/ICCD53106.2021.00063","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00063","url":null,"abstract":"Pulsed-latches are treated as competing sequential elements to flip-flops, mainly for their low-power and high-performance advantages. In a typical pulsed-latch system, an explicit or implicit pulse generator (PG) is used to generate the necessary clock pulse, contributing a significant amount of power consumption. To address it, a novel resonance-based power-efficient PG circuit called RPG is proposed. A power reduction up to 60% and a more stable performance in variable temperature and voltage environments are shown in 12nm Fin-FET simulations as compared with other PG circuits in typical multi-bit applications. Furthermore, a distribution method of integrating RPG into traditional designs is provided. The evaluation in a test core shows that it achieves up to 21% in clock power reduction, with less clock skew overhead and no device area loss.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123782179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-performance Post-deduplication Delta Compression Scheme for Packed Datasets
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00078
Yucheng Zhang, Hong Jiang, Mengtian Shi, Chunzhi Wang, Nan Jiang, Xinyun Wu
Data deduplication has become a standard feature in most backup storage systems to reduce storage costs. In real-world deduplication-based backup products, small files are grouped into larger packed files prior to deduplication. For each file, the grouping inserts a metadata block immediately before the file contents. Since the contents of these metadata blocks vary with every backup, different backup streams of packed files built from the same or highly similar small files will contain chunks that conventional deduplication considers mostly unique. That is, most of the contents of these unique chunks in different backups are identical except for the metadata blocks. Delta compression can remove this redundancy but cannot be applied to backup storage directly, because the extra I/Os required to retrieve the base chunks significantly decrease backup throughput. If there are many grouped small files in the backup datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), may be rewritten repeatedly. We observe that PFCs are often surrounded by substantial unique chunks containing metadata blocks. In this paper, we propose a PFC-inspired delta compression scheme to efficiently perform delta compression for unique chunks surrounding identical PFCs. In the process of deduplication, containers holding previous copies of the chunks being considered for storage are accessed to prefetch metadata and accelerate the detection of duplicates. The main idea behind our scheme is to identify containers holding PFCs and prefetch the chunks in those containers by piggybacking on the reads that already prefetch metadata during deduplication. Base chunks for delta compression are then detected among the prefetched chunks, eliminating the extra I/Os for retrieving base chunks. Experimental results show that PFC-inspired delta compression attains about 2x additional data reduction on top of deduplication and accelerates restore speed by 8.6%-49.3%, while moderately sacrificing backup throughput by 0.5%-11.9%.
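To make the saved work concrete, the sketch below delta-encodes a unique chunk against a base chunk that is already resident because its container was read anyway; the copy/literal format and difflib-based matching are our own illustrative choices, not the paper's encoder:

```python
from difflib import SequenceMatcher

def delta_encode(base: bytes, chunk: bytes):
    """Encode `chunk` as copy-from-base and literal operations."""
    ops = []
    for tag, b0, b1, c0, c1 in SequenceMatcher(a=base, b=chunk, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", b0, b1 - b0))      # bytes reused from the base chunk
        else:
            ops.append(("literal", chunk[c0:c1]))  # only the new bytes (e.g., metadata blocks)
    return ops

def delta_decode(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```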
{"title":"A High-performance Post-deduplication Delta Compression Scheme for Packed Datasets","authors":"Yucheng Zhang, Hong Jiang, Mengtian Shi, Chunzhi Wang, Nan Jiang, Xinyun Wu","doi":"10.1109/ICCD53106.2021.00078","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00078","url":null,"abstract":"Data deduplication has become a standard feature in most storage backup systems to reduce storage costs. In real-world deduplication-based backup products, small files are grouped into larger packed files prior to deduplication. For each file, the grouping entails a backup product inserting a metadata block immediately before the file contents. Since the contents of these metadata blocks vary with every backup, different backup streams of the packed files from the same or highly similar small files will contain chunks that are considered mostly unique by conventional deduplication. That is, most of the contents among these unique chunks in different backups are identical, except for metadata blocks. Delta compression is able to remove those redundancy but cannot be applied to backup storage because the extra I/Os required to retrieve the base chunks significantly decrease backup throughput. If there are many grouped small files in the backup datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), may be rewritten repeatedly. We observe that PFCs are often surrounded by substantial unique chunks containing metadata blocks. In this paper, we propose a PFC-inspired delta compression scheme to efficiently perform delta compression for unique chunks surrounding identical PFCs.In the process of deduplication, containers holding previous copies of the chunks being considered for storage will be accessed for prefetching metadata to accelerate the detection of duplicates. The main idea behind our scheme is to identify containers holding PFCs and prefetch chunks in those containers by piggybacking on the reads for prefetching metadata when they are accessed during deduplication. Base chunks for delta compression are then detected from the prefetched chunks, thus eliminating extra I/Os for retrieving the base chunks. Experimental results show that PFC-inspired delta compression attains additional data reduction by about 2x on top of data deduplications and accelerates the restore speed by 8.6%-49.3%, while moderately sacrificing the backup throughput by 0.5%-11.9%.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"15 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127560418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}