QD-Compressor: a Quantization-based Delta Compression Framework for Deep Neural Networks
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00088
Shuyu Zhang, Donglei Wu, Haoyu Jin, Xiangyu Zou, Wen Xia, Xiaojia Huang
Deep neural networks (DNNs) have achieved remarkable success in many fields. However, large-scale DNNs bring storage challenges when snapshots are stored to guard against frequent cluster failures, and they generate massive internet traffic when DNNs are dispatched or updated on resource-constrained devices (e.g., IoT devices and mobile phones). Several approaches aim to compress DNNs. The recent work Delta-DNN observes the high similarity between DNNs and therefore computes the differences between them to improve the compression ratio. However, we observe that Delta-DNN, which applies a traditional global lossy quantization technique when computing the differences between two neighboring versions of a DNN, cannot fully exploit the data similarity between them for delta compression. This is because the parameters' value ranges (and hence the delta data in Delta-DNN) vary across the layers of a DNN, which inspires us to propose a local-sensitive quantization scheme: the quantizers adapt to the parameters' local value ranges in each layer. Moreover, instead of quantizing the differences between DNNs as Delta-DNN does, our approach quantizes the DNNs before computing the differences, which makes the differences more compressible. We also propose an error feedback mechanism to reduce the accuracy loss caused by the lossy quantization. Based on these techniques, we design a novel quantization-based delta compressor called QD-Compressor, which computes lossy differences between epochs of a DNN to save the storage cost of backing up DNN snapshots and the internet traffic of dispatching DNNs to resource-constrained devices. Experiments on several popular DNNs and datasets show that QD-Compressor achieves a compression ratio 2.4×~31.5× higher than state-of-the-art approaches while well maintaining the model's test accuracy.
ModelShield: A Generic and Portable Framework Extension for Defending Bit-Flip based Adversarial Weight Attacks
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00090
Yanan Guo, Liang Liu, Yueqiang Cheng, Youtao Zhang, Jun Yang
The bit-flip attack (BFA) has become one of the most serious threats to Deep Neural Network (DNN) security. By utilizing Rowhammer to flip the bits of DNN weights stored in memory, an attacker can turn a functional DNN into a random output generator. In this work, we propose ModelShield, a defense mechanism against BFA based on protecting the integrity of weights using hash verification. ModelShield performs real-time integrity verification on DNN weights. Since this can slow down DNN inference by up to 7×, we further propose two optimizations for ModelShield. We implement ModelShield as a lightweight software extension that can be easily installed into popular DNN frameworks. We test both the security and the performance of ModelShield, and the results show that it can effectively defend against BFA with less than 2% performance overhead.
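To illustrate the hash-verification idea in its simplest form (a sketch under our own assumptions; it reflects neither ModelShield's actual data layout nor its two optimizations), one can digest every weight tensor once and re-check the digests at inference time:

```python
import hashlib

def weight_digests(model_weights):
    """Record a SHA-256 digest of every weight tensor's raw bytes (model_weights: name -> ndarray)."""
    return {name: hashlib.sha256(w.tobytes()).hexdigest()
            for name, w in model_weights.items()}

def verify_weights(model_weights, digests):
    """Return the names of tensors whose in-memory bytes no longer match their digest."""
    return [name for name, w in model_weights.items()
            if hashlib.sha256(w.tobytes()).hexdigest() != digests[name]]
```

Any Rowhammer-induced bit flip changes the affected tensor's digest, so periodically calling verify_weights can catch corruption before the weights are used; the cost of rehashing is exactly the overhead the paper's optimizations target.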
{"title":"ModelShield: A Generic and Portable Framework Extension for Defending Bit-Flip based Adversarial Weight Attacks","authors":"Yanan Guo, Liang Liu, Yueqiang Cheng, Youtao Zhang, Jun Yang","doi":"10.1109/ICCD53106.2021.00090","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00090","url":null,"abstract":"Bit-flip attack (BFA) has become one of the most serious threats to Deep Neural Network (DNN) security. By utilizing Rowhammer to flip the bits of DNN weights stored in memory, the attacker can turn a functional DNN into a random output generator. In this work, we propose ModelShield, a defense mechanism against BFA, based on protecting the integrity of weights using hash verification. ModelShield performs real-time integrity verification on DNN weights. Since this can slow down a DNN inference by up to 7×, we further propose two optimizations for ModelShield. We implement ModelShield as a lightweight software extension that can be easily installed into popular DNN frameworks. We test both the security and performance of ModelShield, and the results show that it can effectively defend BFA with less than 2% performance overhead.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116103847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discreet-PARA: Rowhammer Defense with Low Cost and High Efficiency
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00074
Yichen Wang, Yang Liu, Peiyun Wu, Zhao Zhang
The DRAM rowhammer attack is a severe security concern for computer systems using DRAM memories. A number of defense mechanisms have been proposed, but all fall short in either performance overhead, storage requirements, or defense strength. In this paper, we present a novel design called discreet-PARA. It creatively integrates two new components, namely Disturbance Bin Counting (DBC) and PARA-cache, into the existing PARA (Probabilistic Adjacent Row Activation) defense. The two components require only small counter and cache storage but can eliminate or significantly reduce the performance overhead of PARA. Our evaluation using SPEC CPU2017 workloads confirms that discreet-PARA achieves very high defense strength with a performance overhead much lower than that of the original PARA.
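For reference, the baseline PARA policy that discreet-PARA augments can be stated in a few lines; the probability constant and the refresh callback below are placeholders, and the DBC and PARA-cache components are not modeled:

```python
import random

PARA_PROBABILITY = 0.001  # illustrative value; real designs tune it to the DRAM's disturbance threshold

def on_row_activate(bank, row, refresh_row):
    """On every activation, refresh the physically adjacent rows with a small probability."""
    if random.random() < PARA_PROBABILITY:
        if row > 0:
            refresh_row(bank, row - 1)
        refresh_row(bank, row + 1)
```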
{"title":"Discreet-PARA: Rowhammer Defense with Low Cost and High Efficiency","authors":"Yichen Wang, Yang Liu, Peiyun Wu, Zhao Zhang","doi":"10.1109/ICCD53106.2021.00074","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00074","url":null,"abstract":"DRAM rowhammer attack is a severe security concern on computer systems using DRAM memories. A number of defense mechanisms have been proposed, but all with short-coming in either performance overhead, storage requirement, or defense strength. In this paper, we present a novel design called discreet-PARA. It creatively integrates two new components, namely Disturbance Bin Counting (DBC) and PARA-cache, into the existing PARA (Probabilistic Adjacent Row Activation) defense. The two components only require small counter and cache storages but can eliminate or significantly reduce the performance overhead of PARA. Our evaluation using SPEC CPU2017 workloads confirms that discreet-PARA can achieve very high defense strength with a performance overhead much lower than the original PARA.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"24 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132671952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MasterMind: Many-Accelerator SoC Architecture for Real-Time Brain-Computer Interfaces
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00027
Guy Eichler, Luca Piccolboni, Davide Giri, L. Carloni
Hierarchical Wasserstein Alignment (HiWA) is one of the most promising Brain-Computer Interface algorithms. To enable its real-time communication with the brain and meet low-power requirements, we design and prototype a Linux-supporting, RISC-V based SoC that integrates multiple hardware accelerators. We conduct a thorough design-space exploration at the accelerator level and at the SoC level. With FPGA-based experiments, we show that one of our area-efficient SoCs provides 91x performance and 37x energy efficiency gains over software execution on an embedded processor. We further improve our gains (up to 3408x and 497x, respectively) by parallelizing the workload on multiple accelerator instances and by adopting point-to-point accelerator communication, which reduces memory accesses and software-synchronization overheads. The results include comparisons with multi-threaded software implementations of HiWA running on an Intel i7 and ARM A53 as well as a projection analysis showing that an ASIC implementation of our SoC would meet the needs of real-time Brain-Computer Interfaces.
{"title":"MasterMind: Many-Accelerator SoC Architecture for Real-Time Brain-Computer Interfaces","authors":"Guy Eichler, Luca Piccolboni, Davide Giri, L. Carloni","doi":"10.1109/ICCD53106.2021.00027","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00027","url":null,"abstract":"Hierarchical Wasserstein Alignment (HiWA) is one of the most promising Brain-Computer Interface algorithms. To enable its real-time communication with the brain and meet low-power requirements, we design and prototype a Linux-supporting, RISC-V based SoC that integrates multiple hardware accelerators. We conduct a thorough design-space exploration at the accelerator level and at the SoC level. With FPGA-based experiments, we show that one of our area-efficient SoCs provides 91x performance and 37x energy efficiency gains over software execution on an embedded processor. We further improve our gains (up to 3408x and 497x, respectively) by parallelizing the workload on multiple accelerator instances and by adopting point-to-point accelerator communication, which reduces memory accesses and software-synchronization overheads. The results include comparisons with multi-threaded software implementations of HiWA running on an Intel i7 and ARM A53 as well as a projection analysis showing that an ASIC implementation of our SoC would meet the needs of real-time Brain-Computer Interfaces.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132891279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CIDAN: Computing in DRAM with Artificial Neurons
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00062
G. Singh, Ankit Wagle, S. Vrudhula, S. Khatri
Numerous applications such as graph processing, cryptography, databases, and bioinformatics involve the repeated evaluation of Boolean functions on large bit vectors. In-memory architectures, which perform processing in memory (PIM), are tailored for such applications. This paper describes a different architecture for in-memory computation called CIDAN, which achieves a 3X improvement in performance and a 2X improvement in energy for a representative set of algorithms over state-of-the-art in-memory architectures. CIDAN uses a new basic processing element called a TLPE, which comprises a threshold logic gate (TLG), a.k.a. an artificial neuron or perceptron. The implementation of a TLG within a TLPE is equivalent to a multi-input, edge-triggered flip-flop that computes a subset of the threshold functions of its inputs. The specific threshold function is selected on each cycle by enabling/disabling a subset of the weights associated with the threshold function using logic signals. In addition to the TLG, a TLPE realizes some non-threshold functions by a sequence of TLG evaluations. An equivalent CMOS implementation of a TLPE requires substantially higher area and power. CIDAN has an array of TLPEs integrated with a DRAM, allowing fast evaluation of any one of its set of functions on large bit vectors. Results of running several common in-memory applications in graph processing and cryptography are presented.
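The function computed by a TLG with per-cycle weight enables can be captured concisely; the sketch below is a behavioral model with made-up weights, not CIDAN's circuit or its sequencing of non-threshold functions:

```python
def threshold_gate(inputs, weights, enables, threshold):
    """Fire iff the weighted sum of the enabled inputs reaches the threshold.
    inputs: 0/1 values; weights: per-input integer weights;
    enables: per-cycle mask selecting which weights participate."""
    total = sum(x * w for x, w, e in zip(inputs, weights, enables) if e)
    return int(total >= threshold)

# With unit weights all enabled, this evaluates a 2-of-3 majority function.
print(threshold_gate([1, 0, 1], [1, 1, 1], [1, 1, 1], threshold=2))  # -> 1
```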
{"title":"CIDAN: Computing in DRAM with Artificial Neurons","authors":"G. Singh, Ankit Wagle, S. Vrudhula, S. Khatri","doi":"10.1109/ICCD53106.2021.00062","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00062","url":null,"abstract":"Numerous applications such as graph processing, cryptography, databases, bioinformatics, etc., involve the repeated evaluation of Boolean functions on large bit vectors. In-memory architectures which perform processing in memory (PIM) are tailored for such applications. This paper describes a different architecture for in-memory computation called CIDAN, that achieves a 3X improvement in performance and a 2X improvement in energy for a representative set of algorithms over the state-of-the-art in-memory architectures. CIDAN uses a new basic processing element called a TLPE, which comprises a threshold logic gate (TLG) (a.k.a artificial neuron or perceptron). The implementation of a TLG within a TLPE is equivalent to a multi-input, edge-triggered flipflop that computes a subset of threshold functions of its inputs. The specific threshold function is selected on each cycle by enabling/disabling a subset of the weights associated with the threshold function, by using logic signals. In addition to the TLG, a TLPE realizes some non-threshold functions by a sequence of TLG evaluations. An equivalent CMOS implementation of a TLPE requires a substantially higher area and power. CIDAN has an array of TLPE(s) that is integrated with a DRAM, to allow fast evaluation of any one of its set of functions on large bit vectors. Results of running several common in-memory applications in graph processing and cryptography are presented.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129276677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00071
Nazareno Bruschi, Germain Haugou, Giuseppe Tagliavini, Francesco Conti, L. Benini, D. Rossi
The last few years have seen the emergence of IoT processors: ultra-low power systems-on-chip (SoCs) combining lightweight and flexible micro-controller units (MCUs), often based on open-ISA RISC-V cores, with application-specific accelerators to maximize performance and energy efficiency. Overall, this level of heterogeneity requires complex hardware and a full-fledged software stack to orchestrate the execution and exploit platform features. For this reason, enabling agile design-space exploration becomes a crucial asset for this new class of low-power SoCs. In this scenario, high-level simulators play an essential role in breaking the speed and design-effort bottlenecks of cycle-accurate simulators and FPGA prototypes, respectively, while preserving functional and timing accuracy. We present GVSoC, a highly configurable and timing-accurate event-driven simulator that combines the efficiency of C++ models with the flexibility of Python configuration scripts. GVSoC is fully open-sourced, with the intent to drive future research in the area of highly parallel and heterogeneous RISC-V based IoT processors, leveraging three foundational features: Python-based modular configuration of the hardware description, easy calibration of platform parameters for accurate performance estimation, and high-speed simulation. Experimental results show that GVSoC enables practical functional and performance analysis and design exploration at the full-platform level (processors, memory, peripherals, and IOs) with a speed-up of 2500× over cycle-accurate simulation, with errors typically below 10% for performance analysis.
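At its core, an event-driven simulator of this kind schedules timestamped callbacks; the sketch below shows only that generic engine loop, with an invented API that is not GVSoC's actual interface:

```python
import heapq

class EventQueue:
    """Minimal event-driven engine: components post (delay, action) pairs."""
    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker so equal-time events keep their posting order

    def post(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self._queue)
            action()

# Example: a timer component raising an interrupt 100 time units after being armed.
engine = EventQueue()
engine.post(100, lambda: print(f"[{engine.now}] timer interrupt"))
engine.run(until=1_000)
```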
{"title":"GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors","authors":"Nazareno Bruschi, Germain Haugou, Giuseppe Tagliavini, Francesco Conti, L. Benini, D. Rossi","doi":"10.1109/ICCD53106.2021.00071","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00071","url":null,"abstract":"The last few years have seen the emergence of IoT processors: ultra-low power systems-on-chips (SoCs) combining lightweight and flexible micro-controller units (MCUs), often based on open-ISA RISC-V cores, with application-specific accelerators to maximize performance and energy efficiency. Overall, this heterogeneity level requires complex hardware and a full-fledged software stack to orchestrate the execution and exploit platform features. For this reason, enabling agile design space exploration becomes a crucial asset for this new class of low-power SoCs. In this scenario, high-level simulators play an essential role in breaking the speed and design effort bottlenecks of cycle-accurate simulators and FPGA prototypes, respectively, while preserving functional and timing accuracy. We present GVSoC, a highly configurable and timing-accurate event-driven simulator that combines the efficiency of C++ models with the flexibility of Python configuration scripts. GVSoC is fully open-sourced, with the intent to drive future research in the area of highly parallel and heterogeneous RISC-V based IoT processors, leveraging three foundational features: Python-based modular configuration of the hardware description, easy calibration of platform parameters for accurate performance estimation, and high-speed simulation. Experimental results show that GVSoC enables practical functional and performance analysis and design exploration at the full-platform level (processors, memory, peripherals and IOs) with a speed-up of 2500× with respect to cycle-accurate simulation with errors typically below 10% for performance analysis.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115298591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WiNN: Wireless Interconnect based Neural Network Accelerator
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00052
Siqin Liu, S. Karmunchi, Avinash Karanth, S. Laha, S. Kaya
Deep Neural Networks (DNNs) have demonstrated promising accuracy for several applications such as image processing, speech recognition, and autonomous systems and vehicles. Spatial accelerators have been proposed to achieve high parallelism with arrays of processing elements (PEs) and energy-efficient data movement using traditional Network-on-Chip (NoC) architectures. However, larger DNN models impose high-bandwidth and low-latency communication demands between PEs, which is a fundamental challenge for metallic NoC architectures. In this paper, we propose WiNN, a wireless and wired interconnected neural network accelerator that employs on-chip wireless links to provide high network bandwidth and single-cycle multicast communication. We design separate wireless networks modulated on two different frequency bands, one each for the weights and the inputs; highly directional antennas are implemented to avoid noise and interference. We propose a multicast-for-wireless (MW) dataflow for our accelerator that efficiently exploits the wireless channels' multicast capabilities to reduce communication overheads. Our novel wireless transmitter integrates an on-off keying (OOK) modulator with a power amplifier, which results in significant energy savings. Our simulation results show that WiNN achieves a 74% latency reduction and 37.5% energy saving compared to state-of-the-art metallic-link-based accelerators, and a 38.1% latency reduction and 19.4% energy saving compared to prior wireless accelerators, across various neural networks (AlexNet, VGG16, and ResNet-50).
{"title":"WiNN: Wireless Interconnect based Neural Network Accelerator","authors":"Siqin Liu, S. Karmunchi, Avinash Karanth, S. Laha, S. Kaya","doi":"10.1109/ICCD53106.2021.00052","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00052","url":null,"abstract":"Deep Neural Networks (DNNs) have demonstrated promising performance in accuracy for several applications such as image processing, speech recognition, and autonomous systems and vehicles. Spatial accelerators have been proposed to achieve high parallelism with arrays of processing elements (PE) and energy efficient data movement using traditional Network-on-Chip (NoC) architectures. However, larger DNN models impose high bandwidth and low latency communication demands between PEs, which is a fundamental challenge for metallic NoC architectures. In this paper, we propose WiNN, a wireless and wired interconnected neural network accelerator that employs on-chip wireless links to provide high network bandwidth and single cycle multicast communication. We design separate wireless networks modulated with two different frequency bands one each for the weights and input Highly directional antennas are implemented to avoid noise and interference. We propose multicast-for-wireless (MW) dataflow for our proposed accelerator that efficiently exploits the wireless channels’ multicast capabilities to reduce the communication overheads. Our novel wireless transmitter integrates on-off keying (OOK) modulator with power amplifier that results in significant energy savings. Our simulation results show that WiNN achieves 74% latency reduction and 37.5% energy saving when compared to state-of-art metallic link-based accelerators, 38.1% latency reduction and 19.4% energy saving when compared to prior wireless accelerators for various neural networks (AlexNet, VGG16, and ResNet-50).","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"434 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115866279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PSACS: Highly-Parallel Shuffle Accelerator on Computational Storage
Pub Date: 2021-10-01, DOI: 10.1109/iccd53106.2021.00080
Chen Zou, Hui Zhang, A. Chien, Y. Ki
Shuffle is an indispensable process in distributed online analytical processing systems for exploiting task-level parallelism across multiple nodes. As a data-intensive data-reorganization process, shuffle implemented on general-purpose CPUs not only incurs data traffic back and forth between the computing and storage resources, but also pollutes the cache hierarchy with almost zero data reuse. As a result, shuffle can easily become the bottleneck of distributed analysis pipelines. Our PSACS approach attacks these bottlenecks with the rising computational storage paradigm. Shuffle is offloaded to the storage-side PSACS accelerator to avoid polluting the compute node's memory hierarchy and to enjoy the latency, bandwidth, and energy benefits of near-data computing. Further, the microarchitecture of PSACS exploits data-, subtask-, and task-level parallelism for high performance, along with a customized scratchpad for fast on-chip random access. PSACS achieves 4.6x-5.7x shuffle throughput at the kernel level and up to 1.3x overall shuffle throughput with only a twentieth of the CPU utilization of software baselines. These add up to a 23% end-to-end OLAP query speedup on average.
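As background on the operation being offloaded (a plain-Python sketch of a hash shuffle, not the PSACS microarchitecture), a shuffle repartitions rows by key so that each downstream node receives exactly one bucket:

```python
def hash_shuffle(rows, key, num_partitions):
    """Group rows into per-destination buckets by hashing the partition key.
    A real system would use a stable hash so all nodes agree on the mapping."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return buckets

rows = [{"user": "a", "v": 1}, {"user": "b", "v": 2}, {"user": "a", "v": 3}]
for node, bucket in enumerate(hash_shuffle(rows, "user", num_partitions=2)):
    print(node, bucket)
```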
{"title":"PSACS: Highly-Parallel Shuffle Accelerator on Computational Storage","authors":"Chen Zou, Hui Zhang, A. Chien, Y. Ki","doi":"10.1109/iccd53106.2021.00080","DOIUrl":"https://doi.org/10.1109/iccd53106.2021.00080","url":null,"abstract":"Shuffle is an indispensable process in distributed online analytical processing systems to enable task-level parallelism exploitation via multiple nodes. As a data-intensive data reorganization process, shuffle implemented on general-purpose CPUs not only incurs data traffic back and forth between the computing and storage resources, but also pollutes the cache hierarchy with almost zero data reuse. As a result, shuffle can easily become the bottleneck of distributed analysis pipelines.Our PSACS approach attacks these bottlenecks with the rising computational storage paradigm. Shuffle is offloaded to the storage-side PSACS accelerator to avoid polluting computing node memory hierarchy and enjoy the latency, bandwidth and energy benefits of near-data computing. Further, the microarchitecture of PSACS exploits data-, subtask-, and task-level parallelism for high performance and a customized scratchpad for fast on-chip random access.PSACS achieves 4.6x—5.7x shuffle throughput at kernel-level and up to 1.3x overall shuffle throughput with only a twentieth of CPU utilization comparing to software baselines. These mount up to 23% end-to-end OLAP query speedup on average.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123759442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resonance-Based Power-Efficient Pulse Generator Design with Corresponding Distribution Network
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00063
K. Jia, Liang Yang, Jian Wang, B. Lin, Hao Wang, Rui-xin Shi
Pulsed latches are regarded as sequential elements competing with flip-flops, mainly for their low-power and high-performance advantages. In a typical pulsed-latch system, an explicit or implicit pulse generator (PG) is used to generate the necessary clock pulse, contributing a significant amount of power consumption. To address this, a novel resonance-based power-efficient PG circuit called RPG is proposed. In 12nm FinFET simulations of typical multi-bit applications, RPG shows a power reduction of up to 60% and more stable performance across temperature and voltage variations compared with other PG circuits. Furthermore, a distribution method for integrating RPG into traditional designs is provided. The evaluation in a test core shows that it achieves up to a 21% clock power reduction, with less clock skew overhead and no device area loss.
{"title":"Resonance-Based Power-Efficient Pulse Generator Design with Corresponding Distribution Network","authors":"K. Jia, Liang Yang, Jian Wang, B. Lin, Hao Wang, Rui-xin Shi","doi":"10.1109/ICCD53106.2021.00063","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00063","url":null,"abstract":"Pulsed-latches are treated as competing sequential elements to flip-flops, mainly for their low-power and high-performance advantages. In a typical pulsed-latch system, an explicit or implicit pulse generator (PG) is used to generate the necessary clock pulse, contributing a significant amount of power consumption. To address it, a novel resonance-based power-efficient PG circuit called RPG is proposed. A power reduction up to 60% and a more stable performance in variable temperature and voltage environments are shown in 12nm Fin-FET simulations as compared with other PG circuits in typical multi-bit applications. Furthermore, a distribution method of integrating RPG into traditional designs is provided. The evaluation in a test core shows that it achieves up to 21% in clock power reduction, with less clock skew overhead and no device area loss.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123782179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-performance Post-deduplication Delta Compression Scheme for Packed Datasets
Pub Date: 2021-10-01, DOI: 10.1109/ICCD53106.2021.00078
Yucheng Zhang, Hong Jiang, Mengtian Shi, Chunzhi Wang, Nan Jiang, Xinyun Wu
Data deduplication has become a standard feature in most backup storage systems to reduce storage costs. In real-world deduplication-based backup products, small files are grouped into larger packed files prior to deduplication. For each file, the grouping inserts a metadata block immediately before the file contents. Since the contents of these metadata blocks vary with every backup, different backup streams of packed files built from the same or highly similar small files will contain chunks that conventional deduplication considers mostly unique. That is, most of the contents of these unique chunks in different backups are identical except for the metadata blocks. Delta compression can remove this redundancy but cannot be applied to backup storage directly, because the extra I/Os required to retrieve the base chunks significantly decrease backup throughput. If there are many grouped small files in the backup datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), may be rewritten repeatedly. We observe that PFCs are often surrounded by substantial unique chunks containing metadata blocks. In this paper, we propose a PFC-inspired delta compression scheme to efficiently perform delta compression for unique chunks surrounding identical PFCs. In the process of deduplication, containers holding previous copies of the chunks being considered for storage are accessed to prefetch metadata and accelerate the detection of duplicates. The main idea behind our scheme is to identify containers holding PFCs and prefetch the chunks in those containers by piggybacking on the reads that already prefetch metadata during deduplication. Base chunks for delta compression are then detected among the prefetched chunks, eliminating the extra I/Os for retrieving base chunks. Experimental results show that PFC-inspired delta compression attains about 2x additional data reduction on top of deduplication and accelerates restore speed by 8.6%-49.3%, while moderately sacrificing backup throughput by 0.5%-11.9%.
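To make the saved work concrete, the sketch below delta-encodes a unique chunk against a base chunk that is already resident because its container was read anyway; the copy/literal format and difflib-based matching are our own illustrative choices, not the paper's encoder:

```python
from difflib import SequenceMatcher

def delta_encode(base: bytes, chunk: bytes):
    """Encode `chunk` as copy-from-base and literal operations."""
    ops = []
    for tag, b0, b1, c0, c1 in SequenceMatcher(a=base, b=chunk, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", b0, b1 - b0))      # bytes reused from the base chunk
        else:
            ops.append(("literal", chunk[c0:c1]))  # only the new bytes (e.g., metadata blocks)
    return ops

def delta_decode(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```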
{"title":"A High-performance Post-deduplication Delta Compression Scheme for Packed Datasets","authors":"Yucheng Zhang, Hong Jiang, Mengtian Shi, Chunzhi Wang, Nan Jiang, Xinyun Wu","doi":"10.1109/ICCD53106.2021.00078","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00078","url":null,"abstract":"Data deduplication has become a standard feature in most storage backup systems to reduce storage costs. In real-world deduplication-based backup products, small files are grouped into larger packed files prior to deduplication. For each file, the grouping entails a backup product inserting a metadata block immediately before the file contents. Since the contents of these metadata blocks vary with every backup, different backup streams of the packed files from the same or highly similar small files will contain chunks that are considered mostly unique by conventional deduplication. That is, most of the contents among these unique chunks in different backups are identical, except for metadata blocks. Delta compression is able to remove those redundancy but cannot be applied to backup storage because the extra I/Os required to retrieve the base chunks significantly decrease backup throughput. If there are many grouped small files in the backup datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), may be rewritten repeatedly. We observe that PFCs are often surrounded by substantial unique chunks containing metadata blocks. In this paper, we propose a PFC-inspired delta compression scheme to efficiently perform delta compression for unique chunks surrounding identical PFCs.In the process of deduplication, containers holding previous copies of the chunks being considered for storage will be accessed for prefetching metadata to accelerate the detection of duplicates. The main idea behind our scheme is to identify containers holding PFCs and prefetch chunks in those containers by piggybacking on the reads for prefetching metadata when they are accessed during deduplication. Base chunks for delta compression are then detected from the prefetched chunks, thus eliminating extra I/Os for retrieving the base chunks. Experimental results show that PFC-inspired delta compression attains additional data reduction by about 2x on top of data deduplications and accelerates the restore speed by 8.6%-49.3%, while moderately sacrificing the backup throughput by 0.5%-11.9%.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"15 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127560418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}