
Latest publications from the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)

AntiDote: Attention-based Dynamic Optimization for Neural Network Runtime Efficiency
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116416
Fuxun Yu, Chenchen Liu, Di Wang, Yanzhi Wang, Xiang Chen
Convolutional Neural Networks (CNNs) achieve strong cognitive performance at the expense of considerable computation load. To relieve this load, many optimization techniques have been developed to reduce model redundancy by identifying and removing insignificant model components, e.g., via weight sparsification and filter pruning. However, these techniques evaluate only the static significance of model components based on internal parameter information, ignoring their dynamic interaction with external inputs. With per-input feature activation, the significance of model components can change dynamically, so static methods can achieve only sub-optimal results. Therefore, in this work we propose a comprehensive dynamic CNN optimization framework based on the neural network attention mechanism, including (1) testing-phase channel and column feature-map pruning and (2) training-phase optimization by targeted dropout. Such a dynamic optimization framework has several benefits: (1) it can accurately identify and aggressively remove per-input feature redundancy while taking the model-input interaction into account; (2) it can maximally remove feature-map redundancy in various dimensions thanks to its multi-dimension flexibility; (3) the training-testing co-optimization favors dynamic pruning and helps maintain model accuracy even at very high feature pruning ratios. Extensive experiments show that our method brings a 37.4%∼54.5% FLOPs reduction with negligible accuracy drop on a variety of test networks.
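As a rough illustration of per-input (dynamic) channel pruning, the sketch below scores each channel of one input's feature maps by its mean absolute activation, a simple stand-in for the paper's attention mechanism, and masks the lowest-scoring channels. The function name and scoring rule are illustrative, not the authors' implementation.

```python
import numpy as np

def attention_channel_prune(feature_maps, keep_ratio=0.5):
    """Zero out the least-salient channels of one input's feature maps.

    feature_maps has shape (C, H, W). The per-channel score is the mean
    absolute activation -- a simple stand-in for an attention module.
    """
    c = feature_maps.shape[0]
    scores = np.abs(feature_maps).mean(axis=(1, 2))   # one score per channel
    n_keep = max(1, int(round(c * keep_ratio)))
    keep = np.argsort(scores)[-n_keep:]               # indices of the top channels
    mask = np.zeros(c, dtype=bool)
    mask[keep] = True
    return feature_maps * mask[:, None, None], mask

# Two inputs generally keep *different* channels: the pruning is dynamic,
# unlike static methods that fix the pruned set from the weights alone.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 4, 4))
x2 = rng.normal(size=(8, 4, 4))
_, m1 = attention_channel_prune(x1)
_, m2 = attention_channel_prune(x2)
```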
Citations: 7
GhostBusters: Mitigating Spectre Attacks on a DBT-Based Processor
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116402
Simon Rokicki
Unveiled in early 2018, the Spectre vulnerability affects most modern high-performance processors. Spectre variants exploit speculative execution mechanisms and a cache side-channel attack to leak secret data. As of today, the main countermeasures consist of turning off speculation, which drastically reduces processor performance. In this work, we focus on a different kind of micro-architecture: DBT-based processors, such as Transmeta Crusoe [1], NVidia Denver [2], or Hybrid-DBT [3]. Instead of using complex out-of-order (OoO) mechanisms, these cores combine a software Dynamic Binary Translation (DBT) mechanism with a parallel in-order architecture, typically a VLIW core. The DBT layer translates and optimizes binaries before their execution. Studies show that DBT-based processors can reach the performance level of OoO cores for sufficiently regular applications. In this paper, we demonstrate that, even though these processors do not use OoO execution, they are still vulnerable to Spectre variants because of the DBT optimizations. However, we also demonstrate that these systems can easily be patched, as the DBT is done in software and has fine-grained control over the optimization process.
Citations: 3
Modeling and Verifying Uncertainty-Aware Timing Behaviors using Parametric Logical Time Constraint
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116344
Fei Gao, F. Mallet, Min Zhang, Mingsong Chen
The Clock Constraint Specification Language (CCSL) is a logical-time-based modeling language for formalizing the timing behaviors of real-time and embedded systems. However, it cannot capture timing behaviors that contain uncertainties, e.g., uncertainty in execution time and period. This limits the language's applicability to real-world systems, where uncertainty often arises in practice from both internal and external factors. To capture uncertainties in timing behaviors, in this paper we extend CCSL by introducing parameters into constraints. We then propose an approach that transforms parametric CCSL constraints into SMT formulas for efficient verification. We apply our approach to an industrial case posed as the 2015 FMTV (Formal Methods for Timing Verification) Challenge, which shows that timing behaviors with uncertainties can be effectively modeled and verified using parametric CCSL.
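To see what verifying a parametric timing constraint involves, the toy check below brute-forces a CCSL-style alternation property over an integer range of an uncertain period parameter. The clock names, the offset model, and the bounds are invented for illustration; the paper's approach encodes such checks symbolically with SMT instead of enumerating.

```python
# Brute-force check of an uncertainty-aware timing property over a bounded
# parameter range -- a toy analogue of a symbolic SMT encoding.

def ticks(period, horizon):
    """Tick instants of a periodic logical clock within [0, horizon]."""
    return list(range(period, horizon + 1, period))

def alternates(a, b):
    """CCSL-style alternation: a[i] < b[i] < a[i+1] for every i."""
    n = min(len(a), len(b))
    return all(a[i] < b[i] and (i + 1 >= n or b[i] < a[i + 1])
               for i in range(n))

def holds_for_all_periods(p_min, p_max, offset, horizon=100):
    """Check 'sensor alternates with actuator' for every period in range."""
    for p in range(p_min, p_max + 1):            # enumerate the uncertain period
        sensor = ticks(p, horizon)
        actuator = [t + offset for t in sensor]  # actuator fires `offset` later
        if not alternates(sensor, actuator):
            return False
    return True
```

The property holds exactly when the response offset is smaller than every admissible period, which is the kind of parameter condition an SMT solver would derive symbolically.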
Citations: 2
On the Automatic Exploration of Weight Sharing for Deep Neural Network Compression
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116350
Etienne Dupuis, D. Novo, Ian O’Connor, A. Bosio
Deep neural networks demonstrate impressive levels of performance, particularly in computer vision and speech recognition. However, their computational workload and associated storage inhibit their potential in resource-limited embedded systems. The approximate computing paradigm, which improves performance and energy efficiency by relaxing the need for fully accurate operations, has been widely explored in the literature. There is a large number of implementation options with very different approximation strategies, such as pruning, quantization, low-rank factorization, and knowledge distillation. To the best of our knowledge, no automated approach exists to explore, select, and generate the best approximate versions of a given convolutional neural network (CNN) according to the design objectives. The goal of this work in progress is to demonstrate that the design space exploration phase can enable significant network compression without noticeable accuracy loss. We demonstrate this via an example based on weight sharing and show that our method can obtain a 4x compression rate on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN) without re-training and without any accuracy loss.
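Weight sharing, the approximation strategy the example above relies on, can be sketched with a small 1-D k-means over the weights: each weight is replaced by its nearest shared centroid, so the model stores a few float centroids plus a short index per weight instead of one float per weight. This is a generic sketch of the technique, not the authors' exploration flow.

```python
import numpy as np

def share_weights(w, n_clusters=4, iters=20, seed=0):
    """Replace each weight by its nearest shared centroid (1-D k-means).

    The compressed model stores n_clusters floats plus one small integer
    index per weight -- the source of the compression.
    """
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    centroids = rng.choice(flat, size=n_clusters, replace=False)
    for _ in range(iters):                        # Lloyd iterations in 1-D
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):                  # skip empty clusters
                centroids[k] = flat[idx == k].mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[idx].reshape(w.shape), idx, centroids

rng = np.random.default_rng(1)
weights = rng.normal(size=(16, 16)).astype(np.float32)
shared, idx, centroids = share_weights(weights)
# 16x16 float32 weights take 1024 bytes; the shared form needs only
# 4 centroids plus 256 two-bit indices.
```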
Citations: 14
Efficient and Robust High-Level Synthesis Design Space Exploration through offline Micro-kernels Pre-characterization
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116309
Zi Wang, Jianqi Chen, Benjamin Carrión Schäfer
This work proposes a method to accelerate High-Level Synthesis (HLS) Design Space Exploration (DSE) by pre-characterizing micro-kernels offline and creating predictive models of them. HLS can generate different types of micro-architectures from the same untimed behavioral description. This is typically done by setting different combinations of synthesis options in the form of synthesis directives, specified as pragmas in the code, which control, e.g., how loops, arrays, and functions should be synthesized. Unique combinations of these pragmas lead to micro-architectures with unique area vs. performance/power trade-offs. The main problem is that the search space grows exponentially with the number of explorable operations. Thus, the main goal of efficient HLS DSE is to quickly find the synthesis directive combinations that lead to the Pareto-optimal designs. Our proposed method pre-characterizes micro-kernels offline, creates a predictive model for each kernel, and uses the results to explore a new, unseen behavioral description with compositional methods. In addition, we use perceptual hashing to match new, unseen micro-kernels with the pre-characterized ones in order to further speed up the search. Experimental results show that our proposed method is orders of magnitude faster than traditional methods.
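The Pareto-front selection at the heart of HLS DSE can be sketched as follows. The pragma settings and the area/latency numbers in the toy cost model are entirely made up for illustration; a real flow would obtain them from synthesis runs or predictive models.

```python
from itertools import product

# Hypothetical cost model: each pragma setting scales area and latency.
UNROLL = {1: (1.0, 1.0), 2: (1.8, 0.6), 4: (3.2, 0.55)}   # factor -> scales
PARTITION = {"none": (1.0, 1.0), "cyclic": (1.5, 0.7)}

def evaluate(unroll, partition):
    """Return (area, latency) of one pragma combination under the toy model."""
    ua, ul = UNROLL[unroll]
    pa, pl = PARTITION[partition]
    return (100 * ua * pa, 50 * ul * pl)

def pareto_front(points):
    """Keep the points not dominated in (area, latency); lower is better."""
    def dominated(p, q):
        return q[0] <= p[0] and q[1] <= p[1] and q != p
    return [p for p in points if not any(dominated(p, q) for q in points)]

evaluated = {cfg: evaluate(*cfg) for cfg in product(UNROLL, PARTITION)}
front = pareto_front(list(evaluated.values()))
# Under this model, (unroll=4, partition="none") is both larger and slower
# than (unroll=2, partition="cyclic"), so it drops out of the front.
```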
Citations: 8
Flexible Group-Level Pruning of Deep Neural Networks for On-Device Machine Learning
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116287
Kwangbae Lee, Hoseung Kim, Hayun Lee, Dongkun Shin
Network pruning is a promising compression technique to reduce the computation and memory-access cost of deep neural networks. Pruning techniques fall into two types: fine-grained and coarse-grained. Fine-grained pruning eliminates individual connections if they are insignificant and thus usually generates irregular networks, making it hard to reduce model execution time. Coarse-grained pruning, such as filter-level and channel-level techniques, produces hardware-friendly networks but can suffer from low accuracy. In this paper, we focus on group-level pruning to accelerate deep neural networks on mobile GPUs: several adjacent weights are pruned as a group to mitigate the irregularity of pruned networks while preserving high accuracy. Although several group-level pruning techniques have been proposed, previous techniques select the weight groups to be pruned at group-size-aligned locations. In this paper, we propose a more flexible approach, called unaligned group-level pruning, to improve the accuracy of the compressed model. We find the optimal solution of the unaligned group-selection problem with dynamic programming. Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that at 95% sparsity, 2D unaligned group-level pruning achieves a 3.12% lower error rate than the previous 2D aligned group-level pruning with ResNet-20 on CIFAR-10.
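A minimal sketch of the unaligned group-selection idea: a dynamic program over a 1-D weight vector that picks k non-overlapping length-g groups of minimal total magnitude at arbitrary (not group-size-aligned) offsets. This is an illustrative simplification of the paper's 2D formulation.

```python
def best_unaligned_groups(w, g, k):
    """Pick k non-overlapping length-g groups, at arbitrary (unaligned)
    offsets, whose total absolute magnitude is minimal -- these are the
    groups to prune. Solved exactly with an O(n*k) dynamic program."""
    n = len(w)
    INF = float("inf")
    cost = [sum(abs(x) for x in w[i:i + g]) for i in range(n - g + 1)]
    # dp[i][j] = minimal prunable magnitude choosing j groups within w[i:]
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 0.0
    for i in range(n - 1, -1, -1):
        for j in range(1, k + 1):
            dp[i][j] = dp[i + 1][j]                       # skip position i
            if i + g <= n and dp[i + g][j - 1] < INF:     # or start a group here
                dp[i][j] = min(dp[i][j], cost[i] + dp[i + g][j - 1])
    offsets, i, j = [], 0, k                              # recover the choice
    while j > 0:
        if i + g <= n and dp[i][j] == cost[i] + dp[i + g][j - 1]:
            offsets.append(i)
            i, j = i + g, j - 1
        else:
            i += 1
    return offsets, dp[0][k]

# Aligned offsets (0, 2, 4, 6) would be forced to prune large weights;
# the unaligned optimum picks the two small-magnitude groups instead.
w = [5, 0.1, 0.1, 5, 5, 0.1, 0.1, 5]
offsets, magnitude = best_unaligned_groups(w, g=2, k=2)   # offsets == [1, 5]
```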
Citations: 13
GRAMARCH: A GPU-ReRAM based Heterogeneous Architecture for Neural Image Segmentation
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116273
Biresh Kumar Joardar, Nitthilan Kanappan Jayakodi, J. Doppa, H. Li, P. Pande, K. Chakrabarty
Deep Neural Networks (DNNs) employed for image segmentation are computationally more expensive and complex than those used for classification. However, manycore architectures to accelerate the training of these DNNs remain relatively unexplored. Resistive random-access memory (ReRAM)-based architectures offer a promising alternative to the commonly used GPU-based platforms for training DNNs. However, due to their low-precision storage capability, these architectures cannot support all DNN layers and suffer from accuracy loss in the learned models. To address these challenges, we propose GRAMARCH, a heterogeneous architecture that combines the benefits of ReRAM and GPUs by using a high-throughput 3D Network-on-Chip. Experimental results indicate that by suitably mapping DNN layers to processing elements, it is possible to achieve up to 53X better performance than conventional GPUs for image segmentation.
Citations: 9
Using Universal Composition to Design and Analyze Secure Complex Hardware Systems
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116295
R. Canetti, Marten van Dijk, Hoda Maleki, U. Rührmair, P. Schaumont
Modern hardware is typically characterized by a multitude of interacting physical components and software mechanisms. To address this complexity, security analysis should be modular: we would like to formulate and prove security properties of individual components, and then deduce the security of the overall design (encompassing hardware and software) from the security of the components. While this seems like an elusive goal, we argue that it is essentially the only feasible way to provide rigorous security analysis of modern hardware. This paper investigates the possibility of using the Universally Composable (UC) security framework towards this aim. The UC framework has been devised and successfully used in the theoretical cryptography community to study and formally prove the security of arbitrarily interleaved cryptographic protocols; in particular, a sophisticated analytical toolbox has been developed around it. We provide an introduction to this framework and investigate, via a number of examples, ways in which it can be used to facilitate a novel type of modular security analysis. This analysis applies to combined hardware and software systems and investigates their security against attacks that combine physical and digital steps.
Citations: 2
ARS: Reducing F2FS Fragmentation for Smartphones using Decision Trees
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116318
Lihua Yang, F. Wang, Zhipeng Tan, D. Feng, Jiaxing Qian, Shiyun Tu
As is well known, file and free-space fragmentation negatively affect file system performance. F2FS is a file system designed for flash memory. However, it suffers from severe fragmentation due to its out-of-place updates and the highly synchronous, multi-threaded writing behavior of mobile applications. We observe that the running time of fragmented files is 2.36× longer than that of continuous files, and that F2FS’s in-place update scheme is incapable of reducing fragmentation. A fragmented file system leads to a poor user experience.

Reserving space to prevent fragmentation is an intuitive approach. However, reserving space for all files wastes space, since there are a large number of files. To resolve this dilemma, we propose an adaptive reserved space (ARS) scheme that chooses specific files to update in the reserved space. How to effectively select the reserved files is critical to performance. We collect file characteristics associated with fragmentation to construct data sets, and use decision trees to accurately pick reserved files. In addition, an adjustable reserved space and a dynamic reservation strategy are adopted. We implement ARS on a HiKey960 development platform and a commercial smartphone with only slight space and file-creation-time overheads. Experimental results show that, compared to traditional F2FS and F2FS with in-place updates, ARS dramatically reduces file and free-space fragmentation, improves file I/O performance, and reduces garbage collection overhead. Furthermore, ARS delivers up to 1.26× the transactions per second of traditional F2FS under SQLite, and reduces the running time of realistic workloads by up to 41.72% compared to F2FS with in-place updates.
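The per-file selection step described above can be sketched as a tiny classifier. The feature names and thresholds below are assumptions for illustration only; the paper trains a real decision tree offline on collected file characteristics that this abstract does not enumerate:

```python
# Hypothetical sketch of ARS-style reserved-file selection.
# Feature names ('overwrite_ratio', 'sync_writes', 'size_kb') and all
# thresholds are illustrative assumptions, not taken from the paper.

def should_reserve(file_stats):
    """Decide whether a file should be updated in the reserved area."""
    # Frequently overwritten files fragment fastest under out-of-place updates.
    if file_stats['overwrite_ratio'] > 0.5:
        # Small, hot, fsync-heavy files (e.g. SQLite databases) benefit most.
        return file_stats['sync_writes'] > 100 or file_stats['size_kb'] < 4096
    # Mostly-appended files (e.g. logs) stay on the normal update path.
    return False

db  = {'overwrite_ratio': 0.9, 'sync_writes': 500, 'size_kb': 512}    # reserve
log = {'overwrite_ratio': 0.1, 'sync_writes': 3,   'size_kb': 20480}  # skip
print(should_reserve(db), should_reserve(log))  # True False
```

A learned tree would replace the hand-written branches, but the runtime shape is the same: evaluate a few cheap per-file counters at update time and route the file accordingly.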
Citations: 10
Modeling and Designing of a PVT Auto-tracking Timing-speculative SRAM
Pub Date : 2020-03-01 DOI: 10.23919/DATE48585.2020.9116569
Shan Shen, Tianxiang Shao, Ming Ling, Jun Yang, Longxing Shi
In the low-supply-voltage region, the performance of 6T-cell SRAM degrades seriously, as more time is needed to develop a sufficient voltage difference on the bitlines. Timing-speculative techniques have been proposed to boost SRAM frequency and throughput by speculatively reading data on an aggressive timing and correcting timing failures in one or more extended cycles. However, the throughput gains of timing-speculative SRAM are affected by process, voltage, and temperature (PVT) variations, which causes the timing design of speculative SRAM to be either too aggressive or too conservative.

This paper first proposes a statistical model that abstracts the characteristics of speculative SRAM and shows the existence of an optimal sensing time that maximizes overall throughput. Then, guided by this performance model, a PVT auto-tracking speculative SRAM is designed and fabricated, which can dynamically self-tune the bitline sensing to the optimal time as working conditions change. According to the measurement results, the maximum throughput gain of the proposed 28nm SRAM is 1.62× compared to the baseline at 0.6V VDD.
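Why an interior optimum for the sensing time exists can be illustrated with a toy throughput model: sensing longer lowers the misread probability but stretches every cycle, while a failed speculation costs extra correction cycles. The exponential failure model and every constant below are assumptions for illustration, not the paper's statistical model:

```python
import math

# Toy throughput model for a timing-speculative SRAM. All numbers are
# illustrative assumptions: tau_ns sets how fast the bitline difference
# develops, penalty_cycles is the cost of correcting a misspeculation.

def throughput(t_sense_ns, penalty_cycles=10, t_overhead_ns=0.5, tau_ns=0.8):
    """Accesses per ns for a given sensing time, under an assumed
    exponential failure model: p_fail = exp(-t_sense / tau)."""
    p_fail = math.exp(-t_sense_ns / tau_ns)
    cycle_ns = t_sense_ns + t_overhead_ns          # longer sensing, longer cycle
    expected_cycles = 1 + p_fail * penalty_cycles  # failures add correction cycles
    return 1.0 / (cycle_ns * expected_cycles)

# Sweep candidate sensing times and pick the throughput-optimal one,
# mirroring the paper's idea of self-tuning the sense timing as the
# working condition (here, tau) shifts.
candidates = [0.2 * k for k in range(1, 26)]  # 0.2 ns .. 5.0 ns
best = max(candidates, key=throughput)
print(f"optimal sensing time ~ {best:.1f} ns")  # → optimal sensing time ~ 2.8 ns
```

Too-short sensing (aggressive timing) pays the correction penalty almost every access; too-long sensing (conservative timing) wastes cycle time. A PVT shift changes the effective `tau_ns`, moving the optimum, which is what the auto-tracking hardware compensates for.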
Citations: 3
Journal
2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)