Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116416
AntiDote: Attention-based Dynamic Optimization for Neural Network Runtime Efficiency
Fuxun Yu, Chenchen Liu, Di Wang, Yanzhi Wang, Xiang Chen
Convolutional Neural Networks (CNNs) achieve great cognitive performance at the expense of considerable computation load. To relieve this load, many optimization works have been developed to reduce model redundancy by identifying and removing insignificant model components, for example through weight sparsification and filter pruning. However, these works only evaluate the static significance of model components using internal parameter information, ignoring their dynamic interaction with external inputs. Because component significance changes with per-input feature activation, such static methods can only achieve sub-optimal results. Therefore, we propose a dynamic CNN optimization framework in this work. Based on the neural network attention mechanism, the framework comprises (1) testing-phase channel and column feature map pruning and (2) training-phase optimization by targeted dropout. Such a dynamic optimization framework has several benefits: (1) it accurately identifies and aggressively removes per-input feature redundancy by considering the model-input interaction; (2) it maximally removes feature map redundancy across dimensions thanks to its multi-dimension flexibility; (3) the training-testing co-optimization favors dynamic pruning and helps maintain model accuracy even at very high feature pruning ratios. Extensive experiments show that our method brings 37.4%∼54.5% FLOPs reduction with negligible accuracy drop on various test networks.
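To make the per-input (dynamic) pruning idea concrete, the following is a minimal PyTorch-style sketch of channel-level feature map pruning driven by a lightweight attention score. The gate layer, the fixed pruning ratio, and the shapes are assumptions for illustration only; they are not the paper's AntiDote mechanism or its column pruning.

```python
import torch
import torch.nn as nn

class DynamicChannelPrune(nn.Module):
    """Per-input channel pruning guided by a small attention score (a sketch)."""

    def __init__(self, channels: int, prune_ratio: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(channels, channels)  # tiny attention head (assumed)
        self.prune_ratio = prune_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W); score each channel from its global average
        score = torch.sigmoid(self.gate(x.mean(dim=(2, 3))))
        k = int(x.size(1) * (1.0 - self.prune_ratio))           # channels to keep per input
        kept = score.topk(k, dim=1).indices
        mask = torch.zeros_like(score).scatter_(1, kept, 1.0)    # 1 for kept channels
        return x * mask[:, :, None, None]                        # zero pruned feature maps

if __name__ == "__main__":
    layer = DynamicChannelPrune(channels=64, prune_ratio=0.5)
    out = layer(torch.randn(2, 64, 8, 8))
    print((out.abs().sum(dim=(2, 3)) > 0).sum(dim=1))  # 32 surviving channels per input
```

Because the mask is computed per input, different inputs keep different channels, which is exactly the model-input interaction that static pruning cannot exploit.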
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116402
GhostBusters: Mitigating Spectre Attacks on a DBT-Based Processor
Simon Rokicki
Unveiled in early 2018, the Spectre vulnerability affects most modern high-performance processors. Spectre variants exploit speculative execution mechanisms and a cache side-channel attack to leak secret data. As of today, the main countermeasures consist of turning off speculation, which drastically reduces processor performance. In this work, we focus on a different kind of micro-architecture: DBT-based processors, such as Transmeta Crusoe [1], NVidia Denver [2], or Hybrid-DBT [3]. Instead of using complex out-of-order (OoO) mechanisms, these cores combine a software dynamic binary translation (DBT) mechanism with a parallel in-order architecture, typically a VLIW core. The DBT layer is in charge of translating and optimizing the binaries before their execution. Studies show that DBT-based processors can reach the performance level of OoO cores for sufficiently regular applications. In this paper, we demonstrate that, even though these processors do not use OoO execution, they are still vulnerable to Spectre variants because of the DBT optimizations. However, we also demonstrate that such systems can easily be patched, since the DBT is performed in software and has fine-grained control over the optimization process.
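As a toy illustration of the "easily patched in software" argument, the sketch below shows how a translation pass could insert a speculation barrier between a bounds-check branch and the load it guards (the classic Spectre-v1 gadget shape). The IR tuples, opcode names, and the "fence" pseudo-op are hypothetical; real DBT systems operate on machine-level code and this is not the paper's actual mitigation.

```python
def harden_trace(trace):
    """Insert a 'fence' pseudo-op after every conditional branch that is
    immediately followed by a memory load in the translated trace."""
    hardened = []
    for i, op in enumerate(trace):
        hardened.append(op)
        next_is_load = i + 1 < len(trace) and trace[i + 1][0] == "load"
        if op[0] == "branch_if" and next_is_load:
            hardened.append(("fence",))  # block speculative execution of the guarded load
    return hardened

if __name__ == "__main__":
    gadget = [("branch_if", "idx < size"),
              ("load", "array1[idx]"),
              ("load", "array2[array1[idx] * 512]")]
    for op in harden_trace(gadget):
        print(op)
```

The point of the sketch is only that a software translator can apply such a targeted fix per gadget, rather than disabling speculation globally.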
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116344
Modeling and Verifying Uncertainty-Aware Timing Behaviors using Parametric Logical Time Constraint
Fei Gao, F. Mallet, Min Zhang, Mingsong Chen
The Clock Constraint Specification Language (CCSL) is a logical-time-based modeling language for formalizing the timing behaviors of real-time and embedded systems. However, it cannot capture timing behaviors that contain uncertainties, e.g., uncertainty in execution time or period. This limits the application of the language to real-world systems, as uncertainty often exists in practice due to both internal and external factors. To capture uncertainties in timing behaviors, in this paper we extend CCSL by introducing parameters into constraints. We then propose an approach that transforms parametric CCSL constraints into SMT formulas for efficient verification. We apply our approach to an industrial case proposed as the 2015 FMTV (Formal Methods for Timing Verification) challenge, which shows that timing behaviors with uncertainties can be effectively modeled and verified using parametric CCSL.
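To give a flavor of encoding parametric timing constraints as SMT, here is a minimal z3py sketch: a task released with an uncertain period and a deadline query over a bounded number of instances. The period bounds, response-time bound, and deadline are made-up numbers, and this is only the encoding style, not the paper's CCSL-to-SMT translation.

```python
from z3 import Int, Solver, And, Or

N, D = 5, 12                                   # instances to unroll, deadline (ms)
P = Int("P")                                   # uncertain period (parameter)
rel = [Int(f"rel_{i}") for i in range(N)]      # release times
rsp = [Int(f"rsp_{i}") for i in range(N)]      # response times

s = Solver()
s.add(And(P >= 8, P <= 10))                    # parametric period: 8..10 ms
s.add(rel[0] == 0)
for i in range(1, N):
    s.add(rel[i] == rel[i - 1] + P)            # periodic releases with parametric period
for i in range(N):
    s.add(rsp[i] >= rel[i], rsp[i] <= rel[i] + 9)  # assumed response-time bound of 9 ms

# Deadline-miss query: can some instance finish later than rel + D for any valid P?
s.add(Or(*[rsp[i] > rel[i] + D for i in range(N)]))
print(s.check())   # unsat: a 9 ms response bound never exceeds the 12 ms deadline
```

An "unsat" answer means the timing property holds for every admissible value of the parameter, which is the kind of guarantee the parametric encoding is after.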
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116350
On the Automatic Exploration of Weight Sharing for Deep Neural Network Compression
Etienne Dupuis, D. Novo, Ian O’Connor, A. Bosio
Deep neural networks demonstrate impressive levels of performance, particularly in computer vision and speech recognition. However, their computational workload and associated storage inhibit their potential in resource-limited embedded systems. The approximate computing paradigm has been widely explored in the literature; it improves performance and energy efficiency by relaxing the need for fully accurate operations. There is a large number of implementation options with very different approximation strategies (such as pruning, quantization, low-rank factorization, and knowledge distillation). To the best of our knowledge, no automated approach exists to explore, select, and generate the best approximate versions of a given convolutional neural network (CNN) according to the design objectives. The goal of this work in progress is to demonstrate that the design space exploration phase can enable significant network compression without noticeable accuracy loss. We demonstrate this via an example based on weight sharing and show that our method can obtain a 4x compression rate on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN) without re-training and without any accuracy loss.
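The following is a minimal sketch of weight sharing as an approximation strategy: the weights of one layer are clustered and each weight is replaced by its cluster centroid, so only a small codebook plus per-weight indices need to be stored. The layer shape, codebook size, and the resulting compression figure are illustrative assumptions, not the paper's exploration method or its reported result.

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights: np.ndarray, n_shared: int = 16):
    """Replace each weight by its nearest shared value (k-means codebook)."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_shared, n_init=10, random_state=0).fit(flat)
    approx = km.cluster_centers_[km.labels_].reshape(weights.shape)
    # 16-bit weights become log2(n_shared)-bit indices plus a small 16-bit codebook
    index_bits = int(np.ceil(np.log2(n_shared)))
    compression = (weights.size * 16) / (weights.size * index_bits + n_shared * 16)
    return approx, compression

if __name__ == "__main__":
    w = np.random.randn(120, 84).astype(np.float32)   # e.g. an FC-layer-sized matrix
    approx, ratio = share_weights(w, n_shared=16)
    print(f"compression ~{ratio:.1f}x, mean abs error {np.abs(w - approx).mean():.4f}")
```

The design space exploration the paper targets would then decide, per layer, how many shared values can be used before accuracy degrades.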
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116309
Efficient and Robust High-Level Synthesis Design Space Exploration through offline Micro-kernels Pre-characterization
Zi Wang, Jianqi Chen, Benjamin Carrión Schäfer
This work proposes a method to accelerate High-Level Synthesis (HLS) Design Space Exploration (DSE) by pre-characterizing micro-kernels offline and building predictive models from them. HLS can generate different types of micro-architectures from the same untimed behavioral description. This is typically done by setting different combinations of synthesis options in the form of synthesis directives specified as pragmas in the code, which control, e.g., how loops, arrays, and functions should be synthesized. Each unique combination of these pragmas leads to a micro-architecture with a unique area vs. performance/power trade-off. The main problem is that the search space grows exponentially with the number of explorable operations. Thus, the main goal of efficient HLS DSE is to quickly find the combinations of synthesis directives that lead to the Pareto-optimal designs. Our proposed method pre-characterizes micro-kernels offline, creates predictive models for each kernel, and uses the results to explore a new, unseen behavioral description through compositional methods. In addition, we use perceptual hashing to match new, unseen micro-kernels with the pre-characterized micro-kernels in order to further speed up the search process. Experimental results show that our proposed method is orders of magnitude faster than traditional methods.
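A toy sketch of the compositional lookup idea follows: micro-kernels characterized offline are stored with a binary signature and per-directive cost estimates, and an unseen kernel is matched to the nearest signature by Hamming distance (a stand-in for the perceptual hash) so its estimates can be reused. All signatures, numbers, and kernel names are made up for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two kernel signatures."""
    return bin(a ^ b).count("1")

# signature -> {unroll factor: (estimated latency in cycles, estimated area units)}
LIBRARY = {
    0b10110010: {1: (1200, 150), 4: (420, 300), 8: (260, 520)},   # e.g. FIR-like kernel
    0b01101100: {1: (900, 180),  4: (310, 340), 8: (200, 610)},   # e.g. reduction kernel
}

def estimate(new_signature: int, unroll: int):
    """Reuse the estimates of the closest pre-characterized micro-kernel."""
    best = min(LIBRARY, key=lambda sig: hamming(sig, new_signature))
    return LIBRARY[best][unroll]

if __name__ == "__main__":
    latency, area = estimate(0b10110011, unroll=4)   # unseen kernel near the first entry
    print(f"predicted latency={latency} cycles, area={area} units")
```

In the actual flow these per-kernel predictions would be composed to estimate whole-design points without invoking HLS for each directive combination.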
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116287
Flexible Group-Level Pruning of Deep Neural Networks for On-Device Machine Learning
Kwangbae Lee, Hoseung Kim, Hayun Lee, Dongkun Shin
Network pruning is a promising compression technique to reduce the computation and memory access cost of deep neural networks. Pruning techniques are classified into two types: fine-grained and coarse-grained. Fine-grained pruning eliminates individual connections if they are insignificant and thus usually generates irregular networks, making it hard to reduce model execution time. Coarse-grained pruning, such as filter-level and channel-level techniques, can produce hardware-friendly networks but may suffer from low accuracy. In this paper, we focus on group-level pruning to accelerate deep neural networks on mobile GPUs, where several adjacent weights are pruned as a group to mitigate the irregularity of pruned networks while preserving high accuracy. Although several group-level pruning techniques have been proposed, previous techniques select weight groups to prune only at group-size-aligned locations. In this paper, we propose a more flexible approach, called unaligned group-level pruning, to improve the accuracy of the compressed model. We find the optimal solution of the unaligned group selection problem with dynamic programming. Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that 2D unaligned group-level pruning achieves a 3.12% lower error rate for ResNet-20 on CIFAR-10 compared to the previous 2D aligned group-level pruning at 95% sparsity.
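To illustrate why dynamic programming fits the unaligned group selection problem, here is a simplified 1D instance: choose exactly k non-overlapping groups of g consecutive weights to prune so that the total pruned magnitude is minimal (equivalently, the retained importance is maximal). This is a minimal sketch of the idea, not the paper's full 2D, balance-aware formulation.

```python
def select_groups(weights, g, k):
    """Pick k non-overlapping length-g groups with minimum total |weight| to prune."""
    n = len(weights)
    cost = [sum(abs(w) for w in weights[i:i + g]) for i in range(n - g + 1)]
    INF = float("inf")
    # dp[i][j]: min pruned magnitude using the first i weights with j groups pruned
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            dp[i][j] = dp[i - 1][j]                        # weight i-1 is kept
            if i >= g and dp[i - g][j - 1] + cost[i - g] < dp[i][j]:
                dp[i][j] = dp[i - g][j - 1] + cost[i - g]  # group ending at i-1 is pruned
    # backtrack to recover the (possibly unaligned) group start positions
    groups, i, j = [], n, k
    while j > 0:
        if i >= g and dp[i][j] == dp[i - g][j - 1] + cost[i - g]:
            groups.append(i - g)
            i, j = i - g, j - 1
        else:
            i -= 1
    return dp[n][k], sorted(groups)

if __name__ == "__main__":
    w = [0.9, 0.1, 0.05, 0.8, 0.02, 0.01, 0.7, 0.03]
    print(select_groups(w, g=2, k=2))   # prunes the two lowest-magnitude adjacent pairs
```

Because group starts are not restricted to multiples of g, the pruned groups can land exactly on the least important weights, which aligned selection cannot always do.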
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116273
GRAMARCH: A GPU-ReRAM based Heterogeneous Architecture for Neural Image Segmentation
Biresh Kumar Joardar, Nitthilan Kanappan Jayakodi, J. Doppa, H. Li, P. Pande, K. Chakrabarty
Deep Neural Networks (DNNs) employed for image segmentation are computationally more expensive and complex than those used for classification. However, manycore architectures to accelerate the training of these DNNs are relatively unexplored. Resistive random-access memory (ReRAM)-based architectures offer a promising alternative to commonly used GPU-based platforms for training DNNs. However, due to their low-precision storage capability, these architectures cannot support all DNN layers and suffer from accuracy loss in the learned models. To address these challenges, we propose GRAMARCH, a heterogeneous architecture that combines the benefits of ReRAM and GPUs by using a high-throughput 3D Network-on-Chip. Experimental results indicate that, by suitably mapping DNN layers to processing elements, it is possible to achieve up to 53X better performance than conventional GPUs for image segmentation.
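As a toy illustration of the layer-mapping idea, the sketch below assigns layers either to ReRAM crossbar tiles (dense, low-precision-tolerant matrix-vector work) or to GPU processing elements (layers assumed to be precision-sensitive or irregular). The classification rule and layer names are a made-up heuristic for illustration, not GRAMARCH's mapping algorithm.

```python
RERAM_FRIENDLY = {"conv", "fc"}                              # MAC-dominated layers (assumed)
GPU_ONLY = {"softmax", "batchnorm", "upsample", "loss"}      # assumed precision-sensitive

def map_layers(layers):
    """Return a layer -> substrate assignment for a heterogeneous ReRAM+GPU platform."""
    return {name: ("ReRAM" if kind in RERAM_FRIENDLY else "GPU") for name, kind in layers}

if __name__ == "__main__":
    unet_like = [("enc_conv1", "conv"), ("enc_conv2", "conv"), ("bn1", "batchnorm"),
                 ("dec_up1", "upsample"), ("dec_conv1", "conv"), ("out_softmax", "softmax")]
    for layer, target in map_layers(unet_like).items():
        print(f"{layer:12s} -> {target}")
```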
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116295
Using Universal Composition to Design and Analyze Secure Complex Hardware Systems
R. Canetti, Marten van Dijk, Hoda Maleki, U. Rührmair, P. Schaumont
Modern hardware is typically characterized by a multitude of interacting physical components and software mechanisms. To address this complexity, security analysis should be modular: we would like to formulate and prove security properties of individual components, and then deduce the security of the overall design (encompassing hardware and software) from the security of those components. While this may seem like an elusive goal, we argue that it is essentially the only feasible way to provide rigorous security analysis of modern hardware. This paper investigates the possibility of using the Universally Composable (UC) security framework toward this aim. The UC framework has been devised and successfully used in the theoretical cryptography community to study and formally prove the security of arbitrarily interleaving cryptographic protocols; in particular, a sophisticated analytical toolbox has been developed around it. We provide an introduction to this framework and investigate, via a number of examples, ways in which it can be used to facilitate a novel type of modular security analysis. This analysis applies to combined hardware and software systems and investigates their security against attacks that combine both physical and digital steps.
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116318
ARS: Reducing F2FS Fragmentation for Smartphones using Decision Trees
Lihua Yang, F. Wang, Zhipeng Tan, D. Feng, Jiaxing Qian, Shiyun Tu
File and free space fragmentation are well known to degrade file system performance. F2FS is a file system designed for flash memory. However, it suffers from severe fragmentation due to its out-of-place updates and the highly synchronous, multi-threaded writing behaviors of mobile applications. We observe that the running time on fragmented files is 2.36× longer than on contiguous files, and that F2FS's in-place update scheme is incapable of reducing fragmentation. A fragmented file system leads to a poor user experience. Reserving space to prevent fragmentation is an intuitive approach, but reserving space for all files wastes space since there are a large number of files. To deal with this dilemma, we propose an adaptive reserved space (ARS) scheme that chooses specific files to update in the reserved space. How to effectively select reserved files is critical to performance. We collect file characteristics associated with fragmentation to construct data sets and use decision trees to accurately pick reserved files. In addition, an adjustable reserved space and a dynamic reservation strategy are adopted. We implement ARS on a HiKey960 development platform and a commercial smartphone with slight space and file creation time overheads. Experimental results show that ARS dramatically reduces file and free space fragmentation, improves file I/O performance, and reduces garbage collection overhead compared to traditional F2FS and F2FS with in-place updates. Furthermore, ARS delivers up to 1.26× the SQLite transactions per second of traditional F2FS and reduces the running time of realistic workloads by up to 41.72% compared to F2FS with in-place updates.
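A minimal sketch of the reserved-file selection step follows: a decision tree is trained on per-file features to predict whether a file is fragmentation-prone and should therefore be updated in the reserved space. The feature set and the synthetic training data are purely illustrative assumptions, not the traces or features used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
# assumed features: [file size (KB), overwrite ratio, fsyncs per minute, writer threads]
X = np.column_stack([
    rng.integers(4, 4096, n),
    rng.random(n),
    rng.integers(0, 60, n),
    rng.integers(1, 8, n),
])
# synthetic label: small, frequently overwritten, heavily synced files fragment most
y = ((X[:, 0] < 512) & (X[:, 1] > 0.5) & (X[:, 2] > 10)).astype(int)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
new_file = [[256, 0.8, 30, 4]]            # a database-like file
print("reserve space:", bool(clf.predict(new_file)[0]))
```

A shallow tree like this is cheap enough to evaluate at file-creation time, which is why a decision tree is a plausible fit for an in-kernel or runtime policy.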
Pub Date: 2020-03-01, DOI: 10.23919/DATE48585.2020.9116569
Modeling and Designing of a PVT Auto-tracking Timing-speculative SRAM
Shan Shen, Tianxiang Shao, Ming Ling, Jun Yang, Longxing Shi
In the low-supply-voltage region, the performance of 6T-cell SRAM degrades severely because more time is needed to develop a sufficient voltage difference on the bitlines. Timing-speculative techniques have been proposed to boost SRAM frequency and throughput by speculatively reading data under aggressive timing and correcting timing failures in one or more extended cycles. However, the throughput gains of timing-speculative SRAM are affected by process, voltage, and temperature (PVT) variations, which cause the timing design of speculative SRAM to be either too aggressive or too conservative. This paper first proposes a statistical model that abstracts the characteristics of speculative SRAM and shows the existence of an optimal sensing time that maximizes the overall throughput. Then, guided by this performance model, a PVT auto-tracking speculative SRAM is designed and fabricated, which can dynamically self-tune the bitline sensing to the optimal time as the working condition changes. According to the measurement results, the maximum throughput gain of the proposed 28nm SRAM is 1.62X compared to the baseline at 0.6V VDD.
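The trade-off behind the "optimal sensing time" can be sketched with a toy statistical model: a shorter sensing time reduces the nominal access time but raises the probability of a timing failure, which costs an extra correction delay, so the expected access time has an interior minimum. The failure model (a Gaussian tail on the required bitline-development time), the constants, and the correction cost below are illustrative assumptions, not the fabricated chip's fitted model.

```python
import math

def p_fail(t_sense_ns, mu=1.0, sigma=0.25):
    """Probability that the required bitline-development time exceeds t_sense."""
    return 0.5 * math.erfc((t_sense_ns - mu) / (sigma * math.sqrt(2)))

def expected_access_time(t_sense_ns, t_correction_ns=1.5):
    # speculative read at t_sense; on failure, pay an assumed fixed correction delay
    return t_sense_ns + p_fail(t_sense_ns) * t_correction_ns

def throughput(t_sense_ns):
    return 1.0 / expected_access_time(t_sense_ns)

if __name__ == "__main__":
    candidates = [t / 100 for t in range(40, 200)]           # sweep 0.40 .. 1.99 ns
    best = max(candidates, key=throughput)
    print(f"optimal sensing time ~{best:.2f} ns, "
          f"throughput {throughput(best):.3f} accesses/ns")
```

Since PVT variation shifts mu and sigma, the optimum moves with the operating condition, which is the motivation for tracking it at run time rather than fixing it at design time.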