Dynamic Sharing in Multi-accelerators of Neural Networks on an FPGA Edge Device
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00040
Hsin-Yu Ting, Tootiya Giyahchi, A. A. Sani, E. Bozorgzadeh
Edge computing can potentially provide abundant processing resources for compute-intensive applications while bringing services close to end devices. With increasing demand for computing acceleration at the edge, FPGAs have been deployed to provide custom deep neural network (DNN) accelerators. This paper explores a DNN accelerator sharing system on an edge FPGA device that serves various DNN applications from multiple end devices simultaneously. The proposed SharedDNN/PlanAhead policy exploits the regularity among requests for various DNN accelerators, determining which accelerator to allocate to each request and in what order to respond to requests so as to achieve maximum responsiveness for a queue of acceleration requests. Our results show an overall performance gain of up to 2.20x, and a utilization improvement that reduces DNN library usage by up to 27%, while staying within the requests’ requirements and resource constraints.
{"title":"Dynamic Sharing in Multi-accelerators of Neural Networks on an FPGA Edge Device","authors":"Hsin-Yu Ting, Tootiya Giyahchi, A. A. Sani, E. Bozorgzadeh","doi":"10.1109/ASAP49362.2020.00040","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00040","url":null,"abstract":"Edge computing can potentially provide abundant processing resources for compute-intensive applications while bringing services close to end devices. With the increasing demands for computing acceleration at the edge, FPGAs have been deployed to provide custom deep neural network accelerators. This paper explores a DNN accelerator sharing system at the edge FPGA device, that serves various DNN applications from multiple end devices simultaneously. The proposed SharedDNN/PlanAhead policy exploits the regularity among requests for various DNN accelerators and determines which accelerator to allocate for each request and in what order to respond to the requests that achieve maximum responsiveness for a queue of acceleration requests. Our results show overall 2. 20x performance gain at best and utilization improvement by reducing up to 27% of DNN library usage while staying within the requests’ requirements and resource constraints.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124278105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Accurate Training of Ensemble Models with FPGA-based Switch
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00023
Jiuxi Meng, Ce Guo, Nadeen Gebara, W. Luk
Random projection is gaining attention in large-scale machine learning. Multiplying a dataset by a suitably chosen matrix provably reduces its dimensionality while approximately preserving the pairwise distances between points. However, projecting data onto a lower-dimensional subspace typically reduces training accuracy. In this paper, we propose a novel architecture that combines an FPGA-based switch with the ensemble learning method. This architecture reduces training time while maintaining high accuracy. Our initial results show a speedup of 2.12-6.77 times on four different high-dimensionality datasets.
{"title":"Fast and Accurate Training of Ensemble Models with FPGA-based Switch","authors":"Jiuxi Meng, Ce Guo, Nadeen Gebara, W. Luk","doi":"10.1109/ASAP49362.2020.00023","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00023","url":null,"abstract":"Random projection is gaining more attention in large scale machine learning. It has been proved to reduce the dimensionality of a set of data whilst approximately preserving the pairwise distance between points by multiplying the original dataset with a chosen matrix. However, projecting data to a lower dimension subspace typically reduces the training accuracy. In this paper, we propose a novel architecture that combines an FPGA-based switch with the ensemble learning method. This architecture enables reducing training time while maintaining high accuracy. Our initial result shows a speedup of 2.12-6.77 times using four different high dimensionality datasets.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient FeFET Crossbar Accelerator for Binary Neural Networks
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00027
T. Soliman, R. Olivo, T. Kirchner, Cecilia De la Parra, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn
This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerating Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization, and dense-layer processing through a grid of small crossbars with reduced unit size, which enables multi-bit operations and value accumulation. Additionally, we explore possible parallelization of operations for maximum computational performance. Simulation results show that our new architecture achieves a computing performance of up to 2.46 TOPS and a high power efficiency of 111.8 TOPS/W in an area of 0.026 mm² in 22 nm FDSOI technology.
{"title":"Efficient FeFET Crossbar Accelerator for Binary Neural Networks","authors":"T. Soliman, R. Olivo, T. Kirchner, Cecilia De la Parra, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn","doi":"10.1109/ASAP49362.2020.00027","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00027","url":null,"abstract":"This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerate Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization and dense layer processing through a grid of small crossbars with reduced unit size, which enables multiple bit operation and value accumulation. Additionally, we explore the possible operations parallelization for maximized computational performance. Simulation results show that our new architecture achieves a computing performance up to 2.46 TOPS while achieving a high power efficiency reaching 111.8 TOPS/Watt and an area of 0.026 mm2 in 22nm FDSOI technology.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126348766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Side-Channel Resistance by Dynamic Fault-Injection Countermeasures
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00029
Jan Richter-Brockmann, T. Güneysu
Side-channel analysis and fault-injection attacks are known as serious threats to cryptographic hardware implementations, and combined protection against both is currently an open line of research. A promising countermeasure, albeit with considerable implementation overhead, is a mix of first-order secure Threshold Implementations and linear Error-Correcting Codes. In this paper we employ for the first time the inherent structure of non-systematic codes as a fault countermeasure, dynamically mutating the applied generator matrices to achieve a design protected against higher-order side-channel and fault attacks. As a case study, we apply our scheme to the PRESENT block cipher; the resulting design shows no higher-order side-channel leakage after measuring 150 million power traces.
{"title":"Improved Side-Channel Resistance by Dynamic Fault-Injection Countermeasures","authors":"Jan Richter-Brockmann, T. Güneysu","doi":"10.1109/ASAP49362.2020.00029","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00029","url":null,"abstract":"Side-channel analysis and fault-injection attacks are known as serious threats to cryptographic hardware implementations and the combined protection against both is currently an open line of research. A promising countermeasure with considerable implementation overhead appears to be a mix of first-order secure Threshold Implementations and linear Error-Correcting Codes.In this paper we employ for the first time the inherent structure of non-systematic codes as fault countermeasure which dynamically mutates the applied generator matrices to achieve a higher-order side-channel and fault-protected design. As a case study, we apply our scheme to the PRESENT block cipher that do not show any higher-order side-channel leakage after measuring 150 million power traces.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121220213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel-friendly Majority Gate to Accelerate In-memory Computation
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00025
J. Reuben, Stefan Pechmann
Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation compared to NAND/NOR/IMPLY logic due to its expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for the RRAM) and an industry-standard model (for the CMOS sense amplifier), and was found to tolerate reasonable variations in the RRAMs’ resistive states. Together with a NOT gate, which is also implemented in memory, the proposed gate forms a functionally complete Boolean logic set, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in the memory array. The proposed in-memory adder achieves latency reductions of 70% and 50% compared to IMPLY- and NAND/NOR-logic-based adders, respectively.
{"title":"A Parallel-friendly Majority Gate to Accelerate In-memory Computation","authors":"J. Reuben, Stefan Pechmann","doi":"10.1109/ASAP49362.2020.00025","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00025","url":null,"abstract":"Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation when compared to NAND/NOR/IMPLY logic due to it’s expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for RRAM) and industry standard model (for CMOS sense amplifier) and, found to tolerate reasonable variations in the RRAMs’ resistive states. Together with NOT gate, which is also implemented in-memory, the proposed gate forms a functionally complete Boolean logic, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in memory array. The proposed in-memory adder could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"33 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124341728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ParaHist: FPGA Implementation of Parallel Event-Based Histogram for Optical Flow Calculation
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00038
Mohammad Pivezhandi, Phillip H. Jones, Joseph Zambreno
In this paper, we present an FPGA-based architecture for histogram generation to support event-based camera optical flow calculation. Our proposed histogram generation mechanism reduces memory and logic resources by storing the time difference between consecutive events, instead of the absolute time of each event. Additionally, we explore the trade-off between system resource usage and histogram accuracy as a function of the precision at which time is encoded. Our results show that across three event-based camera benchmarks we can reduce the encoding of time from 32 to 7 bits with a loss of only approximately 3% in histogram accuracy. In comparison to a software implementation, our architecture shows a significant speedup.
{"title":"ParaHist: FPGA Implementation of Parallel Event-Based Histogram for Optical Flow Calculation","authors":"Mohammad Pivezhandi, Phillip H. Jones, Joseph Zambreno","doi":"10.1109/ASAP49362.2020.00038","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00038","url":null,"abstract":"In this paper, we present an FPGA-based architecture for histogram generation to support event-based camera optical flow calculation. Our proposed histogram generation mechanism reduces memory and logic resources by storing the time difference between consecutive events, instead of the absolute time of each event. Additionally, we explore the trade-off between system resource usage and histogram accuracy as a function of the precision at which time is encoded. Our results show that across three event-based camera benchmarks we can reduce the encoding of time from 32 to 7 bits with a loss of only approximately 3% in histogram accuracy. In comparison to a software implementation, our architecture shows a significant speedup.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125103010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Hardware Approach to Self-Organizing Maps
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00041
L. Dias, M. G. Coutinho, E. Gaura, Marcelo A. C. Fernandes
Self-Organizing Maps (SOMs) are widely used as a data-mining technique for applications that require dimensionality reduction and clustering. Given the complexity of the SOM learning phase and the massive dimensionality and sample sizes of data sets in Big Data applications, high-speed processing is critical when implementing SOM approaches. This paper proposes a new hardware approach to SOM implementation that exploits parallelization to optimize the system’s processing time. Unlike most implementations in the literature, the proposed approach parallelizes across the data dimensions rather than across the map, ensuring high processing speed regardless of the data’s dimensionality. An implementation on a field-programmable gate array (FPGA) is presented and evaluated. Key evaluation metrics are processing time (throughput) and FPGA area occupancy (hardware resources).
{"title":"A New Hardware Approach to Self-Organizing Maps","authors":"L. Dias, M. G. Coutinho, E. Gaura, Marcelo A. C. Fernandes","doi":"10.1109/ASAP49362.2020.00041","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00041","url":null,"abstract":"Self-Organizing Maps (SOMs) are widely used as a data mining technique for applications that require data dimensionality reduction and clustering. Given the complexity of the SOM learning phase and the massive dimensionality of many data sets as well as their sample size in Big Data applications, high-speed processing is critical when implementing SOM approaches. This paper proposes a new hardware approach to SOM implementation, exploiting parallelization, to optimize the system’s processing time. Unlike most implementations in the literature, this proposed approach allows the parallelization of the data dimensions instead of the map, ensuring high processing speed regardless of data dimensions. An implementation with field-programmable gate arrays (FPGA) is presented and evaluated. Key evaluation metrics are processing time (or throughput) and FPGA area occupancy (or hardware resources).","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129415893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Template-based Framework for Exploring Coarse-Grained Reconfigurable Architectures
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00010
Artur Podobas, K. Sano, S. Matsuoka
Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs are used in several embedded applications, including automotive, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field is full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike related work, our framework supports co-integration with third-party memory simulators and the evaluation of future memory architectures, which is crucial for reasoning about memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple kernels, showing the impact of different configuration and design-space options.
{"title":"A Template-based Framework for Exploring Coarse-Grained Reconfigurable Architectures","authors":"Artur Podobas, K. Sano, S. Matsuoka","doi":"10.1109/ASAP49362.2020.00010","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00010","url":null,"abstract":"Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA, by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs have been used in several embedded applications, including automobile, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for use in evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike other related work, our framework supports co-integration with third-party memory simulators or evaluation of future memory architecture, which is crucial to reason around memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple different kernels, showing the impact of different configuration and design-space options.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128212961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}