Efficient FeFET Crossbar Accelerator for Binary Neural Networks
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00027
T. Soliman, R. Olivo, T. Kirchner, Cecilia De la Parra, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn
This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerating Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization, and dense-layer processing through a grid of small crossbars with reduced unit size, which enables multi-bit operation and value accumulation. Additionally, we explore possible parallelization of operations to maximize computational performance. Simulation results show that the new architecture reaches a computing performance of up to 2.46 TOPS and a power efficiency of 111.8 TOPS/W within an area of 0.026 mm² in 22 nm FDSOI technology.
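For background, the arithmetic such crossbars accumulate is the standard BNN XNOR-popcount dot product: with values in {-1, +1} encoded as bits, agreements contribute +1 and disagreements -1. A minimal sketch of that arithmetic (the general BNN convention, not the paper's circuit):

```python
def bnn_dot(activations, weights):
    """Binary dot product with values in {-1, +1} encoded as {0, 1}."""
    assert len(activations) == len(weights)
    n = len(activations)
    # XNOR counts the positions where the two bit vectors agree.
    matches = sum(1 for a, w in zip(activations, weights) if a == w)
    # Each match contributes +1, each mismatch -1: result = matches - (n - matches).
    return 2 * matches - n

# Example: 2 matches and 2 mismatches cancel out to 0.
print(bnn_dot([1, 0, 1, 1], [1, 1, 1, 0]))  # -> 0
```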
{"title":"Efficient FeFET Crossbar Accelerator for Binary Neural Networks","authors":"T. Soliman, R. Olivo, T. Kirchner, Cecilia De la Parra, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn","doi":"10.1109/ASAP49362.2020.00027","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00027","url":null,"abstract":"This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerate Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization and dense layer processing through a grid of small crossbars with reduced unit size, which enables multiple bit operation and value accumulation. Additionally, we explore the possible operations parallelization for maximized computational performance. Simulation results show that our new architecture achieves a computing performance up to 2.46 TOPS while achieving a high power efficiency reaching 111.8 TOPS/Watt and an area of 0.026 mm2 in 22nm FDSOI technology.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126348766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Accurate Training of Ensemble Models with FPGA-based Switch
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00023
Jiuxi Meng, Ce Guo, Nadeen Gebara, W. Luk
Random projection is gaining attention in large-scale machine learning. Multiplying the original dataset by a suitably chosen matrix provably reduces its dimensionality while approximately preserving the pairwise distances between points. However, projecting data onto a lower-dimensional subspace typically reduces training accuracy. In this paper, we propose a novel architecture that combines an FPGA-based switch with the ensemble learning method. This architecture reduces training time while maintaining high accuracy. Our initial results show a speedup of 2.12-6.77 times on four different high-dimensionality datasets.
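The distance-preservation property that motivates random projection (the Johnson-Lindenstrauss lemma) is easy to check numerically. A minimal sketch with a scaled Gaussian projection matrix, as background rather than the paper's FPGA design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 64                        # samples, original dim, projected dim
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled Gaussian projection matrix
Y = X @ R                                      # project all points at once

# Pairwise distances survive the projection up to a small distortion.
i, j = 3, 42
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"original distance {orig:.2f}, projected {proj:.2f}")
```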
{"title":"Fast and Accurate Training of Ensemble Models with FPGA-based Switch","authors":"Jiuxi Meng, Ce Guo, Nadeen Gebara, W. Luk","doi":"10.1109/ASAP49362.2020.00023","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00023","url":null,"abstract":"Random projection is gaining more attention in large scale machine learning. It has been proved to reduce the dimensionality of a set of data whilst approximately preserving the pairwise distance between points by multiplying the original dataset with a chosen matrix. However, projecting data to a lower dimension subspace typically reduces the training accuracy. In this paper, we propose a novel architecture that combines an FPGA-based switch with the ensemble learning method. This architecture enables reducing training time while maintaining high accuracy. Our initial result shows a speedup of 2.12-6.77 times using four different high dimensionality datasets.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Radiative Transfer Simulation with GPU-FPGA Cooperative Computation
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00011
Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura
Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research. This is ascribed to the drastic improvement in their computational and communication capabilities in recent years, owing to advances in semiconductor integration technologies that rely on Moore's Law. In addition to these performance improvements, FPGA vendors now offer OpenCL toolchains that reduce the programming effort required. Together, these improvements make it feasible to offload, on the fly, computations at which CPUs/GPUs perform poorly relative to FPGAs, while sustaining low-latency data transfers. We consider this concept key to improving the performance of heterogeneous supercomputers that employ accelerators such as GPUs. In this study, we propose GPU-FPGA-accelerated simulation based on this concept and demonstrate its implementation with mixed CUDA and OpenCL programming. The experimental results show that our proposed method achieves up to 17.4× higher performance than a GPU-only implementation, and remains 1.32× faster even at the largest problem size, where the GPU-only implementation performs best. We consider the realization of GPU-FPGA-accelerated simulation to be the most significant difference between our work and previous studies.
{"title":"Accelerating Radiative Transfer Simulation with GPU-FPGA Cooperative Computation","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura","doi":"10.1109/ASAP49362.2020.00011","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00011","url":null,"abstract":"Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing. This is ascribed to the drastic improvement in their computational and communication capabilities in recent years owing to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to these performance improvements, toolchains for the development of FPGAs in OpenCL have been offered by FPGA vendors to reduce the programming effort required. These improvements suggest the possibility of implementing the concept of enabling on-the-fly offloading computation at which CPUs/GPUs perform poorly relative to FPGAs while performing low-latency data transfers. We consider this concept to be of key importance to improve the performance of heterogeneous supercomputers that employ accelerators such as a GPU. In this study, we propose GPU–FPGA-accelerated simulation based on this concept and demonstrate the implementation of the proposed method with CUDA and OpenCL mixed programming. The experimental results showed that our proposed method can increase the performance by up to $17.4 times$ compared with GPU-based implementation. This performance is still $1.32 times$ higher even when solving problems with the largest size, which is the fastest problem size for GPU-based implementation. We consider the realization of GPU–FPGA-accelerated simulation to be the most significant difference between our work and previous studies.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132217653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Side-Channel Resistance by Dynamic Fault-Injection Countermeasures
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00029
Jan Richter-Brockmann, T. Güneysu
Side-channel analysis and fault-injection attacks are known as serious threats to cryptographic hardware implementations, and combined protection against both is currently an open line of research. A promising countermeasure, albeit with considerable implementation overhead, appears to be a mix of first-order secure Threshold Implementations and linear Error-Correcting Codes. In this paper, we employ for the first time the inherent structure of non-systematic codes as a fault countermeasure, dynamically mutating the applied generator matrices to achieve a design protected against higher-order side-channel and fault attacks. As a case study, we apply our scheme to the PRESENT block cipher, which shows no higher-order side-channel leakage after measuring 150 million power traces.
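To illustrate the underlying algebra (a toy sketch, not the paper's construction): a message m is encoded as m·G over GF(2), and left-multiplying G by a random invertible matrix S "mutates" the message-to-codeword mapping while preserving the code itself, since the row space of G is unchanged. The generator matrix below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(msg, G):
    return (msg @ G) % 2          # codeword = message times generator, over GF(2)

G = np.array([[1, 0, 1, 1, 0, 1],  # hypothetical 2x6 non-systematic generator
              [0, 1, 1, 0, 1, 1]])

def random_invertible(k):
    while True:
        S = rng.integers(0, 2, (k, k))
        # An integer 0/1 matrix is invertible over GF(2) iff det(S) is odd.
        if round(np.linalg.det(S)) % 2 == 1:
            return S

S = random_invertible(2)
G_mut = (S @ G) % 2               # mutated generator: same code, new mapping
msg = np.array([1, 0])
print(encode(msg, G), encode(msg, G_mut))
```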
{"title":"Improved Side-Channel Resistance by Dynamic Fault-Injection Countermeasures","authors":"Jan Richter-Brockmann, T. Güneysu","doi":"10.1109/ASAP49362.2020.00029","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00029","url":null,"abstract":"Side-channel analysis and fault-injection attacks are known as serious threats to cryptographic hardware implementations and the combined protection against both is currently an open line of research. A promising countermeasure with considerable implementation overhead appears to be a mix of first-order secure Threshold Implementations and linear Error-Correcting Codes.In this paper we employ for the first time the inherent structure of non-systematic codes as fault countermeasure which dynamically mutates the applied generator matrices to achieve a higher-order side-channel and fault-protected design. As a case study, we apply our scheme to the PRESENT block cipher that do not show any higher-order side-channel leakage after measuring 150 million power traces.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121220213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel-friendly Majority Gate to Accelerate In-memory Computation
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00025
J. Reuben, Stefan Pechmann
Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation compared to NAND/NOR/IMPLY logic due to its expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for the RRAM) and an industry-standard model (for the CMOS sense amplifier), and was found to tolerate reasonable variations in the RRAMs’ resistive states. Together with a NOT gate, which is also implemented in memory, the proposed gate forms a functionally complete Boolean logic set, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in the memory array. The proposed in-memory adder achieves latency reductions of 70% and 50% compared to IMPLY- and NAND/NOR-based adders, respectively.
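The expressive power of majority shows up clearly in addition: the carry-out of a full adder is exactly MAJ(a, b, cin), and the sum can be built from three majorities and two inversions. A small sketch using standard majority-logic identities (not the paper's exact in-array circuit):

```python
def maj(a, b, c):
    # Majority of three bits.
    return (a & b) | (b & c) | (a & c)

def full_adder(a, b, cin):
    carry = maj(a, b, cin)                       # carry-out is a plain majority
    s = maj(1 - carry, maj(a, b, 1 - cin), cin)  # sum from 3 MAJ gates + 2 NOTs
    return s, carry

# Exhaustive check against ordinary binary addition.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            assert full_adder(a, b, cin) == ((a + b + cin) % 2, (a + b + cin) // 2)
```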
{"title":"A Parallel-friendly Majority Gate to Accelerate In-memory Computation","authors":"J. Reuben, Stefan Pechmann","doi":"10.1109/ASAP49362.2020.00025","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00025","url":null,"abstract":"Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation when compared to NAND/NOR/IMPLY logic due to it’s expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for RRAM) and industry standard model (for CMOS sense amplifier) and, found to tolerate reasonable variations in the RRAMs’ resistive states. Together with NOT gate, which is also implemented in-memory, the proposed gate forms a functionally complete Boolean logic, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in memory array. The proposed in-memory adder could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"33 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124341728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ParaHist: FPGA Implementation of Parallel Event-Based Histogram for Optical Flow Calculation
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00038
Mohammad Pivezhandi, Phillip H. Jones, Joseph Zambreno
In this paper, we present an FPGA-based histogram-generation architecture to support optical flow calculation for event-based cameras. Our histogram generation mechanism reduces memory and logic resources by storing the time difference between consecutive events instead of the absolute time of each event. Additionally, we explore the trade-off between system resource usage and histogram accuracy as a function of the precision at which time is encoded. Our results show that, across three event-based camera benchmarks, the time encoding can be reduced from 32 to 7 bits with a loss of only approximately 3% in histogram accuracy. Compared to a software implementation, our architecture shows a significant speedup.
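The storage saving comes from delta-encoding timestamps. A minimal sketch of the idea under an assumed event format (sorted absolute timestamps); the saturation behavior at the 7-bit maximum is our illustrative choice, not necessarily the paper's:

```python
BITS = 7
MAX_DELTA = (1 << BITS) - 1            # largest gap representable in 7 bits: 127

def encode_deltas(timestamps):
    """timestamps: sorted absolute event times (e.g., in microseconds)."""
    deltas, prev = [], timestamps[0]
    for t in timestamps[1:]:
        deltas.append(min(t - prev, MAX_DELTA))  # store the gap, saturating
        prev = t
    return deltas

print(encode_deltas([1000, 1003, 1010, 1400]))   # [3, 7, 127] - last gap saturates
```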
{"title":"ParaHist: FPGA Implementation of Parallel Event-Based Histogram for Optical Flow Calculation","authors":"Mohammad Pivezhandi, Phillip H. Jones, Joseph Zambreno","doi":"10.1109/ASAP49362.2020.00038","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00038","url":null,"abstract":"In this paper, we present an FPGA-based architecture for histogram generation to support event-based camera optical flow calculation. Our proposed histogram generation mechanism reduces memory and logic resources by storing the time difference between consecutive events, instead of the absolute time of each event. Additionally, we explore the trade-off between system resource usage and histogram accuracy as a function of the precision at which time is encoded. Our results show that across three event-based camera benchmarks we can reduce the encoding of time from 32 to 7 bits with a loss of only approximately 3% in histogram accuracy. In comparison to a software implementation, our architecture shows a significant speedup.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125103010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Hardware Approach to Self-Organizing Maps
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00041
L. Dias, M. G. Coutinho, E. Gaura, Marcelo A. C. Fernandes
Self-Organizing Maps (SOMs) are widely used as a data-mining technique for applications that require dimensionality reduction and clustering. Given the complexity of the SOM learning phase and the massive dimensionality and sample size of many data sets in Big Data applications, high-speed processing is critical when implementing SOM approaches. This paper proposes a new hardware approach to SOM implementation that exploits parallelization to optimize the system’s processing time. Unlike most implementations in the literature, the proposed approach parallelizes over the data dimensions instead of over the map, ensuring high processing speed regardless of the data dimensionality. An implementation on field-programmable gate arrays (FPGAs) is presented and evaluated. Key evaluation metrics are processing time (throughput) and FPGA area occupancy (hardware resources).
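For reference, one SOM learning step is a best-matching-unit search (a reduction over dimensions) followed by a neighborhood-weighted update that is independent per dimension, which is what makes dimension-level parallelism natural. A minimal NumPy sketch of the algorithm itself, not the paper's hardware:

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 8, 8, 16
W = rng.random((rows * cols, dim))        # one weight vector per map neuron

def som_step(W, x, lr=0.5, sigma=1.5):
    d2 = ((W - x) ** 2).sum(axis=1)       # distances: a reduction over dimensions
    bmu = d2.argmin()                     # best-matching unit
    by, bx = divmod(bmu, cols)
    ny, nx = np.divmod(np.arange(rows * cols), cols)
    h = np.exp(-((ny - by) ** 2 + (nx - bx) ** 2) / (2 * sigma ** 2))
    W += lr * h[:, None] * (x - W)        # update: independent in each dimension
    return W

W = som_step(W, rng.random(dim))
```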
{"title":"A New Hardware Approach to Self-Organizing Maps","authors":"L. Dias, M. G. Coutinho, E. Gaura, Marcelo A. C. Fernandes","doi":"10.1109/ASAP49362.2020.00041","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00041","url":null,"abstract":"Self-Organizing Maps (SOMs) are widely used as a data mining technique for applications that require data dimensionality reduction and clustering. Given the complexity of the SOM learning phase and the massive dimensionality of many data sets as well as their sample size in Big Data applications, high-speed processing is critical when implementing SOM approaches. This paper proposes a new hardware approach to SOM implementation, exploiting parallelization, to optimize the system’s processing time. Unlike most implementations in the literature, this proposed approach allows the parallelization of the data dimensions instead of the map, ensuring high processing speed regardless of data dimensions. An implementation with field-programmable gate arrays (FPGA) is presented and evaluated. Key evaluation metrics are processing time (or throughput) and FPGA area occupancy (or hardware resources).","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129415893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Template-based Framework for Exploring Coarse-Grained Reconfigurable Architectures
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00010
Artur Podobas, K. Sano, S. Matsuoka
Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs are used in several embedded applications, including automobile, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field is full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike other related work, our framework supports co-integration with third-party memory simulators and evaluation of future memory architectures, which is crucial for reasoning about memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple kernels, showing the impact of different configurations and design-space options.
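To make "highly parametrizable" concrete, a hypothetical configuration record with the kinds of knobs the abstract lists; the field names are ours for illustration, not the framework's actual API:

```python
from dataclasses import dataclass

@dataclass
class CgraConfig:
    rows: int = 8                 # processing-element grid height
    cols: int = 8                 # processing-element grid width
    data_width: int = 32          # data-type width in bits
    simd_lanes: int = 4           # SIMD width per processing element
    topology: str = "mesh"        # connectivity, e.g. "mesh" or "torus"
    heterogeneous: bool = False   # mix of PE types within one fabric

# A design-space point to hand to the simulator in an exploration sweep.
config = CgraConfig(rows=16, cols=16, data_width=16, simd_lanes=8)
```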
{"title":"A Template-based Framework for Exploring Coarse-Grained Reconfigurable Architectures","authors":"Artur Podobas, K. Sano, S. Matsuoka","doi":"10.1109/ASAP49362.2020.00010","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00010","url":null,"abstract":"Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA, by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs have been used in several embedded applications, including automobile, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for use in evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike other related work, our framework supports co-integration with third-party memory simulators or evaluation of future memory architecture, which is crucial to reason around memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple different kernels, showing the impact of different configuration and design-space options.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128212961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}