{"title":"Optimal hardware implementation for end-to-end CNN-based classification","authors":"S. Aydin, H. Ş. Bilge","doi":"10.1109/ICITIIT57246.2023.10068601","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNN) show promising results in many fields, especially in computer vision tasks. However, implementing these networks requires computationally intensive operations. Increasing computational workloads makes it difficult to use CNN models in real-time applications. To overcome these challenges, CNN must be implemented on a dedicated hardware platform such as a field-programmable gate array (FPGA). The parallel processing and reconfigurable features of FPGA hardware make it suitable for real-time applications. Nevertheless, due to limited resources and memory units, various optimizations must be applied prior to implementing processing-intensive structures. Both the resources and the memory units used in hardware applications are affected by the data types and byte lengths used to display data. This study proposes arbitrary, precision fixed-point data types for optimal end-to-end CNN hardware implementation. The network was trained on the Central Processing Unit (CPU) to address the classification problem. The CNN architecture was implemented on a Zynq-7 ZC702 evaluation board with a target device xc7z020clg484-1 platform utilizing high level synthesis (HLS) for the inference stage, based on the calculated weight parameters, and predetermined hyperparameters. The proposed implementation produced the results in 0.00329 s based on hardware implementation. In terms of latency metrics, the hardware-based CNN application produced a response approximately 18.9 times faster than the CPU-based CNN application in the inference phase while retaining the same accuracy. In terms of memory utilization and calculation units, the proposed design uses 52% fewer memory units and 68% fewer calculation units than the baseline design. 
While the proposed method used fewer resources, the classification success remained at 98.9%.","PeriodicalId":170485,"journal":{"name":"2023 4th International Conference on Innovative Trends in Information Technology (ICITIIT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 4th International Conference on Innovative Trends in Information Technology (ICITIIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICITIIT57246.2023.10068601","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Convolutional neural networks (CNNs) show promising results in many fields, especially in computer vision tasks. However, these networks require computationally intensive operations, and the growing computational workload makes it difficult to use CNN models in real-time applications. To overcome this challenge, CNNs can be implemented on a dedicated hardware platform such as a field-programmable gate array (FPGA). The parallel processing and reconfigurability of FPGAs make them well suited to real-time applications. Nevertheless, because resources and memory are limited, various optimizations must be applied before implementing processing-intensive structures. Both the logic resources and the memory units used in a hardware design depend on the data types and bit widths used to represent data. This study proposes arbitrary-precision fixed-point data types for an optimized end-to-end CNN hardware implementation. The network was trained on a central processing unit (CPU) to address the classification problem. For the inference stage, the CNN architecture was implemented with high-level synthesis (HLS) on a Zynq-7 ZC702 evaluation board (target device xc7z020clg484-1), using the trained weight parameters and predetermined hyperparameters. The proposed hardware implementation produced results in 0.00329 s. In terms of latency, the hardware-based CNN inference was approximately 18.9 times faster than the CPU-based CNN inference while retaining the same accuracy. In terms of resource utilization, the proposed design uses 52% fewer memory units and 68% fewer calculation units than the baseline design. Despite using fewer resources, the proposed method maintained a classification accuracy of 98.9%.