Network Intrusion Detection System Using Deep Learning Method with KDD Cup'99 Dataset
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00047
Jesse Jeremiah Tanimu, Mohamed Hamada, Patience Robert, Anish Mahendran
This work presents a deep sparse autoencoder network intrusion detection system that addresses the interpretability issue of the L2 regularization technique used in other works. The proposed model is trained with mini-batch gradient descent, L1 regularization, and the ReLU activation function to achieve better performance. Results on the KDDCUP'99 dataset show that this approach provides significant performance improvements over other deep sparse autoencoder network intrusion detection systems.
Accelerating Non-Negative Matrix Factorization on Embedded FPGA with Hybrid Logarithmic Dot-Product Approximation
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00070
Yizhi Chen, Yarib Nevarez, Zhonghai Lu, A. García-Ortiz
Non-negative matrix factorization (NMF) is an effective method for dimensionality reduction and sparse decomposition. This method has been of great interest to the scientific community in applications including signal processing, data mining, compression, and pattern recognition. However, NMF implies elevated computational costs in terms of performance and energy consumption, which is inadequate for embedded applications. To overcome this limitation, we implement the vector dot-product with hybrid logarithmic approximation as a hardware optimization approach. This technique accelerates floating-point computation, reduces energy consumption, and preserves accuracy. To demonstrate our approach, we employ a design exploration flow using high-level synthesis on an embedded FPGA. Compared with software solutions on an ARM CPU, this hardware implementation accelerates the overall matrix decomposition by 5.597x and reduces energy consumption by 69.323x. Log-approximation NMF combined with KNN (k-nearest neighbors) loses only 2.38% accuracy on MNIST compared with KNN applied to the matrix after floating-point NMF. Furthermore, compared with a dedicated floating-point accelerator, the logarithmic approximation approach achieves 3.718x acceleration and 8.345x energy reduction. Compared with the fixed-point approach, our approach has an accuracy degradation of 1.93% on MNIST and an accuracy improvement of 28.2% on the Fashion-MNIST dataset without pre-knowledge of the data range. Thus, our approach has better compatibility with the input data range.
{"title":"Accelerating Non-Negative Matrix Factorization on Embedded FPGA with Hybrid Logarithmic Dot-Product Approximation","authors":"Yizhi Chen, Yarib Nevarez, Zhonghai Lu, A. García-Ortiz","doi":"10.1109/MCSoC57363.2022.00070","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00070","url":null,"abstract":"Non-negative matrix factorization (NMF) is an ef-fective method for dimensionality reduction and sparse decom-position. This method has been of great interest to the scien-tific community in applications including signal processing, data mining, compression, and pattern recognition. However, NMF implies elevated computational costs in terms of performance and energy consumption, which is inadequate for embedded applications. To overcome this limitation, we implement the vector dot-product with hybrid logarithmic approximation as a hardware optimization approach. This technique accelerates floating-point computation, reduces energy consumption, and preserves accuracy. To demonstrate our approach, we employ a design exploration flow using high-level synthesis on an embedded FPGA. Compared with software solutions on ARM CPU, this hardware implementation accelerates the overall computation to decompose matrix by $5.597times$ and reduces energy consumption by $69.323times$. Log approximation NMF combined with KNN(k-nearest neighbors) has only 2.38% decreasing accuracy compared with the result of KNN processing the matrix after floating-point NMF on MNIST. Further on, compared with a dedicated floating-point accelerator, the logarithmic approximation approach achieves $3.718times$ acceleration and $8.345times$ energy reduction. Compared with the fixed-point approach, our approach has an accuracy degradation of 1.93% on MNIST and an accuracy amelioration of 28.2% on the FASHION MNIST data set without pre-knowledge of the data range. Thus, our approach has better compatibility with the input data range.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116058713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploration of an Enhanced Scheduling Approach with Feasibility Analysis on a Single CPU System
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00037
Vijayalakshmi Saravanan, Gang Wan, A. Pillai
Developing a new scheduling algorithm and analyzing its performance to understand its effect in practice can be a laborious task. CPU scheduling is crucial to achieving the operating system's (OS) design goals. A variety of scheduling algorithms exist in the field; in this paper, the performance of different existing scheduling algorithms is compared by simulating the same bundle of tasks. Algorithms for both batch and time-sharing OSes are considered. Based on this analysis, a novel task scheduling algorithm incorporating the merits of the existing algorithms is proposed for a single-CPU system. The performance of the various algorithms is compared with the proposed algorithm on throughput, CPU utilization, average turnaround time, waiting time, and response time. Extensive simulation over varied task bundles shows that the proposed algorithm outperforms the others in terms of guaranteed reduced average response time. The result is an efficient CPU scheduler that accommodates varying workloads at run-time, making the best use of the CPU in a given execution scenario.
{"title":"Exploration of an Enhanced Scheduling Approach with Feasibility Analysis on a Single CPU System","authors":"Vijayalakshmi Saravanan, Gang Wan, A. Pillai","doi":"10.1109/MCSoC57363.2022.00037","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00037","url":null,"abstract":"Developing a new scheduling algorithm and conducting the performance analysis to recognize its effect in practice can be a laborious task. CPU scheduling is crucial in achieving the operating system's (OS) design goals. There exists a variety of scheduling algorithms in the field and in this paper, a performance comparison of different existing scheduling algorithms by simulating the same bundle of tasks is carried out. A variety of algorithms under batch OS and time-sharing OS are considered. Upon the analysis, a novel task scheduling algorithm incorporating the merits of existing algorithms is proposed for a single CPU system. The performance of various algorithms is compared with the proposed algorithm for parameters viz., throughput, CPU utilization, average turnaround time, waiting time, and response time. Extensive simulation analysis for the various bundle of tasks is conducted and the proposed algorithm is found to outperform the other algorithms in terms of guaranteed reduced average response time. Thus, an efficient CPU scheduler is proposed to accommodate varying workloads at run-time making the best use of the CPU in a particular execution scenario.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130533709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Digital Computation-in-Memory Design with Adaptive Floating Point for Deep Neural Networks
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00042
Yunhan Yang, Wei Lu, Po-Tsang Huang, Hung-Ming Chen
All-digital deep neural network (DNN) accelerators and processors suffer from the von Neumann bottleneck because of the massive data movement DNNs require. Computation-in-memory (CIM) can alleviate this problem by performing computations inside the memory, reducing data movement. However, analog CIM is susceptible to PVT variations and limited by analog-digital/digital-analog conversions (ADC/DAC). Most current digital CIM techniques adopt integer operations and the bit-serial method, which limits throughput by the total number of bits. Moreover, they use an adder tree for accumulation, which causes severe area overhead. In this paper, a folded architecture based on time-division multiplexing is proposed to reduce area and improve energy efficiency without reducing throughput. We quantize and ternarize the adaptive floating point (ADP) format with low bit widths, which achieves the same or better accuracy than integer quantization, to reduce the energy cost of computation and data movement. The proposed technique improves overall throughput and energy efficiency by up to 3.83x and 2.19x, respectively, compared with other state-of-the-art integer-based digital CIMs.
Design and FPGA Implementation of Lite Convolutional Neural Network Based Hardware Accelerator for Ocular Biometrics Recognition Technology
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00051
Wei-Che Sun, Chih-Peng Fan, Chung-Bin Wu
In this study, an effective low-complexity Convolutional Neural Network (CNN) inference network is implemented as an FPGA-based hardware accelerator for biometric authentication. After labeling, eye images with partial iris and sclera zones are used to train and test the LeNet-based Lite-CNN model. The lightweight CNN classifier is then rapidly prototyped on an FPGA for hardware acceleration. In testing, the proposed Lite-CNN model achieves up to 98% recognition accuracy on the eye images. Compared with a software-based implementation, the proposed Lite-CNN hardware accelerator provides similar detection accuracy, and its inference time of 0.0246 seconds represents a speedup of about 377 times on the Xilinx ZCU102 FPGA platform. Furthermore, compared with a previous FPGA implementation designed via high-level synthesis, the proposed hardware accelerator improves computing speed by more than about 92 times.
{"title":"Design and FPGA Implementation of Lite Convolutional Neural Network Based Hardware Accelerator for Ocular Biometrics Recognition Technology","authors":"Wei-Che Sun, Chih-Peng Fan, Chung-Bin Wu","doi":"10.1109/MCSoC57363.2022.00051","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00051","url":null,"abstract":"In this study, the effective low-complexity Convolutional Neural Network (CNN) inference network is implemented by the FPGA-based hardware accelerator for the biometric authentications. After the labeling processes, the eye images with partial iris and sclera zones are used to train and test the LeNet-based Lite-CNN model. Then the lightweight CNN classifier is rapidly prototyped via FPGA for hardware acceleration. Through testing, the proposed Lite-CNN model achieves up to 98% recognition accuracy with the eye images. Compared with the software-based implementation, the proposed Lite-CNN hardware accelerator provides similar detection accuracy, and the inference time of 0.0246 seconds is accelerated about 377 times on the Xilinx ZCU102 FPGA platform. Besides, compared with the previous FPGA implementation by the high level synthesis design, the proposed hardware acceleration design performs the computing speed more than about 92 times.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128054431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Analysis of a Nano-photonic Processing Unit for Low-Latency Recurrent Neural Network Applications
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00058
Eito Sato, Koji Inoue, Satoshi Kawakami
Recurrent neural networks (RNNs) have achieved high performance in inference over time-series data. Hardware acceleration for fast RNN processing is helpful for tasks where real-time performance is essential, such as speech recognition and stock market prediction. The nano-photonic neural network accelerator is an approach that exploits the high speed, high parallelism, and low power consumption of light to achieve high performance in neural network processing. However, existing methods are inefficient for RNNs due to the significant overhead caused by the absence of recursive paths and the immaturity of the models to be designed. Architectural considerations that take advantage of RNN characteristics are therefore essential for low latency. This paper proposes a fast, low-power processing unit for RNNs that implements activation functions and recursion processing using optical devices. We clarify the impact of noise on the proposed circuit's calculation accuracy and inference accuracy. The calculation accuracy deteriorates significantly in proportion to the number of recursions, but the effect on inference accuracy is negligible. We also compare the performance of the proposed circuit to an all-electric design and a hybrid design that processes the vector-matrix product optically and the recursion electrically. The proposed circuit improves latency by 467x and reduces power consumption by 93.0% compared with the all-electrical design, and improves latency by 7.3x and reduces power consumption by 58.6% compared with the hybrid design.
{"title":"Design and Analysis of a Nano-photonic Processing Unit for Low-Latency Recurrent Neural Network Applications","authors":"Eito Sato, Koji Inoue, Satoshi Kawakami","doi":"10.1109/MCSoC57363.2022.00058","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00058","url":null,"abstract":"Recurrent neural networks (RNNs) have achieved high performance in inference processing that handles time-series data. Among them, hardware acceleration for fast processing RNNs is helpful for tasks where real-time performance is es-sential, such as speech recognition and stock market prediction. The nano-photonic neural network accelerator is an approach that takes advantage of the high speed, high parallelism, and low power consumption of light to achieve high performance in neural network processing. However, existing methods are inefficient for RNNs due to significant overhead caused by the absence of recursive paths and the immaturity of the model to be designed. Therefore, architectural considerations that take advantage of RNN characteristics are essential for low latency. This paper proposes a fast and low-power processing unit for RNNs that introduces activation functions and recursion processing using optical devices. We clarified the impact of noise on the proposed circuit's calculation accuracy and inference accuracy. As a result, the calculation accuracy deteriorated significantly in proportion to the increase in the number of recursions, but the effect on inference accuracy was negligible. We also compared the performance of the proposed circuit to an all-electric design and a hybrid design that processes the vector-matrix product optically and the recursion electrically. As a result, the performance of the proposed circuit improves latency by 467x, reduces power consumption by 93.0% compared with the all-electrical design, improves latency by 7.3x, and reduces power consumption by 58.6% compared with the hybrid design.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124528212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Reconfigurable Design of Flexible-arbitrated Crossbar Interconnects in Multi-core SoC system
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00064
Xuewen He, Yajie Wu, Yichuan Bai, Jie Liu, Li Du, Yuan Du
In multi-core SoC systems, many specifications must be considered when optimizing the interconnect bus architecture, such as the arbitration mechanism, latency, area, and power consumption. This paper proposes a reconfigurable design of a flexible-arbitrated crossbar, analyzing the relevant factors and improving performance and practicality through the reconfigurable implementation. Two priority-matching algorithms are proposed to provide more flexible arbitration choices for multi-core SoC application scenarios. Moreover, the static and dynamic reconfiguration proposed in the paper provides a valuable reference for the design of bus structures in SoC systems. Compared with the original design in the case analysis, the reconfigurable design achieves a 23.3% smaller area, 15.7% lower latency, and 23% power savings.
{"title":"A Reconfigurable Design of Flexible-arbitrated Crossbar Interconnects in Multi-core SoC system","authors":"Xuewen He, Yajie Wu, Yichuan Bai, Jie Liu, Li Du, Yuan Du","doi":"10.1109/MCSoC57363.2022.00064","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00064","url":null,"abstract":"In the system of multi-core SoC, many specifications need to be considered to optimize interconnect bus architecture, such as the arbitration mechanism, latency, area and power consumption. This paper proposes a reconfigurable design of flexible-arbitrated crossbar to analyze the relevant factors and improve the performance and practicality with the reconfigurable implementation. Two priority matching algorithms are proposed in the design to meet more flexible-arbitrated choices for the application scenarios of multi-core SoC. Moreover, the static and dynamic reconfiguration proposed in the paper provides a valuable reference for the design of bus structure in SoC systems. Compared with the original design in the case analysis, the reconfigurable design achieves 23.3% smaller area, 15.7% less latency, and 23% power saving.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121642582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Driver Status Monitoring System with Feedback from Fatigue Detection and Lane Line Detection
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00035
Kai Yan, Chaoyue Zhao, Chengkang Shen, Peiyan Wang, Guoqing Wang
Automobiles have become an indispensable part of life, for both business and pleasure, in today's society. Because of long, continuous working hours, fatigue presents a great danger to ride-sharing and truck drivers. This paper therefore designs a device that provides valuable feedback by evaluating driver status and surroundings. A graded judgment is made through lane detection and face detection; when a dangerous condition is detected, the driver is alerted by music and audio announcements of different degrees of urgency. The system also has two additional functions: digital record-keeping to assist the professional driver, and a security feature that sends a text message to the owner's phone if a stranger starts the car. Compared with previous works, the proposed system's efficacy and efficiency in driver fatigue detection are validated both qualitatively and quantitatively.
{"title":"Driver Status Monitoring System with Feedback from Fatigue Detection and Lane Line Detection","authors":"Kai Yan, Chaoyue Zhao, Chengkang Shen, Peiyan Wang, Guoqing Wang","doi":"10.1109/MCSoC57363.2022.00035","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00035","url":null,"abstract":"Automobiles have become an indispensable part of life for both business and pleasure in today's society. Because of the long-term continuous work, fatigue presents a great danger to ride-sharing and truck drivers. Therefore, this paper aims to design a device that provides valuable feedback by evaluating driver status and surroundings. A gradient judgment is made through lane detection and face detection. When a dangerous condition is detected, the driver will be alerted by music and audio announcements with different degrees. The system also has two additional functions. First, a digital record-keeping to assist the professional driver. The other is a security system that if a stranger starts the car, a text message will be sent to the owner's phone. Compared with those in previous works, the proposed system's efficacy and efficiency are validated qualitatively and quantitatively in driver fatigue detection.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114213003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Message Passing Interface Library for High-Level Synthesis on Multi-FPGA Systems
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00017
Kazuei Hironaka, Kensuke Iizuka, H. Amano
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is the lack of a supported programming interface. Implementing and debugging an application across multiple FPGA boards is difficult without a standard interface. The Message Passing Interface (MPI) is a standard parallel programming interface commonly used in distributed memory systems. This paper presents a tool-independent MPI library called FiC-MPI that can be used in HLS for multi-FPGA systems in which the FPGA nodes are directly connected. With FiC-MPI, various parallel software, including general-purpose benchmarks, can be easily implemented. FiC-MPI was implemented and evaluated on the M-KUBOS cluster, which consists of Zynq MPSoC boards connected by a static time-division multiplexing network. Using the FiC-MPI simulator, parallel programs can be debugged before being deployed on real machines. As a case study, the Himeno-BMT benchmark was implemented with FiC-MPI, achieving 178.7 MFLOPS on a single node and scaling to 643.7 MFLOPS on four nodes and 896.9 MFLOPS on six nodes of the M-KUBOS cluster. This implementation demonstrates the ease of developing parallel programs with FiC-MPI on multi-FPGA systems.
{"title":"A Message Passing Interface Library for High-Level Synthesis on Multi-FPGA Systems","authors":"Kazuei Hironaka, Kensuke Iizuka, H. Amano","doi":"10.1109/MCSoC57363.2022.00017","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00017","url":null,"abstract":"One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is a lack of support for a programming interface. Implementing and debugging an application on multiple FPGA boards is difficult without a standard interface. Message Passing Interface (MPI) is a standard parallel programming interface commonly used in distributed memory systems. This paper presents a tool-independent MPI library called FiC-MPI that can be used in HLS for multi-FPGA systems in which each FPGA node is connected directly. By using FiC-MPI, various parallel software, including a general-purpose benchmark, can be easily implemented. FiC-MPI was implemented and evaluated on the M-KUBOS cluster consisting of Zynq MPSoC boards connected with a static time-division multiplexing network. By using the FiC-MPI simulator, parallel programs can be debugged before implementing on real machines. As a case study, the Himeno-BMT benchmark was implemented with FiC-MPI. It achieved 178.7 MFLOPS with a single node and scaled to 643.7 MFLOPS with four nodes, and 896.9 MFLOPS with six nodes of the M-KUBOS cluster. Through the implementation, the easiness of developing parallel programs with FiC-MPI on multi-FPGA systems was demonstrated.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117263539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Optimal Self-Checking Carry Propagate Adder for Cryptographic Processor
Pub Date: 2022-12-01 | DOI: 10.1109/MCSoC57363.2022.00011
M.A. Akbar, Bo Wang, A. Bermak
With the increasing number of invasive attacks, cryptographic processors are becoming more susceptible to failure, so reliable hardware is increasingly important. Since the adder is a vital component in the hardware design of cryptographic protocols, a reliable adder can significantly reduce vulnerability to invasive attacks. Adders with different architectures have been widely studied and analyzed, and appropriate types have been proposed for different applications. This paper considers the adder design most suitable for reliable cryptographic operation and investigates the optimal self-checking carry propagate adder design, seeking the best possible performance in terms of latency, delay, and area. In terms of area versus delay, the self-checking parallel ripple carry adder (PRCA), at a 23.4% area overhead relative to the self-checking ripple carry adder (RCA), provides a delay improvement of 70.31%. In terms of the area-delay product for 64-bit self-checking designs, however, the hybrid adder is 71.2%, 21.4%, and 37.9% more efficient than the RCA, the PRCA, and the carry look-ahead adder, respectively.
{"title":"Evaluating the Optimal Self-Checking Carry Propagate Adder for Cryptographic Processor","authors":"M.A. Akbar, Bo Wang, A. Bermak","doi":"10.1109/MCSoC57363.2022.00011","DOIUrl":"https://doi.org/10.1109/MCSoC57363.2022.00011","url":null,"abstract":"With the increasing number of invasive attacks, cryptographic processors are becoming more susceptible to failure. Therefore, the desire for reliable hardware is becoming increasingly important. Since an adder is a vital component in the hardware design of cryptographic protocols, a reliable adder can significantly improve the vulnerability against invasive attacks. Adders with different architectures have already been widely studied and analyzed and appropriate types have been proposed based on the application. This paper considers the design of adder most suitable for reliable cryptographic operation and investigates the optimal self-checking carry propagate adder design offering the best possible performance in terms of latency, delay, and area. In terms of area versus delay, the self-checking parallel ripple carry adder (PRCA) with 23.4% area overhead as compared to the self-checking ripple carry adder (RCA) provides a delay efficiency of 70.31%. However, the area-delay product for 64-bit self-checking designs showed that the hybrid adder is 71.2%, 21.4%, and 37.9% more efficient than the RCA, PRCA and carry look-ahead adder design, respectively.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128832262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}