
2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Papers

Network Intrusion Detection System Using Deep Learning Method with KDD Cup'99 Dataset
Jesse Jeremiah Tanimu, Mohamed Hamada, Patience Robert, Anish Mahendran
This work presents a network intrusion detection system based on a deep sparse autoencoder, which addresses the interpretability issue of the L2 regularization technique used in other works. The proposed model was trained using mini-batch gradient descent, L1 regularization, and the ReLU activation function to achieve better performance. Results on the KDD Cup'99 dataset show that our approach provides significant performance improvements over other deep sparse autoencoder network intrusion detection systems.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00047 (published 2022-12-01)
Citations: 1
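The training ingredients named in the abstract above (ReLU, an L1 penalty, and mini-batch gradient descent) can be illustrated with a minimal Python sketch. This is not the authors' model: the function names, the plain-list weight representation, and the use of the sign subgradient for the L1 term are assumptions made only for illustration.

```python
def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return x if x > 0.0 else 0.0


def l1_penalty(weights, lam):
    """L1 regularization term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def sign(w):
    """Subgradient of |w| (0 is used at w == 0)."""
    return (w > 0) - (w < 0)


def minibatch_step(weights, batch_grads, lr, lam):
    """One mini-batch update (hypothetical helper).

    batch_grads holds one loss-gradient vector per sample in the batch;
    the vectors are averaged, then the L1 subgradient is added.
    """
    n = len(batch_grads)
    avg = [sum(g[i] for g in batch_grads) / n for i in range(len(weights))]
    return [w - lr * (g + lam * sign(w)) for w, g in zip(weights, avg)]


# With zero loss gradient, the L1 term alone shrinks weights toward zero.
updated = minibatch_step([1.0, -1.0], [[0.0, 0.0], [0.0, 0.0]], lr=0.5, lam=0.2)
```

Because the L1 term pulls every nonzero weight toward zero by a constant amount per step, small weights tend to vanish, which is the usual argument for L1 yielding sparser models than L2.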
Accelerating Non-Negative Matrix Factorization on Embedded FPGA with Hybrid Logarithmic Dot-Product Approximation
Yizhi Chen, Yarib Nevarez, Zhonghai Lu, A. García-Ortiz
Non-negative matrix factorization (NMF) is an effective method for dimensionality reduction and sparse decomposition. This method has been of great interest to the scientific community in applications including signal processing, data mining, compression, and pattern recognition. However, NMF implies elevated computational costs in terms of performance and energy consumption, which makes it ill-suited to embedded applications. To overcome this limitation, we implement the vector dot-product with hybrid logarithmic approximation as a hardware optimization approach. This technique accelerates floating-point computation, reduces energy consumption, and preserves accuracy. To demonstrate our approach, we employ a design exploration flow using high-level synthesis on an embedded FPGA. Compared with software solutions on an ARM CPU, this hardware implementation accelerates the overall matrix decomposition by $5.597\times$ and reduces energy consumption by $69.323\times$. Log-approximation NMF combined with KNN (k-nearest neighbors) loses only 2.38% accuracy compared with KNN applied to the matrix produced by floating-point NMF on MNIST. Furthermore, compared with a dedicated floating-point accelerator, the logarithmic approximation approach achieves $3.718\times$ acceleration and $8.345\times$ energy reduction. Compared with the fixed-point approach, our approach has an accuracy degradation of 1.93% on MNIST and an accuracy improvement of 28.2% on the FASHION MNIST data set without prior knowledge of the data range. Thus, our approach has better compatibility with the input data range.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00070 (published 2022-12-01)
Citations: 0
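The abstract above does not spell out its hybrid logarithmic scheme, so the following Python sketch uses a different, well-known stand-in: Mitchell's approximation, where log2(1 + f) is taken as f for f in [0, 1). It only illustrates the general idea of replacing multiplications in a dot product with additions in the log domain. All function names are hypothetical, and inputs are assumed non-negative, as in NMF.

```python
import math


def mitchell_log2(x):
    """Approximate log2(x) for x > 0 as exponent + linear mantissa term."""
    m, e = math.frexp(x)   # x = m * 2**e with m in [0.5, 1)
    f = 2.0 * m - 1.0      # rewrite as x = (1 + f) * 2**(e - 1), f in [0, 1)
    return (e - 1) + f


def mitchell_exp2(y):
    """Approximate 2**y with the inverse of the same linear mantissa rule."""
    k = math.floor(y)
    f = y - k
    return (1.0 + f) * 2.0 ** k


def approx_mul(a, b):
    """Multiply two non-negative numbers by adding approximate logs."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return mitchell_exp2(mitchell_log2(a) + mitchell_log2(b))


def approx_dot(u, v):
    """Dot product in which each product uses the log-domain shortcut."""
    return sum(approx_mul(a, b) for a, b in zip(u, v))


exact = sum(a * b for a, b in zip([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
approx = approx_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])                 # 30.0
```

Products of powers of two are exact under this rule, while other products are underestimated by at most about 11.1%, which is why such schemes trade a bounded accuracy loss for cheaper arithmetic.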
Exploration of an Enhanced Scheduling Approach with Feasibility Analysis on a Single CPU System
Vijayalakshmi Saravanan, Gang Wan, A. Pillai
Developing a new scheduling algorithm and analyzing its performance to understand its effect in practice can be a laborious task. CPU scheduling is crucial in achieving the operating system's (OS) design goals. A variety of scheduling algorithms exist in the field, and this paper compares the performance of different existing scheduling algorithms by simulating the same bundle of tasks. A variety of algorithms under batch OS and time-sharing OS are considered. Based on this analysis, a novel task scheduling algorithm incorporating the merits of existing algorithms is proposed for a single CPU system. The performance of the various algorithms is compared with that of the proposed algorithm on the following parameters: throughput, CPU utilization, average turnaround time, waiting time, and response time. Extensive simulation analysis is conducted for various bundles of tasks, and the proposed algorithm is found to outperform the other algorithms in terms of guaranteed reduced average response time. Thus, an efficient CPU scheduler is proposed to accommodate varying workloads at run-time, making the best use of the CPU in a particular execution scenario.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00037 (published 2022-12-01)
Citations: 1
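The comparison methodology described in the abstract above (running the same bundle of tasks through different schedulers and comparing waiting and turnaround times) can be sketched in Python. The proposed algorithm itself is not described in the abstract, so this sketch covers only two textbook non-preemptive policies, first-come-first-served and shortest-job-first. The task set and all names are made up for illustration.

```python
def run_nonpreemptive(tasks, pick):
    """Simulate a single CPU; tasks are (name, arrival, burst) tuples.

    pick chooses the next task among those that have already arrived.
    """
    pending = sorted(tasks, key=lambda t: t[1])
    clock = 0
    stats = {}
    while pending:
        ready = [t for t in pending if t[1] <= clock]
        if not ready:
            clock = pending[0][1]  # CPU idles until the next arrival
            continue
        task = pick(ready)
        pending.remove(task)
        name, arrival, burst = task
        start = clock
        clock += burst
        stats[name] = {"waiting": start - arrival,
                       "turnaround": clock - arrival}
    return stats


def fcfs(ready):
    """First-come-first-served: earliest arrival runs next."""
    return min(ready, key=lambda t: t[1])


def sjf(ready):
    """Shortest-job-first: smallest burst time runs next."""
    return min(ready, key=lambda t: t[2])


def average(stats, key):
    return sum(s[key] for s in stats.values()) / len(stats)


TASKS = [("A", 0, 8), ("B", 1, 4), ("C", 2, 2)]  # hypothetical workload
fcfs_wait = average(run_nonpreemptive(TASKS, fcfs), "waiting")
sjf_wait = average(run_nonpreemptive(TASKS, sjf), "waiting")
```

For non-preemptive policies the response time equals the waiting time, so a preemptive policy such as round robin would need a separate simulation loop to show a difference between the two.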
Digital Computation-in-Memory Design with Adaptive Floating Point for Deep Neural Networks
Yunhan Yang, Wei Lu, Po-Tsang Huang, Hung-Ming Chen
All-digital deep neural network (DNN) accelerators or processors suffer from the von Neumann bottleneck because of the massive data movement required in DNNs. Computation-in-memory (CIM) can reduce the data movement by performing the computations in the memory to solve this problem. However, analog CIM is susceptible to PVT variations and limited by the analog-digital/digital-analog conversions (ADC/DAC). Most current digital CIM techniques adopt integer operation and the bit-serial method, which limits the throughput to the total number of bits. Moreover, they use an adder tree for accumulation, which causes severe area overhead. In this paper, a folded architecture based on time-division multiplexing is proposed to reduce the area and improve the energy efficiency without reducing the throughput. We quantize and ternarize the adaptive floating point (ADP) format with low bits, which can achieve the same or better accuracy than integer quantization, to reduce the energy cost of calculation and data movement. The proposed technique can improve the overall throughput and energy efficiency by up to 3.83x and 2.19x, respectively, compared to other state-of-the-art digital CIMs that use integer formats.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00042 (published 2022-12-01)
Citations: 0
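The bit-serial method that the abstract above attributes to existing digital CIM designs can be sketched in Python to show why the cycle count grows with the number of input bits. This is a generic software model rather than the proposed folded architecture: it assumes unsigned integer activations and signed integer weights, and the function name is hypothetical.

```python
def bit_serial_dot(activations, weights, bits):
    """Dot product that consumes one activation bit per 'cycle'.

    activations: unsigned integers that fit in `bits` bits.
    weights: signed integers held in the (modelled) memory array.
    """
    acc = 0
    for b in range(bits):  # one pass per bit position, LSB first
        # Sum the weights whose activation has bit b set. In hardware this
        # accumulation is the part typically done by an adder tree.
        partial = sum(w for a, w in zip(activations, weights) if (a >> b) & 1)
        acc += partial << b  # weight the partial sum by 2**b
    return acc


result = bit_serial_dot([3, 5, 2], [1, -2, 4], bits=4)  # 3*1 + 5*(-2) + 2*4
```

The loop runs `bits` times regardless of the vector length, which is the sense in which bit-serial throughput is bounded by the operand bit width.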
Design and FPGA Implementation of Lite Convolutional Neural Network Based Hardware Accelerator for Ocular Biometrics Recognition Technology
Wei-Che Sun, Chih-Peng Fan, Chung-Bin Wu
In this study, an effective low-complexity Convolutional Neural Network (CNN) inference network is implemented on an FPGA-based hardware accelerator for biometric authentication. After the labeling process, eye images with partial iris and sclera zones are used to train and test the LeNet-based Lite-CNN model. The lightweight CNN classifier is then rapidly prototyped on an FPGA for hardware acceleration. In testing, the proposed Lite-CNN model achieves up to 98% recognition accuracy on the eye images. Compared with the software-based implementation, the proposed Lite-CNN hardware accelerator provides similar detection accuracy, and its inference time of 0.0246 seconds is about 377 times faster on the Xilinx ZCU102 FPGA platform. In addition, compared with the previous FPGA implementation based on high-level synthesis, the proposed hardware acceleration design is more than about 92 times faster.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00051 (published 2022-12-01)
Citations: 0
Design and Analysis of a Nano-photonic Processing Unit for Low-Latency Recurrent Neural Network Applications
Eito Sato, Koji Inoue, Satoshi Kawakami
Recurrent neural networks (RNNs) have achieved high performance in inference processing that handles time-series data. Hardware acceleration of RNNs is helpful for tasks where real-time performance is essential, such as speech recognition and stock market prediction. The nano-photonic neural network accelerator is an approach that takes advantage of the high speed, high parallelism, and low power consumption of light to achieve high performance in neural network processing. However, existing methods are inefficient for RNNs due to significant overhead caused by the absence of recursive paths and the immaturity of the model to be designed. Therefore, architectural considerations that take advantage of RNN characteristics are essential for low latency. This paper proposes a fast and low-power processing unit for RNNs that introduces activation functions and recursion processing using optical devices. We clarified the impact of noise on the proposed circuit's calculation accuracy and inference accuracy. The calculation accuracy deteriorated significantly in proportion to the increase in the number of recursions, but the effect on inference accuracy was negligible. We also compared the performance of the proposed circuit to an all-electrical design and a hybrid design that processes the vector-matrix product optically and the recursion electrically. The proposed circuit improves latency by 467x and reduces power consumption by 93.0% compared with the all-electrical design, and improves latency by 7.3x and reduces power consumption by 58.6% compared with the hybrid design.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00058 (published 2022-12-01)
Citations: 0
A Reconfigurable Design of Flexible-arbitrated Crossbar Interconnects in Multi-core SoC system
Xuewen He, Yajie Wu, Yichuan Bai, Jie Liu, Li Du, Yuan Du
In multi-core SoC systems, many specifications must be considered to optimize the interconnect bus architecture, such as the arbitration mechanism, latency, area, and power consumption. This paper proposes a reconfigurable design of a flexible-arbitrated crossbar to analyze the relevant factors and to improve performance and practicality through the reconfigurable implementation. Two priority matching algorithms are proposed in the design to offer more flexible arbitration choices for the application scenarios of multi-core SoCs. Moreover, the static and dynamic reconfiguration proposed in the paper provides a valuable reference for the design of bus structures in SoC systems. Compared with the original design in the case analysis, the reconfigurable design achieves 23.3% smaller area, 15.7% less latency, and 23% power saving.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00064 (published 2022-12-01)
Citations: 0
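The abstract above does not name its two priority matching algorithms. As a generic illustration of what an arbitration mechanism on one crossbar output port decides, the Python sketch below models two common textbook arbiters, fixed priority and round robin. These are assumptions chosen for illustration and should not be read as the paper's algorithms; the function names are hypothetical.

```python
def fixed_priority(requests):
    """Grant the lowest-index requester; index 0 has the highest priority."""
    for i, requesting in enumerate(requests):
        if requesting:
            return i
    return None  # no master is requesting this port


def round_robin(requests, last_grant):
    """Grant the first requester after last_grant, wrapping around."""
    n = len(requests)
    for offset in range(1, n + 1):
        i = (last_grant + offset) % n
        if requests[i]:
            return i
    return None


# Four masters contend for one slave port; masters 0, 1 and 3 are requesting.
reqs = [True, True, False, True]
grant_fixed = fixed_priority(reqs)             # master 0 always wins
grant_rr = round_robin(reqs, last_grant=1)     # the turn passes to master 3
```

Fixed priority is cheap but can starve low-priority masters, while round robin bounds the wait of every requester. A reconfigurable crossbar could in principle switch between such policies per port.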
Driver Status Monitoring System with Feedback from Fatigue Detection and Lane Line Detection
Kai Yan, Chaoyue Zhao, Chengkang Shen, Peiyan Wang, Guoqing Wang
Automobiles have become an indispensable part of life for both business and pleasure in today's society. Because of long periods of continuous work, fatigue presents a great danger to ride-sharing and truck drivers. Therefore, this paper aims to design a device that provides valuable feedback by evaluating driver status and surroundings. A gradient judgment is made through lane detection and face detection. When a dangerous condition is detected, the driver is alerted by music and audio announcements of varying intensity. The system also has two additional functions. The first is digital record-keeping to assist professional drivers. The second is a security system that sends a text message to the owner's phone if a stranger starts the car. The proposed system's efficacy and efficiency in driver fatigue detection are validated qualitatively and quantitatively against previous works.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00035 (published 2022-12-01)
Citations: 0
A Message Passing Interface Library for High-Level Synthesis on Multi-FPGA Systems
Kazuei Hironaka, Kensuke Iizuka, H. Amano
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is a lack of support for a programming interface. Implementing and debugging an application on multiple FPGA boards is difficult without a standard interface. Message Passing Interface (MPI) is a standard parallel programming interface commonly used in distributed memory systems. This paper presents a tool-independent MPI library called FiC-MPI that can be used in HLS for multi-FPGA systems in which the FPGA nodes are directly connected. By using FiC-MPI, various parallel software, including a general-purpose benchmark, can be easily implemented. FiC-MPI was implemented and evaluated on the M-KUBOS cluster, which consists of Zynq MPSoC boards connected by a static time-division multiplexing network. By using the FiC-MPI simulator, parallel programs can be debugged before being implemented on real machines. As a case study, the Himeno-BMT benchmark was implemented with FiC-MPI. It achieved 178.7 MFLOPS with a single node and scaled to 643.7 MFLOPS with four nodes and 896.9 MFLOPS with six nodes of the M-KUBOS cluster. The implementation demonstrated the ease of developing parallel programs with FiC-MPI on multi-FPGA systems.
DOI: https://doi.org/10.1109/MCSoC57363.2022.00017 | Published: 2022-12-01
Citations: 0
Evaluating the Optimal Self-Checking Carry Propagate Adder for Cryptographic Processor
M.A. Akbar, Bo Wang, A. Bermak
As invasive attacks grow in number, cryptographic processors are becoming more susceptible to failure, and reliable hardware is increasingly important. Because the adder is a vital component in hardware implementations of cryptographic protocols, a reliable adder can significantly reduce vulnerability to invasive attacks. Adders with different architectures have been widely studied and analyzed, and appropriate types have been proposed for particular applications. This paper considers which adder design is most suitable for reliable cryptographic operation and investigates the optimal self-checking carry propagate adder design, meaning the one that offers the best performance in terms of latency, delay, and area. In terms of area versus delay, the self-checking parallel ripple carry adder (PRCA) incurs a 23.4% area overhead compared with the self-checking ripple carry adder (RCA) and provides a delay efficiency of 70.31%. However, the area-delay product for 64-bit self-checking designs shows that the hybrid adder is 71.2%, 21.4%, and 37.9% more efficient than the RCA, PRCA, and carry look-ahead adder designs, respectively.
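The percentages in the abstract can be cross-checked with a short area-delay product (ADP) calculation. This is a sketch under two stated assumptions that the abstract does not spell out: "delay efficiency of 70.31%" is read as a 70.31% lower delay than the RCA, and "X% more efficient" is read as an X% lower ADP. It also assumes the PRCA area and delay figures apply to the same 64-bit designs as the ADP comparison. All variable names are illustrative.

```python
# Normalise the self-checking RCA to area = 1 and delay = 1, so its ADP is 1.
rca_adp = 1.0

# ASSUMPTION: "23.4% area overhead" means 1.234x the RCA area, and
# "delay efficiency of 70.31%" means a delay 70.31% lower than the RCA's.
prca_area = 1.0 + 0.234
prca_delay = 1.0 - 0.7031
prca_adp = prca_area * prca_delay

# ASSUMPTION: "71.2% more efficient than the RCA" means a 71.2% lower ADP.
hybrid_adp = rca_adp * (1.0 - 0.712)

# Gain of the hybrid adder over the PRCA implied by the figures above.
implied_gain_over_prca = 1.0 - hybrid_adp / prca_adp

print(f"PRCA ADP relative to RCA:   {prca_adp:.4f}")
print(f"Hybrid ADP relative to RCA: {hybrid_adp:.4f}")
print(f"Implied hybrid gain over PRCA: {implied_gain_over_prca:.1%}")
```

Under these readings the implied gain of the hybrid adder over the PRCA comes out at about 21.4%, which matches the figure quoted in the abstract, so the assumed interpretation is at least internally consistent.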
DOI: https://doi.org/10.1109/MCSoC57363.2022.00011 | Published: 2022-12-01
Citations: 0