
2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Publications

Title Page III
{"title":"Title Page III","authors":"","doi":"10.1109/mcsoc.2019.00002","DOIUrl":"https://doi.org/10.1109/mcsoc.2019.00002","url":null,"abstract":"","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130422542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Message from the Chairs
Hitesh Sajnani, Chaiyong Ragkhitwetsagul, Manishankar Mondal
Software clone research is of high relevance for software engineering research and practice. Software clones are often a result of copying and pasting as an act of ad-hoc reuse by programmers, and can occur at many levels, from simple statement sequences to blocks, methods, classes, source files, subsystems, models, architectures and entire designs, and in all software artifacts (code, models, requirements or architecture documentation, etc.). While sometimes clones have a demonstrably bad influence on code quality, other studies have shown they can have beneficial effects on the code if used carefully. In this workshop, we seek to discuss new and active results from the research community. In particular, IWSC aims to bring together researchers and practitioners to evaluate the current state of research, discuss common problems, discover opportunities for collaboration, exchange ideas, and explore synergies with similarity analysis in other areas and disciplines.
{"title":"Message from the Chairs","authors":"Hitesh Sajnani, Chaiyong Ragkhitwetsagul, Manishankar Mondal","doi":"10.1109/mcsoc.2019.00005","DOIUrl":"https://doi.org/10.1109/mcsoc.2019.00005","url":null,"abstract":"Software clone research is of high relevance for software engineering research and practice. Software clones are often a result of copying and pasting as an act of ad-hoc reuse by programmers, and can occur at many levels, from simple statement sequences to blocks, methods, classes, source files, subsystems, models, architectures and entire designs, and in all software artifacts (code, models, requirements or architecture documentation, etc.). While sometimes clones have a demonstrably bad influence on code quality, other studies have shown they can have beneficial effects on the code if used carefully. In this workshop, we seek to discuss new and active results from the research community. In particular, IWSC aims to bring together researchers and practitioners to evaluate the current state of research, discuss common problems, discover opportunities for collaboration, exchange ideas, and explore synergies with similarity analysis in other areas and disciplines.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131837140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Hotspot-Pattern-Aware Routing Algorithm for Networks-on-Chip
Yaoying Luo, M. Meyer, Xin Jiang, Takahiro Watanabe
The Network-on-Chip (NoC) is widely accepted as an advanced on-chip interconnect that replaces the traditional bus structure, and it is a promising solution for future many-core processors thanks to its better scalability and flexibility. Routers in a NoC make forwarding decisions according to a routing algorithm, and many routing algorithms have been proposed to improve NoC performance. Some of them excel only under a specific traffic pattern and perform poorly under others; compared with uniform traffic, complex hotspot patterns are closer to real workloads. Traffic-aware routing algorithms address this problem, but they commonly rely on virtual channels (VCs) or routing tables to predict the future traffic distribution, which incurs power and hardware overheads that cannot be ignored. To solve these problems, this paper proposes a VC-free traffic-pattern-aware routing algorithm that combines West-first and North-last routing. The algorithm includes a hotspot-node and hotspot-pattern detection mechanism designed to improve NoC performance under different traffic patterns: a low-cost hotspot information block attached to each router processes hotspot information and detects hotspot patterns. Simulation results show that the proposed routing algorithm combines the advantages of the two existing algorithms and performs better across different traffic patterns.
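The abstract gives no pseudocode, but the two turn-model algorithms it builds on are standard. Below is a minimal behavioral sketch in Python (not the authors' implementation) of West-first and North-last routing on a 2D mesh; the `hotspot_detected` flag is a hypothetical stand-in for the proposed hotspot-pattern detection that would select between the two.

```python
# Behavioral sketch of turn-model routing on a 2D mesh (illustrative only).
# West-first: any westward hop must be taken first; North-last: a northward
# hop may only be the final direction. A hypothetical hotspot flag chooses
# between the two, standing in for the paper's detection mechanism.

def west_first(cur, dst):
    """Return the set of allowed output directions under West-first routing."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0:                      # destination lies to the west: go west first
        return {"W"}
    allowed = set()
    if dx > 0: allowed.add("E")
    if dy > 0: allowed.add("N")
    if dy < 0: allowed.add("S")
    return allowed or {"LOCAL"}

def north_last(cur, dst):
    """Return the set of allowed output directions under North-last routing."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    allowed = set()
    if dx > 0: allowed.add("E")
    if dx < 0: allowed.add("W")
    if dy < 0: allowed.add("S")
    if not allowed:                 # only a northward move (or arrival) remains
        return {"N"} if dy > 0 else {"LOCAL"}
    return allowed                  # defer "N" until no other hop is needed

def route(cur, dst, hotspot_detected):
    """Pick a routing function based on a (hypothetical) hotspot-pattern flag."""
    return (north_last if hotspot_detected else west_first)(cur, dst)

print(route((1, 1), (0, 3), hotspot_detected=False))  # {'W'}
print(route((1, 1), (3, 3), hotspot_detected=True))   # {'E'}
```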
{"title":"A Hotspot-Pattern-Aware Routing Algorithm for Networks-on-Chip","authors":"Yaoying Luo, M. Meyer, Xin Jiang, Takahiro Watanabe","doi":"10.1109/MCSoC.2019.00040","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00040","url":null,"abstract":"The Networks-on-Chip (NoC) is widely accepted as an advanced on-chip system which replaces the traditional bus structure. NoC is promising as a solution for future many-core chip processor with better scalability and flexibility. Routers in NoC make the routing decision based on the routing algorithm. Many routing algorithms have been proposed to improve the performance of NoC. Some routing algorithms only have superiority under a specific traffic pattern, but they can have poor performance under other traffic patterns. Compared to uniform traffic, some complex hotspot patterns are closer to reality. Traffic-aware routing algorithms are designed to solve this problem. These traffic-aware routing algorithms commonly utilize virtual channels (VC) or routing tables to predict the future traffic distribution, which will have large power and hardware overheads that cannot be ignored. To solve these problems, a VC-free traffic-pattern-aware routing algorithm based on West-first routing and North-last routing is proposed in this paper. This algorithm contains a hotspot node and hotspot pattern detecting mechanism, which were designed to improve the performance of NoCs under different traffic patterns. A hotspot information block which has a small cost is connected to each router to deal with the hotspot information and detect the hotspot patterns. The simulation results show that routing algorithm proposed combines the advantages of the two existing routing algorithms and has better performance when considering different traffic patterns.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127627072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Performance Tuning of Tile Matrix Decomposition
Tomohiro Suzuki
Task-parallel algorithms have attracted attention in recent years as algorithms for highly parallel architectures. Their aim is to keep all computing resources busy, without stalling, by executing a large number of fine-grained tasks asynchronously while respecting data dependencies. Following this approach, the tile algorithm for the decomposition of dense matrices is implemented with a task-parallel programming model. This article considers how to select the tile size, an important performance parameter.
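Since the article is about choosing the tile size, the following sketch illustrates the trade-off it tunes: tiles that are too small make per-task overhead dominate, while tiles that are too large leave too few independent tasks to fill the machine. The tiled Cholesky and the timing loop below are illustrative only (plain NumPy, serial execution); in the task-parallel setting the independent tile kernels would be dispatched to a runtime instead.

```python
# Sketch: time a tiled (right-looking) Cholesky factorization for several
# tile sizes. Illustrative only -- a real task-parallel runtime would execute
# independent tile kernels (POTRF/TRSM/GEMM) concurrently.
import time
import numpy as np

def tiled_cholesky(a, nb):
    """In-place lower-triangular Cholesky of SPD matrix `a` using nb x nb tiles."""
    n = a.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        a[k:ke, k:ke] = np.linalg.cholesky(a[k:ke, k:ke])              # POTRF
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # TRSM: L[i,k] = A[i,k] * L[k,k]^{-T}
            a[i:ie, k:ke] = np.linalg.solve(a[k:ke, k:ke], a[i:ie, k:ke].T).T
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            for j in range(ke, ie, nb):
                je = min(j + nb, n)
                # GEMM/SYRK update of the lower trailing submatrix
                a[i:ie, j:je] -= a[i:ie, k:ke] @ a[j:je, k:ke].T
    return np.tril(a)

n = 1024
rng = np.random.default_rng(0)
m = rng.standard_normal((n, n))
spd = m @ m.T + n * np.eye(n)           # symmetric positive definite test matrix
for nb in (64, 128, 256, 512):
    t0 = time.perf_counter()
    l = tiled_cholesky(spd.copy(), nb)
    dt = time.perf_counter() - t0
    err = np.linalg.norm(l @ l.T - spd) / np.linalg.norm(spd)
    print(f"tile size {nb:4d}: {dt:.3f} s, relative error {err:.2e}")
```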
{"title":"Performance Tuning of Tile Matrix Decomposition","authors":"Tomohiro Suzuki","doi":"10.1109/MCSoC.2019.00011","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00011","url":null,"abstract":"Task parallel algorithms have attracted attention as algorithms for highly parallel architectures in recent years. The aim of such algorithms is to keep all computing resources running without stalling by executing a large number of fine-grained tasks asynchronously while observing data dependencies. The tile algorithm of matrix decomposition of dense matrices is implemented using a task parallel programming model following such an approach. In this article, we will consider how to select tile size, which is an important performance parameter.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132751342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Real-Time Implementation of Time-Space Continuous Dynamic Programming for Air-Drawn Character Recognition Using GPUs
Aki Nakamura, Y. Okuyama, R. Oka
Air-drawn character recognition is an input method based on human body movements. Time-Space Continuous Dynamic Programming (TSCDP) is one of the algorithms that can implement such a task by detecting pre-defined trajectories in input videos. Since TSCDP requires massive computation, it is hard to make the system work in real time on a single processor. In this paper, we investigated the frames-per-second (fps) requirements of an air-drawn character recognition system based on TSCDP. We analyzed the dependencies among the computations of TSCDP to parallelize them on GPUs, and evaluated the computation time on CPUs and GPUs in desktop and embedded environments. Comparing against the fps requirements, we confirmed that the proposed system processes real videos in real time in both environments.
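TSCDP itself is more involved than shown here, but the computation that motivates GPU offloading is a dynamic-programming accumulation carried out independently at every pixel, frame by frame. The sketch below uses an illustrative DTW-style recurrence (not the authors' exact formulation) written so that all per-pixel updates of one step are data-parallel, which is the structure a GPU kernel would exploit.

```python
# Illustrative DTW-style accumulation per pixel (not the exact TSCDP recurrence).
# For each video frame t and each reference-trajectory step r, every pixel keeps
# an accumulated distance. The update of one (t, r) cell depends only on cells
# from step r-1 and the previous frame, so all pixels update in parallel.
import numpy as np

H, W, T, R = 120, 160, 30, 20                 # frame size, #frames, ref length
rng = np.random.default_rng(1)
frames = rng.random((T, H, W))                # stand-in for per-pixel features
reference = rng.random(R)                     # stand-in reference trajectory

INF = np.inf
D_prev = np.full((R, H, W), INF)              # accumulated cost, previous frame

for t in range(T):
    local = np.abs(frames[t][None, :, :] - reference[:, None, None])  # (R, H, W)
    D_cur = np.full((R, H, W), INF)
    D_cur[0] = local[0]                       # a match may start at any frame
    for r in range(1, R):
        # best predecessor: stay at step r or advance from step r-1 (last frame)
        best_prev = np.minimum(D_prev[r], D_prev[r - 1])
        D_cur[r] = local[r] + best_prev
    D_prev = D_cur

# Pixels with a small accumulated cost at the final reference step are candidate
# end points of the drawn trajectory.
score = D_prev[-1]
print("best end-point cost:", float(score.min()))
```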
{"title":"Real-Time Implementation of Time-Space Continuous Dynamic Programming for Air-Drawn Character Recognition Using GPUs","authors":"Aki Nakamura, Y. Okuyama, R. Oka","doi":"10.1109/MCSoC.2019.00048","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00048","url":null,"abstract":"Air-drawn character recognition is one of the input methods using human body movements. Time-Space Continuous Dynamic Programming (TSCDP) is one of the algorithms that can implement such a task by detecting pre-defined trajectories from input videos. Since TSCDP requires massive computation, it is hard to make the system work in real-time with a single processor. In this paper, we investigated the frames per second (fps) requirements for the air-drawn character recognition system using TSCDP. We analyzed the dependencies among the calculations of TSCDP for the parallelization using GPUs. We evaluated the computation time with CPU and GPU for desktop and embedded environments. We confirmed that the proposed system works in real-time for real videos in both desktop and embedded environments by comparing with the fps requirements.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128217192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Lightweight Semantics-Preserving Communication for Real-Time Automotive Software
Eugene Yip, Erjola Lalo, Gerald Lüttgen, A. Sailer
The automotive industry is confronting the multi-core challenge, in which legacy and modern software must run correctly and efficiently in parallel, by designing its software around the Logical Execution Time (LET) model. While such designs yield implementations that are platform independent and time predictable, task communications are assumed to complete instantaneously. It is therefore critical to implement timely data transfers between LET tasks, which may run on different cores, in order to preserve a design's data flow. In this paper, we develop a lightweight Static Buffering Protocol (SBP) that satisfies the LET communication semantics and supports signal-based communication with multiple signal writers. Our simulation-based evaluation with realistic industrial automotive benchmarks shows that the execution overhead of SBP is at most half that of the traditional Point-To-Point (PTP) communication method. Moreover, SBP needs on average 60% less buffer memory than PTP.
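LET semantics pin down when data logically moves: a task's inputs are sampled at its logical release and its outputs become visible at its logical deadline, no matter when the code actually executes in between. A common way to obtain this behaviour is double buffering; the sketch below illustrates that idea only and is not the paper's SBP, which additionally supports multiple writers per signal with statically assigned buffers.

```python
# Minimal double-buffer illustration of LET-style communication (not the SBP).
# The writer fills a shadow buffer during its LET interval; the value only
# becomes visible to readers at the writer's logical deadline ("publish"),
# so readers always observe data consistent with the LET timeline.
import threading

class LetSignal:
    def __init__(self, initial):
        self._buffers = [initial, initial]   # [visible, shadow]
        self._visible = 0
        self._lock = threading.Lock()

    def read(self):
        """Called by reader tasks at their logical release time."""
        with self._lock:
            return self._buffers[self._visible]

    def write(self, value):
        """Called by the writer task at any time during its execution window."""
        self._buffers[1 - self._visible] = value

    def publish(self):
        """Called by the runtime at the writer's logical deadline."""
        with self._lock:
            self._visible = 1 - self._visible

sig = LetSignal(initial=0)
sig.write(42)          # writer finishes early ...
print(sig.read())      # ... but readers still see 0 until the deadline
sig.publish()          # logical deadline reached: new value becomes visible
print(sig.read())      # now 42
```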
{"title":"Lightweight Semantics-Preserving Communication for Real-Time Automotive Software","authors":"Eugene Yip, Erjola Lalo, Gerald Lüttgen, A. Sailer","doi":"10.1109/MCSoC.2019.00059","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00059","url":null,"abstract":"The automotive industry is confronting the multi-core challenge, where legacy and modern software must run correctly and efficiently in parallel, by designing their software around the Logical Execution Time (LET) model. While such designs offer implementations that are platform independent and time predictable, task communications are assumed to complete instantaneously. Thus, it is critical to implement timely data transfers between LET tasks, which may be on different cores, in order to preserve a design's data-flow. In this paper, we develop a lightweight Static Buffering Protocol (SBP) that satisfies the LET communication semantics and supports signal-based communication with multiple signal writers. Our simulation-based evaluation with realistic industrial automotive benchmarks shows that the execution overhead of SBP is at most half that of the traditional Point-To-Point (PTP) communication method. Moreover, SBP needs on average 60% less buffer memory than PTP.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131022185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
An Efficient Implementation of a TAGE Branch Predictor for Soft Processors on FPGA
Katsunoshin Matsui, Md. Ashraful Islam, Kenji Kise
Soft processors are becoming a common component in reconfigurable computing on FPGAs. In some accelerators, custom logic functions are implemented as processing elements alongside the soft processor. Since FPGA resources are fixed and limited, it is desirable to implement the soft processor with as few logic resources as possible. One important part of a processor is the instruction fetch unit, whose performance depends on branch prediction. Conventional branch predictors such as bimodal or gshare are simple to implement, but their prediction accuracy is not good enough. The TAGE branch predictor, on the other hand, achieves better prediction accuracy but contains a complex logic path for prediction, which lowers the operating frequency. In this paper, we propose a branch predictor called pTAGE, which has almost the same prediction accuracy as TAGE while avoiding becoming the critical path of the processor. The branch prediction of pTAGE is pipelined, so a prediction result is available every clock cycle. We implement gshare, TAGE, and pTAGE in Verilog HDL and evaluate their operating frequency and prediction rate on an FPGA. The results show that pTAGE achieves almost the same prediction rate as TAGE and a 1.41 times higher operating frequency. We also evaluate performance while varying the latency of branch-prediction updates; the results show that pTAGE outperforms gshare in deeply pipelined processors.
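The TAGE scheme that pTAGE starts from is well documented: a bimodal base predictor plus several tagged tables indexed by hashes of geometrically increasing global-history lengths, where the longest-history tag match supplies the prediction. The Python model below is a heavily simplified behavioral sketch of that lookup and update path (no usefulness counters, realistic hashing, or pipelining); it only makes concrete the multi-table structure that pTAGE has to pipeline.

```python
# Simplified behavioral TAGE model: bimodal base + tagged tables with
# geometric history lengths. Omits usefulness bits and allocation heuristics.
class SimpleTAGE:
    def __init__(self, table_bits=10, hist_lengths=(4, 8, 16, 32)):
        self.size = 1 << table_bits
        self.base = [2] * self.size                      # 2-bit bimodal counters
        self.hist_lengths = hist_lengths
        # each tagged entry: [tag, 3-bit counter], weakly taken by default
        self.tables = [[[0, 4] for _ in range(self.size)] for _ in hist_lengths]
        self.ghist = 0                                   # global history register

    def _index(self, pc, length):
        h = self.ghist & ((1 << length) - 1)
        return (pc ^ h ^ (h >> 3)) % self.size           # toy index hash

    def _tag(self, pc, length):
        h = self.ghist & ((1 << length) - 1)
        return (pc ^ (h << 1)) & 0xFF                    # toy 8-bit tag

    def predict(self, pc):
        pred = self.base[pc % self.size] >= 2            # base prediction
        for i, L in enumerate(self.hist_lengths):        # longest match wins
            tag, ctr = self.tables[i][self._index(pc, L)]
            if tag == self._tag(pc, L):
                pred = ctr >= 4
        return pred

    def update(self, pc, taken):
        mispredicted = self.predict(pc) != taken
        b = pc % self.size
        self.base[b] = min(3, self.base[b] + 1) if taken else max(0, self.base[b] - 1)
        for i, L in enumerate(self.hist_lengths):
            idx = self._index(pc, L)
            entry = self.tables[i][idx]
            if entry[0] == self._tag(pc, L):
                entry[1] = min(7, entry[1] + 1) if taken else max(0, entry[1] - 1)
            elif mispredicted:                           # naive allocate-on-mispredict
                self.tables[i][idx] = [self._tag(pc, L), 4 if taken else 3]
                break
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFFFFFFF

bp = SimpleTAGE()
pattern = [True] * 7 + [False]        # loop branch: taken 7 times, then fall through
correct = 0
for rep in range(50):
    for t in pattern:
        correct += bp.predict(0x400123) == t
        bp.update(0x400123, t)
print(f"accuracy on a 7-taken/1-not-taken loop branch: {correct / (50 * 8):.2f}")
```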
{"title":"An Efficient Implementation of a TAGE Branch Predictor for Soft Processors on FPGA","authors":"Katsunoshin Matsui, Md. Ashraful Islam, Kenji Kise","doi":"10.1109/MCSoC.2019.00023","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00023","url":null,"abstract":"Soft processors are becoming a common component on reconfigurable computing like FPGA. For some accelerators, custom logic functions are implemented as processing elements besides the soft processor. Since the resources in FPGA are fixed and limited, it is desired to implement the soft processor with less logical resources as possible. One of the important parts of the processor is an instruction fetch unit whose performance is dependent on branch prediction. Conventional branch predictors like bimodal or gshare are simple to implement but their prediction accuracy is not good enough. On the other hand, TAGE branch predictor has better prediction accuracy but contains complex logic path for branch prediction, which results in the lower operating frequency. In this paper, we propose a branch predictor called pTAGE, which has almost the same prediction accuracy as TAGE and avoids becoming the critical path of the processor. The branch prediction of pTAGE is pipelined, so prediction result is available on each clock cycle. We implement gshare, TAGE, and pTAGE, respectively in Verilog HDL and evaluate their operating frequency and prediction rate based on FPGA implementation. In this result, pTAGE has almost the same prediction rate as TAGE and 1.41 times higher operating frequency than that of TAGE. Also, we evaluate the performance by varying the latency for updating branch prediction, and the evaluation result shows that pTAGE exhibits higher performance in deep pipelined processors than gshare.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"6 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks
Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara
A convolutional neural network (CNN) is one of the most successful neural network architectures and is widely used for embedded computer vision tasks. However, realizing a CNN requires a massive number of multiply-accumulate (MAC) operations with high power consumption, while modern tasks demand ever higher recognition accuracy. In this paper, we apply a sparsification technique to generate weak classifiers and build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control the sparse (zero-weight) ratio to obtain both good performance and better recognition accuracy. We use P sparse-weight CNNs with a dataflow pipeline architecture that hides the performance overhead of evaluating multiple CNNs in the ensemble, and we set an adequate sparsity ratio to balance the number of operation cycles in each stage. The resulting ensemble CNN depends on the dataset and has different layer configurations. We propose a universal convolution core that realizes variations of modern convolutional operations, and extend it to many cores with a pipelined architecture to achieve high throughput. While computing efficiency is poor on GPUs, which are unsuited to sparse convolution, our universal convolution cores realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal cores while recognition accuracy is improved compared to a conventional single-CNN realization. We implemented a prototype with many universal convolution cores on a Xilinx Kintex UltraScale+ FPGA; compared with a desktop-GPU realization of the ensemble, the proposed many-core accelerator for the ensemble sparse CNN is 3.09 times faster, consumes 4.20 times less power, and delivers 13.33 times better performance per watt. Thus, realizing the proposed ensemble method with many universal convolution cores achieves high-speed inference while improving recognition accuracy compared with a conventional dense-weight CNN on a desktop GPU.
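The ensemble idea in the abstract, several weak CNNs made cheap by zeroing a chosen fraction of their weights and then combined, can be illustrated independently of the FPGA architecture. The NumPy sketch below is such an illustration (random linear members rather than trained CNNs): each member is magnitude-pruned to a target sparsity and the members' softmax outputs are averaged.

```python
# Sketch of the ensemble-of-sparse-models idea: magnitude-prune each member to a
# target sparsity, then average the members' softmax outputs. Illustrative only;
# real members would be trained/fine-tuned CNNs, not random linear layers.
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_members, in_dim, n_classes, sparsity = 4, 256, 10, 0.8
x = rng.standard_normal(in_dim)

member_probs = []
for p in range(n_members):
    w = rng.standard_normal((n_classes, in_dim)) / np.sqrt(in_dim)
    w_sparse = prune_by_magnitude(w, sparsity)          # 80% of weights are zero
    member_probs.append(softmax(w_sparse @ x))          # cheap sparse member
ensemble = np.mean(member_probs, axis=0)                # combine P weak members
print("nonzero weight fraction per member:", 1.0 - sparsity)
print("ensemble prediction:", int(ensemble.argmax()))
```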
{"title":"Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks","authors":"Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara","doi":"10.1109/MCSoC.2019.00021","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00021","url":null,"abstract":"A convolutional neural network~(CNN) is one of the most successfully used neural networks and it is widely used for many embedded computer vision tasks. However, it requires a massive number of multiplication and accumulation (MAC) computations with high-power consumption to realize it, and higher recognition accuracy is desired for modern tasks. In the paper, we apply a sparseness technique to generate a weak classifier to build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control sparse (zero weight) ratio to make an excellent performance and better recognition accuracy. We use P sparse weight CNNs with a dataflow pipeline architecture that hides the performance overhead for multiple CNN evaluation on the ensemble CNN. We set an adequate sparse ratio to adjust the number of operation cycles in each stage. The proposed ensemble CNN depends on the dataset quality and it has different layer configurations. We propose a universal convolution core to realize variations of modern convolutional operations, and extend it to many cores with pipelining architecture to achieve high-throughput operation. Therefore, while computing efficiency is poor on GPUs which is unsuitable for a sparseness convolution, on our universal convolution cores can realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal covers while the recognition accuracy is improved compared to the conventional single CNN realization. We implemented the prototype of many universal convolution cores on the Xilinx Kintex UltraScale+ FPGA, and compared with the desktop GPU realization of the ensembling, the proposed many core based accelerator for the ensemble sparse CNN is 3.09 times faster, 4.20 times lower power, and 13.33 times better as for the performance per power. Therefore, by realizing the proposed ensemble method with many of universal convolution cores, a high-speed inference could be achieved while improving the recognition accuracy compared with the conventional dense weight CNN on the desktop GPU.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115681538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
FPGA/Python Co-Design for Lane Line Detection on a PYNQ-Z1 Board
Koki Honda, Kaijie Wei, H. Amano
This paper presents an implementation of lane line detection split between an FPGA and Python. Lane line detection consists of three functions: median blur, adaptive threshold, and the Hough transform. We implemented only the vote accumulation of the Hough transform on the FPGA. Although the Hough transform cannot be implemented directly on a low-end FPGA board, it fits after reducing the ρθ space. The rest of the pipeline was implemented with Python's NumPy, SciPy, and OpenCV; although this part was very easy to write, it did not become a bottleneck for the whole process. As a result, we achieved a 3.9x speedup over OpenCV while keeping the development cost down. When median blur and adaptive threshold were also implemented on the FPGA, the speedup reached 6.34x.
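Only the vote accumulation of the Hough transform was moved to the FPGA; the rest stayed in Python. The sketch below mirrors that split in pure Python/OpenCV with assumed parameter values and an assumed input file: OpenCV median blur and adaptive threshold, followed by a NumPy vote accumulator restricted to a reduced ρθ range, which is the piece the FPGA core would replace.

```python
# Lane-line pipeline sketch: OpenCV preprocessing + a NumPy Hough accumulator
# over a reduced rho-theta space (the accumulation is the part offloaded to the
# FPGA in the paper; parameter values here are illustrative assumptions).
import cv2
import numpy as np

def hough_accumulate(binary, thetas, rho_res=2):
    """Vote accumulation for line detection, restricted to the given thetas."""
    h, w = binary.shape
    diag = int(np.ceil(np.hypot(h, w)))
    n_rho = 2 * diag // rho_res + 1
    acc = np.zeros((n_rho, len(thetas)), dtype=np.int32)
    ys, xs = np.nonzero(binary)                       # foreground pixels
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    for t in range(len(thetas)):
        rho = (xs * cos_t[t] + ys * sin_t[t] + diag) / rho_res
        np.add.at(acc[:, t], rho.astype(np.int32), 1)  # vote
    return acc

img = cv2.imread("road.jpg", cv2.IMREAD_GRAYSCALE)     # assumed input file
if img is None:
    raise SystemExit("road.jpg not found")
blurred = cv2.medianBlur(img, 5)
binary = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY_INV, 11, 2)
# reduced theta range (illustrative) covering typical lane-line orientations
thetas = np.deg2rad(np.arange(20, 70, 1.0))
acc = hough_accumulate(binary, thetas)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
print("strongest line: rho bin", int(rho_idx), "theta",
      round(float(np.rad2deg(thetas[theta_idx])), 1), "deg")
```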
{"title":"FPGA/Python Co-Design for Lane Line Detection on a PYNQ-Z1 Board","authors":"Koki Honda, Kaijie Wei, H. Amano","doi":"10.1109/MCSoC.2019.00015","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00015","url":null,"abstract":"This paper presents the implementation of lane line detection on FPGA and Python. Lane line detection consists of three functions, median blur, adaptive threshold, and Hough transform. We implemented only accumulation of Hough transform on FPGA. Although the Hough transform cannot be implemented on a low-end FPGA board if implemented directly, by reducing ρθ space, it was successfully implemented on a low-end FPGA board. The rest of the Hough transform was implemented using Python's NumPy and SciPy, and OpenCV. Although it was very easy to write, it did not become a bottleneck for the whole process because of its effectiveness. As a result, we could achieve a 3.9x speedup compared to OpenCV and kept the developing cost down. When implementing median blur and adaptive threshold on an FPGA, we could achieve a 6.34x speedup.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116175906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Title Page I
{"title":"Title Page I","authors":"","doi":"10.1109/mcsoc.2019.00001","DOIUrl":"https://doi.org/10.1109/mcsoc.2019.00001","url":null,"abstract":"","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"127 13","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120818359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0