
Latest publications: 2016 Conference on Design and Architectures for Signal and Image Processing (DASIP)

Scalable HEVC decoder for mobile devices: Trade-off between energy consumption and quality
Pub Date : 2016-10-12 DOI: 10.1109/DASIP.2016.7853791
E. Raffin, W. Hamidouche, Erwan Nogues, M. Pelcat, D. Ménard
Scalable video coding offers a large choice of configurations when decoding a compressed video. A single encoded bitstream can be decoded in multiple modes, from a full video quality mode to different degraded video quality modes. In the bitstream, data is separated into layers, each layer containing the information relative to a quality level and depending on information from other layers. In the context of an energy constrained scalable video decoder executed on an embedded multicore platform, this paper investigates the energy consumption of an optimized decoder relative to the decoded layers and decoded video quality. These numbers show that a large set of trade-offs between energy and quality is offered by SHVC and can be used to precisely adapt the decoder to its energy constraints.
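The adaptation described in the abstract can be sketched as a simple budget-driven layer selector: decode the highest-quality layer whose measured energy cost still fits the budget. The per-layer energy numbers and the function name below are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch: pick the highest SHVC layer affordable under an
# energy budget. Layer costs are assumed to grow with quality level.

def select_layer(energy_per_frame, budget_per_frame):
    """Return the index of the highest-quality affordable layer, or None."""
    best = None
    for layer, cost in enumerate(energy_per_frame):  # low -> high quality
        if cost <= budget_per_frame:
            best = layer
    return best

# Example: base layer + two enhancement layers (millijoules per frame).
costs = [4.0, 6.5, 9.8]
print(select_layer(costs, 7.0))  # -> 1 (base + first enhancement layer)
```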
Pages: 18-25
Citations: 0
Demo abstract: FPGA-based implementation of a flexible FFT dedicated to LTE standard
Pub Date : 2016-10-12 DOI: 10.1109/DASIP.2016.7853833
M. Tran, E. Casseau, M. Gautier
Field Programmable Gate Array (FPGA) technology is expected to play a key role in the development of Software Defined Radio platforms. To reduce the design time required when targeting such a technology, high-level synthesis tools can be used. These tools are available in current FPGA CAD tools. In this demo, we will present the design of an FFT component for the Long Term Evolution standard and its implementation on a Xilinx Virtex 6 based ML605 board. Our flexible FFT supports FFT sizes of 128, 256, 512, 1024, 1536 and 2048 to compute OFDM symbols. The FFT is specified at a high level (i.e., in C). Both dynamic partial reconfiguration and run-time configuration based on input control signals of the flexible FFT will be shown. These two approaches provide an interesting trade-off between reconfiguration time and area.
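The run-time configuration of such a flexible FFT amounts to choosing a transform size per LTE channel bandwidth. A minimal sketch of that mapping, following the standard LTE numerology (15 kHz subcarrier spacing); the function name is ours:

```python
# LTE channel bandwidth (MHz) -> OFDM FFT size, per the standard numerology.
LTE_FFT_SIZES = {1.4: 128, 3: 256, 5: 512, 10: 1024, 15: 1536, 20: 2048}

def fft_size_for_bandwidth(mhz):
    """Return the FFT size for a given LTE channel bandwidth in MHz."""
    try:
        return LTE_FFT_SIZES[mhz]
    except KeyError:
        raise ValueError(f"unsupported LTE bandwidth: {mhz} MHz")

print(fft_size_for_bandwidth(10))  # -> 1024
```

Note that 1536 (for the 15 MHz bandwidth) is not a power of two, which is what makes a flexible, mixed-size FFT core necessary in the first place.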
Pages: 241-242
Citations: 5
Demo: Ker-ONE: Embedded virtualization approach with dynamic reconfigurable accelerators management
Pub Date : 2016-10-12 DOI: 10.1109/DASIP.2016.7853825
Tian Xia, Mohamad-Al-Fadl Rihani, Jean-Christophe Prévotet, F. Nouvel
Today, the CPU-FPGA hybrid architecture has become more and more popular in embedded systems. In this approach, the CPU and FPGA domains are tightly connected by dedicated interconnections, which makes it possible to enhance traditional CPU virtualization with dynamic partial reconfiguration (DPR) technology on the FPGA. Our research proposes an innovative approach, Ker-ONE, which provides a lightweight micro-kernel to support real-time virtualization. In addition, it provides an abstract and transparent layer for virtual machines (VM) to access reconfigurable accelerators. In this demo, the proposed framework is implemented on an ARM-FPGA platform, and the mechanism of real-time scheduling/allocation is presented in detail via a GUI demonstration. We show that our approach achieves a high level of performance with low overheads.
Pages: 225-226
Citations: 3
Demo: Overlay architectures for heterogeneous FPGA cluster management
Pub Date : 2016-10-12 DOI: 10.1109/DASIP.2016.7853832
Théotime Bollengier, M. Najem, Jean-Christophe Le Lann, Loïc Lagadec
Overlays are reconfigurable architectures synthesized on commercial off-the-shelf (COTS) FPGAs. Overlays bring advantages such as portability, resource abstraction and fast configuration, and can exhibit features independent from the host FPGA. We designed a fine-grained overlay implementing novel features that ease the management of such architectures in a cluster of heterogeneous COTS FPGAs. This demonstration shows the use of this overlay in an FPGA cluster, performing a live migration of a hardware application between two nodes of the cluster. It also illustrates the fault tolerance of the cluster.
Pages: 239-240
Citations: 2
Associative Memory based on clustered Neural Networks: Improved model and architecture for Oriented Edge Detection
Pub Date : 2016-10-10 DOI: 10.1109/DASIP.2016.7853796
R. Danilo, Hugues Wouafo, C. Chavet, Vincent Gripon, L. Conde-Canencia, P. Coussy
Associative Memories (AM) are storage devices that allow content to be addressed from part of it, in contrast to classical index-based memories. This property makes them promising candidates for various search challenges, including pattern detection in images. Cluster-based Neural Networks (CbNN) allow efficient design of AM by providing fast pattern retrieval, especially when implemented in hardware. In particular, they can be used to store and then quickly identify oriented edges in images. However, current models of CbNN only provide good performance when facing erasures in the inputs. This paper introduces several improvements to the CbNN model in order to cope with intrusion and additive noises. Namely, we change the initialization of neurons to account for precise information based on Euclidean distance. We also update the activation rules accordingly, resulting in efficient handling of various types of input noise. To complete this paper, associated hardware architectures are presented along with the proposed computation models, and these are compared with the existing CbNN implementation. Synthesis results show that, among them, several divide the cost of that implementation by 3 while increasing the maximal frequency by 25%.
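A hedged sketch of the distance-based initialization idea: score each neuron of a cluster by its Euclidean distance to the corresponding input fragment and keep the closest neuron per cluster. The shapes, names and the winner-take-all reading below are our assumptions for illustration, not the paper's exact model.

```python
import numpy as np

# Toy distance-based retrieval: one winner neuron per cluster, chosen by
# Euclidean distance to the (possibly noisy) input fragment.

def activate(clusters, x):
    """clusters: (n_clusters, n_neurons, dim) stored patterns;
    x: (n_clusters, dim) input fragments. Returns winner index per cluster."""
    d = np.linalg.norm(clusters - x[:, None, :], axis=2)  # per-neuron distance
    return d.argmin(axis=1)                               # winner-take-all

rng = np.random.default_rng(0)
patterns = rng.normal(size=(4, 8, 16))                    # 4 clusters, 8 neurons
noisy = patterns[:, 3, :] + 0.05 * rng.normal(size=(4, 16))
print(activate(patterns, noisy))  # recovers neuron 3 in each cluster
```

Unlike erasure-only matching, a distance score degrades gracefully under additive noise, which is the behaviour the abstract's improvements target.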
Pages: 51-58
Citations: 3
Estimating encoding complexity of a real-time embedded software HEVC codec
Pub Date : 2016-10-01 DOI: 10.1109/DASIP.2016.7853792
Alexandre Mercat, W. Hamidouche, M. Pelcat, D. Ménard
The High Efficiency Video Coding (HEVC) standard provides up to 40% bitrate savings compared to the state-of-the-art H.264/AVC standard for the same perceptual video quality. Power consumption constraints represent a serious challenge for embedded applications based on a software design. A large number of systems are likely to integrate the HEVC codec in the long run and will need to be energy aware. In this context, we carry out a complexity study of the HEVC coding tree encoding process. This study shows that the complexity of encoding a Coding Unit (CU) of a given size has a nontrivial probability density shape and thus can hardly be predicted with accuracy. However, we propose a model that linearly links the ratios between the complexities of coarse-grain and lower-grain CU encodings, with a precision error under 6%. This model is valid for a wide range of video contents coded in Intra configurations at different bitrates. This information is useful for controlling encoder energy during the encoding process on battery-limited devices.
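The linear link between coarse-grain and lower-grain CU complexities can be illustrated by fitting a single ratio with least squares. The data, the function name and the 1.8 ratio below are synthetic placeholders, not the paper's measurements.

```python
import numpy as np

# Fit a single slope a in children ~ a * parent (no intercept), the
# simplest instance of a linear complexity-ratio model.

def fit_ratio(parent, children):
    """Least-squares slope for children ~ a * parent."""
    parent = np.asarray(parent, dtype=float)
    children = np.asarray(children, dtype=float)
    return float(parent @ children / (parent @ parent))

parent = np.array([10.0, 20.0, 30.0, 40.0])            # coarse-grain costs
children = 1.8 * parent + np.array([0.2, -0.1, 0.3, -0.2])  # noisy ratio 1.8
print(round(fit_ratio(parent, children), 2))  # -> 1.8
```

Once such a ratio is known, the cost of exploring finer CU sizes can be predicted from the coarse-grain pass alone, which is what makes the model usable for on-line energy control.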
Pages: 26-33
Citations: 1
Batched Cholesky factorization for tiny matrices
Pub Date : 2016-10-01 DOI: 10.1109/DASIP.2016.7853809
F. Lemaitre, L. Lacassagne
Many linear algebra libraries, such as the Intel MKL, Magma or Eigen, provide fast Cholesky factorization. These libraries are suited to big matrices but perform slowly on small ones. Even though state-of-the-art studies have begun to take an interest in small matrices, they usually feature a few hundred rows. Fields like Computer Vision or High Energy Physics use tiny matrices. In this paper we show that it is possible to speed up the Cholesky factorization for tiny matrices by grouping them in batches and using highly specialized code. We provide High Level Transformations that accelerate the factorization for current Intel SIMD architectures (SSE, AVX2, KNC, AVX512). Combining these transformations with SIMD, we achieve a speedup from 13× to 31× for the whole resolution compared to the naive code on a single-core AVX2 machine, and a speedup from 15× to 33× with multithreading compared to the multithreaded naive code.
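The batching idea translates directly to NumPy: unroll the Cholesky recurrence for one fixed tiny size and apply it to a whole batch at once, so the per-matrix loop overhead disappears. A sketch for 3×3 SPD matrices; it mirrors the spirit of the paper's SIMD code, not its implementation.

```python
import numpy as np

def batched_cholesky_3x3(A):
    """A: (n, 3, 3) SPD matrices. Returns lower-triangular factors (n, 3, 3)."""
    L = np.zeros_like(A)
    # Fully unrolled recurrence: each scalar step runs over the whole batch.
    L[:, 0, 0] = np.sqrt(A[:, 0, 0])
    L[:, 1, 0] = A[:, 1, 0] / L[:, 0, 0]
    L[:, 2, 0] = A[:, 2, 0] / L[:, 0, 0]
    L[:, 1, 1] = np.sqrt(A[:, 1, 1] - L[:, 1, 0] ** 2)
    L[:, 2, 1] = (A[:, 2, 1] - L[:, 2, 0] * L[:, 1, 0]) / L[:, 1, 1]
    L[:, 2, 2] = np.sqrt(A[:, 2, 2] - L[:, 2, 0] ** 2 - L[:, 2, 1] ** 2)
    return L

rng = np.random.default_rng(1)
M = rng.normal(size=(1000, 3, 3))
A = M @ M.transpose(0, 2, 1) + 3 * np.eye(3)      # make the batch SPD
L = batched_cholesky_3x3(A)
print(np.allclose(L @ L.transpose(0, 2, 1), A))   # -> True
```

The unrolled form has no branches and no indexed inner loop, which is exactly what lets a compiler (or, here, NumPy's vectorized operations) process many matrices per instruction.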
Pages: 130-137
Citations: 7
Monte Carlo method based precision analysis of deep convolution nets
Pub Date : 2016-10-01 DOI: 10.1109/DASIP.2016.7853814
Robert Krutsch, S. Naidu
Convolutional Neural Networks today provide the best results for many image detection and image recognition problems. The accuracy increases of the past years have been obtained through an increase in the structural complexity and the number of parameters of deep networks. Memory bandwidth and power consumption constraints limit the deployment of such state-of-the-art architectures in low-power embedded applications. Reduced coefficient bit depth is one of the most frequently used approaches to bring deep learning neural networks into low-power embedded hardware accelerators. In this paper we propose a reduced-precision, fixed-point implementation that can reduce bandwidth and power consumption significantly. The results show that with an 8-bit representation for more than 64% of the parameters, less than 0.5% accuracy is lost. As expected, the error resilience varies from layer to layer and from convolution kernel to convolution kernel. To cope with this variability and understand which parameters need which precision, we have developed a Monte Carlo simulation tool that explores the decision space.
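A minimal sketch of the Monte Carlo exploration idea: quantize a layer's weights to a candidate bit width and average the output error over random inputs, repeating per layer and per bit width to map which parameters tolerate fewer bits. The quantizer and the toy dense "layer" below are our assumptions, not the paper's tool.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric fixed-point quantization to the given bit width."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def mc_error(w, bits, trials=200):
    """Mean output error of the quantized layer over random inputs."""
    rng = np.random.default_rng(0)
    wq = quantize(w, bits)
    total = 0.0
    for _ in range(trials):
        x = rng.normal(size=w.shape[1])
        total += np.abs(w @ x - wq @ x).mean()
    return total / trials

rng = np.random.default_rng(2)
w = rng.normal(size=(16, 64))                  # toy dense-layer weights
print(mc_error(w, 8) < mc_error(w, 4))         # -> True: more bits, less error
```

Sweeping `bits` per layer and comparing the resulting error curves is the decision-space exploration the abstract refers to.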
Pages: 162-167
Citations: 1
A pipelined multi-softcore approach for the HOG algorithm
Pub Date : 2016-10-01 DOI: 10.1109/DASIP.2016.7853811
J. A. Holanda, João MP Cardoso, E. Marques
This paper describes the mapping and acceleration of an object detection algorithm on an FPGA-based multiprocessor system. We use HOG (Histogram of Oriented Gradients), one of the most popular algorithms for the detection of different classes of objects, currently used in smart embedded systems. The use of HOG on such systems requires efficient implementations in order to provide high performance, possibly with low energy/power consumption budgets. Also, as variations and adaptations of this algorithm are needed to deal with different scenarios and classes of objects, programmability is required to allow greater development flexibility. In this paper we show our approach to implementing the HOG algorithm on a multi-softcore Nios II-based system, bearing in mind high-performance and programmability issues. By applying source-to-source transformations we obtain speedups of 19×, and by using pipelined processing we reduce the algorithm's execution time by 49×. We also show that improving the hardware with acceleration units can result in speedups of 72.4× compared to the embedded baseline application.
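The HOG core step being accelerated — per-pixel gradient orientation accumulated into per-cell histograms — can be sketched as follows. This is toy reference code for the algorithm itself, unrelated to the paper's softcore implementation.

```python
import numpy as np

def cell_histogram(cell, bins=9):
    """cell: (h, w) grayscale patch. Returns a 'bins'-bin orientation histogram."""
    gy, gx = np.gradient(cell.astype(float))            # row and column gradients
    mag = np.hypot(gx, gy)                              # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180          # unsigned orientation
    idx = (ang / (180 / bins)).astype(int) % bins       # bin per pixel
    hist = np.zeros(bins)
    np.add.at(hist, idx.ravel(), mag.ravel())           # magnitude-weighted vote
    return hist

# A vertical edge concentrates energy in the 0-degree (horizontal-gradient) bin.
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
print(cell_histogram(patch).argmax())  # -> 0
```

In the full descriptor these per-cell histograms are block-normalized and concatenated; it is this regular, data-parallel structure that makes HOG a good fit for the pipelined multi-softcore mapping the paper proposes.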
Pages: 146-153
Citations: 2
Special session 1 automotive parallel computing challenges - architectures, applications and tricks
Pub Date : 2016-10-01 DOI: 10.1109/DASIP.2016.7853813
W. Stechele, T. Kryjak, L. Lacassagne, D. Houzet, M. Danek
The focus of this special session is on computational challenges and solutions related to automotive parallel computing. The five papers cover aspects of machine learning, FPGA-based hardware acceleration, memory optimization, and multi-core systems. Application areas include image and radar processing, as well as AUTOSAR applications.
Pages: 161
Citations: 0