
2019 Conference on Design and Architectures for Signal and Image Processing (DASIP): Latest Papers

Mapping and Frequency Joint Optimization for Energy Efficient Execution of Multiple Applications on Multicore Systems
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049177
Simei Yang, S. L. Nours, M. M. Real, S. Pillement
Run-time resource managers are essential components for optimizing energy consumption in cluster-based multicore architectures. However, with the ever-increasing number of functionalities supported by these architectures, it is also necessary to optimize the usage of processing resources while guaranteeing that applications' timing constraints are met. In this paper, we present a new run-time management strategy that combines processing resource allocation and frequency tuning to optimize cluster energy consumption when multiple applications execute concurrently. The proposed hybrid allocation process minimizes the number of processing cores used while meeting the latency constraint of each application. This approach offers a good trade-off between efficiency and complexity. The achieved energy saving is demonstrated through various case studies with different sets of active applications. Results show an improvement in energy saving of up to 206% compared to the literature.
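The frequency-tuning half of such a strategy can be illustrated with a generic step: pick the lowest available frequency level that still meets a latency constraint. This is a minimal sketch under stated assumptions, not the authors' hybrid allocation process. The function name, the cycle-count workload model, and the frequency levels are all hypothetical.

```python
def lowest_feasible_frequency(cycles, latency_s, levels_hz):
    """Return the lowest frequency that finishes `cycles` within `latency_s`.

    Assumption: execution time is modelled as cycles / frequency.
    Returns None when no available level meets the constraint.
    """
    for freq in sorted(levels_hz):
        if cycles / freq <= latency_s:
            return freq
    return None


# Hypothetical workload: 2 million cycles with a 4 ms latency constraint.
chosen = lowest_feasible_frequency(2_000_000, 0.004, [200e6, 600e6, 1000e6])
```

Under the common simplification that dynamic energy per cycle grows with the square of the frequency (supply voltage scaled along with frequency), the lowest feasible level is also the most energy-efficient one, which is why the search runs in ascending order.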
Citations: 2
Speeding-up CNN inference through dimensionality reduction
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049204
Lucas Fernández Brillet, N. Leclaire, S. Mancini, Sébastien Cleyet-Merle, M. Nicolas, Jean-Paul Henriques, C. Delnondedieu
The computational complexity of CNNs makes their integration into embedded systems with low power consumption requirements a challenging task, one that requires the joint design and adaptation of hardware and algorithms. In this paper, we propose a new general CNN compression method that reduces both the number of parameters and the number of operations. The method is applied to a binary face detection network, which is then implemented and evaluated on hardware.
Citations: 1
Using Time-of-Flight Sensors for People Counting Applications
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049169
Michal Stec, Viktor Herrmann, B. Stabernack
Precisely detecting and counting people who use public transportation is one of the key methods for predicting and planning the efficient use of buses, trams and trains. Providing an effective, well-planned public transportation service is important not only for economic reasons. It also helps to tackle a variety of environmental problems and contributes to reducing traffic congestion in urban areas. Several such systems have been developed in the past, but they were not sufficiently precise. In most cases, these systems rely on processing data generated by one particular type of 2D image sensor. In this paper we present a robust people counting application that runs on embedded systems with reasonable computational power requirements and relies on processing 3D data generated by a Time-of-Flight (ToF) sensor. Processing time-of-flight data requires several preprocessing steps, which are crucial for the subsequent people detection, tracking and counting algorithms. The influence of these preprocessing steps on the developed detection algorithm is presented. Methods for avoiding misinterpretations by the detection algorithms are discussed. A detailed description of the core algorithms developed to process 3D data is provided. An overview is given of how this method could be further enhanced to detect and differentiate vital and non-vital objects.
Citations: 5
FPGA-Based Acceleration of Expectation Maximization Algorithm Using High-Level Synthesis
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049183
M. A. Momen, Mohammed A. S. Khalid, Mohammad Abdul Moin Oninda
Expectation Maximization (EM) is a soft clustering algorithm that iteratively partitions data into M clusters. It is one of the most popular data mining algorithms, uses Gaussian Mixture Models (GMM) for probability density modeling, and is widely used in applications such as signal processing and Machine Learning (ML). EM requires high computation time when dealing with large data sets. This paper presents an optimized implementation of the EM algorithm on Stratix V and Arria 10 FPGAs using the Intel FPGA Software Development Kit (SDK) for Open Computing Language (OpenCL). A comparison of performance and power consumption between Central Processing Unit (CPU), Graphics Processing Unit (GPU) and FPGA is presented for various dimensions and cluster sizes. Compared to an Intel® Xeon® CPU E5-2637, our fully optimized OpenCL model for EM targeting the Arria 10 FPGA achieved up to 1000x speedup in terms of throughput (T) and 5395x speedup in terms of throughput per unit of power consumed (T/P). Compared to previous research on EM-GMM implementation on GPUs, the Arria 10 FPGA obtained up to 64.74x speedup (T) and 486.78x speedup (T/P).
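To make the iterative soft clustering concrete, the following is a minimal textbook sketch of EM for a one-dimensional Gaussian mixture in plain Python. It is not the paper's OpenCL kernel. The function name, the unit initial variances, the uniform initial weights, and the variance floor are assumptions chosen for illustration.

```python
import math


def em_gmm_1d(data, means, iters=50):
    """Fit a 1-D Gaussian mixture with EM; `means` holds the initial cluster means."""
    m = len(means)
    means = list(means)
    variances = [1.0] * m      # assumed initial variances
    weights = [1.0 / m] * m    # assumed uniform initial mixing weights
    for _ in range(iters):
        # E-step: responsibility of each cluster for each sample (soft assignment).
        resp = []
        for x in data:
            dens = [
                w * math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
                for w, mu, var in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(m):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances


# Two well-separated groups of samples, M = 2 clusters.
weights, means, variances = em_gmm_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], [0.0, 6.0])
```

Every sample contributes to every cluster in both steps, so the work per iteration grows with the product of data size and cluster count. That regular, data-parallel structure is the kind of workload the abstract reports accelerating.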
Citations: 2
A New Real-Time Embedded Video Denoising Algorithm
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049189
Andrea Petreto, Thomas Romera, F. Lemaitre, I. Masliah, B. Gaillard, Manuel Bouyer, Quentin L. Meunier, L. Lacassagne
Many embedded applications rely on video processing or on video visualization. Noisy video is thus a major issue for such applications. However, video denoising requires a lot of computational effort, and most state-of-the-art algorithms cannot run in real time at camera frame rate. This article introduces RTE-VD, a new real-time video denoising algorithm for embedded platforms. We first compare its denoising capabilities with other online and offline algorithms. We show that RTE-VD achieves real-time performance (25 frames per second) for qHD video (960×540 pixels) on embedded CPUs and that the output image quality is comparable to state-of-the-art algorithms. To reach real-time denoising, we applied several high-level transforms and optimizations (SIMDization, multi-core parallelization, operator fusion and pipelining). We study the relation between computation time and power consumption on several embedded CPUs and show that it is possible to determine different frequency and core configurations that minimize either the computation time or the energy.
Citations: 5
POLYCiNN: Multiclass Binary Inference Engine using Convolutional Decision Forests
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049176
A. Abdelsalam, A. Elsheikh, J. David, Pierre Langlois
Convolutional Neural Networks (CNNs) have achieved significant success in image classification. One of the main reasons that CNNs achieve state-of-the-art accuracy is their use of many multi-scale learnable windowed feature detectors called kernels. Fetching kernel feature weights from memory and performing the associated multiply and accumulate computations consume a massive amount of energy. This hinders the widespread usage of CNNs, especially in embedded devices. In comparison with CNNs, decision forests are computationally efficient since they are composed of decision trees, which are binary classifiers by nature and can be implemented using AND-OR gates instead of costly multiply and accumulate units. In this paper, we investigate the migration of CNNs to decision forests as a promising approach for reducing both execution time and power consumption while achieving acceptable accuracy. We introduce POLYCiNN, an architecture composed of a stack of decision forests. Each decision forest classifies one of the overlapping sub-images of the original image. Then, all decision forest classifications are fused together to classify the input image. In POLYCiNN, each decision tree is implemented in a single 6-input Look-Up Table and requires no memory access. Therefore, POLYCiNN can be efficiently mapped to simple and densely parallel hardware designs. We validate the performance of POLYCiNN on the benchmark image classification tasks of the MNIST, CIFAR-10 and SVHN datasets.
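The idea that a decision tree fits in a single 6-input Look-Up Table can be sketched in software: any Boolean function of six binary inputs is fully described by a 64-entry truth table, so evaluation becomes a single indexed lookup. The tree below is a hypothetical example, not one taken from POLYCiNN, and the function names are illustrative assumptions.

```python
def tree_predict(bits):
    """Hypothetical depth-3 decision tree over six binary features."""
    if bits[0]:
        return bool(bits[1] and not bits[4])
    if bits[2]:
        return bool(bits[3] or bits[5])
    return False


def build_lut(fn, n_inputs=6):
    """Enumerate all 2**n_inputs input patterns into one integer truth table."""
    table = 0
    for index in range(1 << n_inputs):
        bits = [bool((index >> i) & 1) for i in range(n_inputs)]
        if fn(bits):
            table |= 1 << index
    return table


def lut_eval(table, bits):
    """Evaluate the tree with a single table lookup, as a hardware LUT would."""
    index = sum(int(b) << i for i, b in enumerate(bits))
    return bool((table >> index) & 1)


lut = build_lut(tree_predict)  # 64-bit constant, analogous to a LUT initialisation value
```

Once the table is built, inference needs no comparisons, multiplications, or weight fetches, which matches the abstract's point about avoiding memory access.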
Citations: 0
Hybrid Prototyping Methodology for Rapid System Validation in HW/SW Co-Design
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049195
Arief Wicaksana, A. Charif, Caaliph Andriamisaina, N. Ventroux
As System-on-Chip (SoC) complexity increases, hardware/software co-design plays an important role in improving design productivity, reducing time to market, and optimizing the overall results. Consequently, there is high interest in providing rapid system validation in such a paradigm to achieve these objectives. Previous work offers prototyping techniques tied to the development phase. FPGA-based prototyping enables HW/SW integration and system validation once the Register Transfer Level (RTL) implementation is available, while virtual platforms accelerate software development with higher-level functional models, e.g. Transaction Level Modeling (TLM). In this paper, we propose a hybrid prototyping methodology that takes advantage of both virtual and FPGA-based prototyping in a single framework. We aim to provide a rapid and flexible system validation solution for HW/SW co-design at various stages of development, based on the availability of TLM and RTL implementations. The proposed methodology allows online and offline performance analysis and debugging for early feedback in HW/SW architecture exploration. It was evaluated experimentally with a neural network processor as a case study.
Citations: 5
Real-Time Implementation of Adaptive Correlation Filter Tracking for 4K Video Stream in Zynq UltraScale+ MPSoC
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049203
M. Kowalczyk, Dominika Przewlocka, T. Kryjak
In this paper, a hardware-software implementation of adaptive correlation filter tracking for a 3840 × 2160 @ 60 fps video stream in a Zynq UltraScale+ MPSoC is discussed. Correlation filters have gained popularity in recent years because of their efficiency and good results in the VOT (Visual Object Tracking) challenge. An implementation of the MOSSE (Minimum Output Sum of Squared Error) algorithm is presented. It uses a 2-dimensional FFT to compute the correlation and updates the filter coefficients in every frame. The initial filter coefficients are computed on the ARM processor in the PS (Processing System), while all other operations are performed in the PL (Programmable Logic). The presented architecture is described in the Verilog hardware description language.
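The two operations named above, FFT-based correlation and a per-frame coefficient update, can be sketched in NumPy as follows. This is a software illustration of the standard MOSSE formulation, not the paper's Verilog architecture. The class name, the Gaussian target width, the learning rate, and the regularisation constant are assumptions.

```python
import numpy as np


def gaussian_target(shape, sigma=2.0):
    """Desired correlation output: a Gaussian peak at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))


class MosseFilter:
    """Minimal MOSSE-style filter kept as a numerator/denominator pair in the frequency domain."""

    def __init__(self, patch, lr=0.125, eps=1e-3):
        self.lr, self.eps = lr, eps          # assumed learning rate and regulariser
        self.G = np.fft.fft2(gaussian_target(patch.shape))
        F = np.fft.fft2(patch)
        self.A = self.G * np.conj(F)         # numerator of the conjugate filter
        self.B = F * np.conj(F)              # denominator (energy spectrum of the patch)

    def respond(self, patch):
        """Correlate a new patch with the filter; the peak marks the target position."""
        H_conj = self.A / (self.B + self.eps)
        return np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))

    def update(self, patch):
        """Running-average update of the filter coefficients for the current frame."""
        F = np.fft.fft2(patch)
        self.A = self.lr * self.G * np.conj(F) + (1 - self.lr) * self.A
        self.B = self.lr * F * np.conj(F) + (1 - self.lr) * self.B


rng = np.random.default_rng(0)
first = rng.standard_normal((16, 16))
tracker = MosseFilter(first)
response = tracker.respond(np.roll(first, (2, 3), axis=(0, 1)))
peak = np.unravel_index(np.argmax(response), response.shape)
```

In this sketch the response peak moves away from the patch centre by the same offset as the target, which is how the tracker estimates displacement between frames.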
Citations: 3
Run-Time Coarse-Grained Hardware Mitigation for Multiple Faults on VLIW Processors
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049194
Rafail Psiakis, A. Kritikakou, O. Sentieys, E. Casseau
As transistors scale down, processors become more vulnerable to radiation, which can cause multiple transient faults in function units. Rather than excluding the affected units from execution, the performance overhead of VLIW processors can be reduced by continuing to use their fault-free components. In the proposed approach, the function units are enhanced with coarse-grained fault detectors. Instructions are re-scheduled at run-time to use not only the healthy function units, but also the fault-free components of the faulty function units. The scheduling window of the proposed mechanism spans two instruction bundles, so mitigation solutions can be explored in both the current and the next instruction execution. Experiments show that the proposed approach can mitigate a large number of faults with low performance and area overheads.
Citations: 0
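The run-time rescheduling idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' hardware mechanism: the `FunctionUnit` class, the component names (`"alu"`, `"mul"`), the greedy placement policy, and the choice to ignore data dependencies between operations are all assumptions made for illustration. The sketch only shows how a two-bundle window lets an operation that cannot be placed in the current bundle spill into the next one, while partially faulty units keep serving the operations their healthy components support.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionUnit:
    """Hypothetical function unit; `healthy` holds its fault-free components."""
    name: str
    healthy: set = field(default_factory=set)


def schedule_bundle(ops, units):
    """Greedily place each op on a free unit whose required component is healthy.

    ops is a list of (op_name, required_component) tuples.
    Returns (placement, deferred): placement maps unit name -> op name,
    deferred lists the ops that could not be placed in this bundle.
    """
    placement, deferred = {}, []
    for op_name, component in ops:
        for unit in units:
            if unit.name not in placement and component in unit.healthy:
                placement[unit.name] = op_name
                break
        else:
            # No free unit with a healthy copy of the component: defer the op.
            deferred.append((op_name, component))
    return placement, deferred


def reschedule_window(current, nxt, units):
    """Schedule a two-bundle window: ops deferred from `current` go first in `nxt`."""
    first, spill = schedule_bundle(current, units)
    second, leftover = schedule_bundle(spill + nxt, units)
    return first, second, leftover


# Assumed fault state: FU1 has lost its multiplier, FU2 is entirely faulty.
units = [
    FunctionUnit("FU0", {"alu", "mul"}),
    FunctionUnit("FU1", {"alu"}),
    FunctionUnit("FU2", set()),
]
current_bundle = [("mul_a", "mul"), ("mul_b", "mul"), ("add_c", "alu")]
next_bundle = [("add_d", "alu")]

first, second, leftover = reschedule_window(current_bundle, next_bundle, units)
print(first)     # placement for the current bundle
print(second)    # placement for the next bundle
print(leftover)  # ops that still do not fit within the two-bundle window
```

In this example the partially faulty `FU1` still executes additions, and the second multiplication moves to the next bundle instead of stalling the whole bundle. A greedy policy is not optimal in general (an addition placed first can occupy the only unit with a healthy multiplier), so a real scheduler would need a smarter assignment.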
Distilling the knowledge in CNN for WCE screening tool
Pub Date : 2019-10-01 DOI: 10.1109/DASIP48288.2019.9049201
Thomas Garbay, Orlando Chuquimia, A. Pinna, H. Sahbi, X. Dray, B. Granado
Screening is one way to improve the early detection of colorectal cancer. Polyps are a marker of colorectal cancer, and imaging is the best modality for detecting them. Wireless Capsule Endoscopy (WCE), introduced in 2003, opened the way to integrating automatic image processing into a screening tool. Many scientific studies have shown that Convolutional Neural Networks (CNNs) can detect polyps, but integrating these networks remains an issue. In this article, we present our work on integrating a CNN, or CNN-based image processing, inside a WCE to build a powerful screening tool. We apply the knowledge distillation method and demonstrate that distillation from VGG16 to SqueezeNet is effective in the polyp detection context.
Citations: 2
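The knowledge distillation method named in the abstract above trains a small student network to match the softened outputs of a large teacher. The abstract does not give the loss the authors used, so the sketch below assumes the common formulation: a temperature-scaled KL divergence between teacher and student outputs, blended with the usual cross-entropy on the true label. The function names, the two-class "polyp / no polyp" setup, and the values of `temperature` and `alpha` are illustrative assumptions, and plain Python lists stand in for the logits that VGG16 and SqueezeNet would produce.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target term (teacher vs. student) with a hard-label term.

    The soft term is KL(teacher || student) at the given temperature, scaled
    by temperature**2 so its gradient magnitude stays comparable to the hard
    term. alpha weights the soft term; (1 - alpha) weights the cross-entropy.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    hard = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Assumed two-class example: index 0 = "polyp", index 1 = "no polyp".
teacher = [2.0, 0.0]
agreeing_student = [2.0, 0.0]
disagreeing_student = [0.0, 2.0]

print(distillation_loss(agreeing_student, teacher, label=0))
print(distillation_loss(disagreeing_student, teacher, label=0))
```

A student that reproduces the teacher's logits incurs no soft-target penalty, so its loss reduces to the weighted cross-entropy, while a student that disagrees with the teacher is penalized by both terms.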