
Journal of Signal Processing Systems for Signal Image and Video Technology: Latest Articles

Video Compression for Screen Recorded Sequences Following Eye Movements.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2021-01-01; Epub Date: 2021-11-23; DOI: 10.1007/s11265-021-01719-2
Diego Jesus Serrano-Carrasco, Antonio Jesus Diaz-Honrubia, Pedro Cuenca

With the advent of smartphones and tablets, video traffic on the Internet has increased enormously. With this in mind, the High Efficiency Video Coding (HEVC) standard was released in 2013 with the aim of reducing the bit rate (at the same quality) by 50% with respect to its predecessor. However, new content with higher resolutions and more demanding requirements appears every day, making it necessary to reduce the bit rate further. Perceptual video coding has recently been recognized as a promising approach to achieving high-performance video compression, and eye tracking data can be used to create and verify perceptual models. In this paper, we present a new algorithm for reducing the bit rate of screen recorded sequences based on the visual perception of videos. An eye tracking system is used during the recording to locate the fixation point of the viewer. The area around that point is then encoded with the base quantization parameter (QP) value, and the QP increases with distance from the fixation point. The results show that up to 31.3% of the bit rate may be saved compared with the original HEVC-encoded sequence, without a significant impact on the perceived quality.
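
The idea of a QP that grows with distance from the fixation point can be illustrated with a small sketch. This is not the authors' algorithm: the 64x64 block size, the linear one-QP-per-block ramp, the `radius` of untouched blocks, and the function name `qp_map` are all assumptions chosen for illustration. Only the base-QP-at-fixation behaviour comes from the abstract, and the upper limit of 51 is the usual HEVC QP ceiling for 8-bit video.

```python
import numpy as np

def qp_map(width, height, fixation, base_qp=27, ctu=64, radius=2, step=1, max_qp=51):
    """Return a per-block QP array: base_qp near the fixation, rising with distance.

    All parameters except the fixation point are illustrative assumptions.
    """
    cols = -(-width // ctu)    # ceiling division: blocks per row
    rows = -(-height // ctu)   # ceiling division: blocks per column
    fx, fy = fixation[0] / ctu, fixation[1] / ctu   # fixation in block units
    xs = np.arange(cols) + 0.5                      # block-centre x coordinates
    ys = np.arange(rows) + 0.5                      # block-centre y coordinates
    dist = np.hypot(xs[None, :] - fx, ys[:, None] - fy)
    # No penalty within `radius` blocks; beyond it, +step QP per block of distance.
    offset = np.floor(np.maximum(dist - radius, 0.0)) * step
    return np.clip(base_qp + offset, 0, max_qp).astype(int)
```

For a 1920x1080 frame with the fixation at the centre, `qp_map(1920, 1080, (960, 540))` yields a 17x30 array whose values start at the base QP around the centre and grow toward the corners. How such a map would be handed to a real encoder is outside the scope of this sketch.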

Citations: 0
Frame-based Programming, Stream-Based Processing for Medical Image Processing Applications.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2019-01-01; Epub Date: 2019-01-04; DOI: 10.1007/s11265-018-1422-3
Joost Hoozemans, Rob de Jong, Steven van der Vlugt, Jeroen Van Straten, Uttam Kumar Elango, Zaid Al-Ars

This paper presents and evaluates an approach to deploying image and video processing pipelines that are developed in a frame-oriented style on a stream-oriented hardware platform, such as an FPGA. First, this calls for a specialized streaming memory hierarchy and an accompanying software framework that transparently moves image segments between stages in the image processing pipeline. Second, we use softcore VLIW processors, which are targetable by a C compiler and have hardware debugging capabilities, to evaluate and debug the software before moving to a High-Level Synthesis flow. The algorithm development phase, including debugging and optimizing on the target platform, is often a very time-consuming step in the development of a new product. Our proposed platform allows both software developers and hardware designers to test iterations in a matter of seconds (compilation time) instead of hours (synthesis or circuit simulation time).

Citations: 5
Monotonic Optimization of Dataflow Buffer Sizes.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2019-01-01; Epub Date: 2018-10-23; DOI: 10.1007/s11265-018-1415-2
Martijn Hendriks, Hadi Alizadeh Ara, Marc Geilen, Twan Basten, Ruben Guerra Marin, Rob de Jong, Steven van der Vlugt

Many high data-rate video-processing applications are subject to a trade-off between throughput and the sizes of buffers in the system (the storage distribution). These applications have strict requirements with respect to throughput as this directly relates to the functional correctness. Furthermore, the size of the storage distribution relates to resource usage which should be minimized in many practical cases. The computation kernels of high data-rate video-processing applications can often be specified by cyclo-static dataflow graphs. We therefore study the problem of minimization of the total (weighted) size of the storage distribution under a throughput constraint for cyclo-static dataflow graphs. By combining ideas from the area of monotonic optimization with the causal dependency analysis from a state-of-the-art storage optimization approach, we create an algorithm that scales better than the state-of-the-art approach. Our algorithm can provide a solution and a bound on the suboptimality of this solution at any time, and it iteratively improves this until the optimal solution is found. We evaluate our algorithm using several models from the literature, and on models of a high data-rate video-processing application from the healthcare domain. Our experiments show performance increases up to several orders of magnitude.

Citations: 8
ALMARVI Execution Platform: Heterogeneous Video Processing SoC Platform on FPGA.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2019-01-01; Epub Date: 2019-01-02; DOI: 10.1007/s11265-018-1424-1
Joost Hoozemans, Jeroen van Straten, Timo Viitanen, Aleksi Tervo, Jiri Kadlec, Zaid Al-Ars

The proliferation of processing hardware alternatives allows developers to use various customized computing platforms to run their applications in an optimal way. However, porting application code to custom hardware requires a lot of development and porting effort. This paper describes a heterogeneous computational platform (the ALMARVI execution platform) comprising multiple communicating processors that allow easy programmability through an interface to OpenCL. The ALMARVI platform uses processing elements based on both VLIW and Transport Triggered Architectures (ρ-VEX and TCE cores, respectively). It can be implemented on Zynq devices such as the ZedBoard, and supports OpenCL by means of the pocl (Portable OpenCL) project and our ALMAIF interface specification. This allows developers to execute kernels transparently on either type of processing element, so that execution time can be optimized with minimal design and development effort.

Citations: 6
Non-Uniform Microphone Arrays for Robust Speech Source Localization for Smartphone-Assisted Hearing Aid Devices.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2018-10-01; Epub Date: 2017-11-09; DOI: 10.1007/s11265-017-1297-8
Anshuman Ganguly, Issa Panahi

Robust speech source localization (SSL) is an important component of the speech processing pipeline for hearing aid devices (HADs). SSL via time difference of arrival (TDOA) estimation is known to improve the performance of HADs in noisy environments, thereby providing a better listening experience for hearing aid users. Smartphones can now connect to HADs through a wired or wireless channel. In this paper, we present our findings on the non-uniform non-linear microphone array (NUNLA) geometry for improving SSL for HADs using an L-shaped three-element microphone array available on modern smartphones. The proposed method is implemented on a frame-based TDOA estimation algorithm using a modified dictionary-based singular value decomposition (SVD) method for localizing single speech sources under very low signal-to-noise ratios (SNR). Unlike most methods developed for uniform microphone arrays, the proposed method has low spatial aliasing and low spatial ambiguity while providing robust, low-error estimates with 360° DOA scanning capability. We compare different types of microphone arrays and their performance using the proposed method.
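
As background for the TDOA step, the sketch below estimates the delay between two microphone signals from the peak of their cross-correlation and converts it to an arrival angle for a single microphone pair. This is the textbook baseline, not the paper's modified dictionary-based SVD method, and it ignores the L-shaped three-element geometry. The function names, the far-field assumption, and the speed of sound of 343 m/s are assumptions made for illustration.

```python
import numpy as np

def estimate_tdoa(x, y, fs):
    """Delay of y relative to x, in seconds, from the cross-correlation peak."""
    n = len(x) + len(y) - 1
    # Cross-correlation via FFT, zero-padded to avoid circular wrap-around.
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cc = np.fft.irfft(Y * np.conj(X), n)
    # Rotate so that index i corresponds to lag i - (len(x) - 1).
    cc = np.roll(cc, len(x) - 1)
    lag = int(np.argmax(cc)) - (len(x) - 1)
    return lag / fs

def tdoa_to_angle(tau, mic_distance, c=343.0):
    """Far-field arrival angle in degrees for one microphone pair (assumed model)."""
    return float(np.degrees(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0))))
```

A positive result from `estimate_tdoa` means the second signal lags the first. A single pair can only resolve angles within a half-plane, which is one reason array geometry matters for the full 360° scanning the abstract describes.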

Citations: 4
Run-time Reconfigurable Acceleration for Genetic Programming Fitness Evaluation in Trading Strategies.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2018-01-01; Epub Date: 2017-05-08; DOI: 10.1007/s11265-017-1244-8
Andreea-Ingrid Funie, Paul Grigoras, Pavel Burovskiy, Wayne Luk, Mark Salmon

Genetic programming can be used to identify complex patterns in financial markets which may lead to more advanced trading strategies. However, the computationally intensive nature of genetic programming makes it difficult to apply to real world problems, particularly in real-time constrained scenarios. In this work we propose the use of Field Programmable Gate Array technology to accelerate the fitness evaluation step, one of the most computationally demanding operations in genetic programming. We propose to develop a fully-pipelined, mixed precision design using run-time reconfiguration to accelerate fitness evaluation. We show that run-time reconfiguration can reduce resource consumption by a factor of 2 compared to previous solutions on certain configurations. The proposed design is up to 22 times faster than an optimised, multithreaded software implementation while achieving comparable financial returns.
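
To show why fitness evaluation dominates the cost, here is a minimal software sketch of scoring one candidate trading rule. Every time step requires a full walk of the rule's expression tree, and this is repeated for every individual in every generation. The tuple-based tree encoding, the operator set (`const`, `price`, `sma`, `gt`, `and`), and the cumulative-return fitness are hypothetical choices for illustration; the abstract does not specify the paper's actual rule grammar or fitness function.

```python
import numpy as np

def evaluate(node, prices, t):
    """Recursively evaluate a (hypothetical) expression tree at time index t."""
    op = node[0]
    if op == "const":
        return node[1]
    if op == "price":
        return prices[t]
    if op == "sma":  # simple moving average over the last node[1] samples
        w = node[1]
        return prices[max(0, t - w + 1): t + 1].mean()
    a = evaluate(node[1], prices, t)
    b = evaluate(node[2], prices, t)
    if op == "gt":
        return a > b
    if op == "and":
        return a and b
    raise ValueError(f"unknown operator: {op}")

def fitness(rule, prices):
    """Cumulative return from holding the asset whenever the rule is true."""
    total = 0.0
    for t in range(len(prices) - 1):
        if evaluate(rule, prices, t):
            total += (prices[t + 1] - prices[t]) / prices[t]
    return total
```

For example, `fitness(("gt", ("price",), ("sma", 2)), prices)` scores the rule "hold while the price is above its two-sample moving average". The nested loops over time steps and tree nodes are the kind of regular, repetitive work that lends itself to pipelined hardware.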

Citations: 4
A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2017-12-01; Epub Date: 2017-07-19; DOI: 10.1007/s11265-017-1262-6
Timothy Blattner, Walid Keyrouz, Shuvra S Bhattacharyya, Milton Halem, Mary Brady

Designing applications for scalability is key to improving their performance in hybrid and cluster computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) is an abstract execution model, framework, and API that increases programmer productivity when implementing hybrid workflows for multi-core and multi-GPU systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, keeps multiple GPUs occupied, and uses all available compute resources. Through these abstractions, data motion and memory are explicit; this makes data locality decisions more accessible. To demonstrate the HTGS application program interface (API), we present implementations of two example algorithms: (1) a matrix multiplication that shows how easily task graphs can be used; and (2) a hybrid implementation of microscopy image stitching that reduces code size by ≈ 43% compared to a manually coded hybrid workflow implementation and showcases the minimal overhead of task graphs in HTGS. Both of the HTGS-based implementations show good performance. In image stitching, the HTGS implementation achieves performance similar to that of the hybrid workflow implementation. Matrix multiplication with HTGS achieves 1.3× and 1.8× speedup over the multi-threaded OpenBLAS library for 16k × 16k and 32k × 32k size matrices, respectively.

Citations: 2
FPGA-Based Soft-Core Processors for Image Processing Applications.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2017-01-01; Epub Date: 2016-10-10; DOI: 10.1007/s11265-016-1185-7
Moslem Amiri, Fahad Manzoor Siddiqui, Colm Kelly, Roger Woods, Karen Rafferty, Burak Bardak

With security and surveillance, there is an increasing need to process image data efficiently and effectively either at source or in a large data network. Whilst a Field-Programmable Gate Array has been seen as a key technology for enabling this, the design process has been viewed as problematic in terms of the time and effort needed for implementation and verification. The work here proposes a different approach of using optimized FPGA-based soft-core processors which allows the user to exploit the task and data level parallelism to achieve the quality of dedicated FPGA implementations whilst reducing design time. The paper also reports some preliminary progress on the design flow to program the structure. An implementation for a Histogram of Gradients algorithm is also reported which shows that a performance of 328 fps can be achieved with this design approach, whilst avoiding the long design time, verification and debugging steps associated with conventional FPGA implementations.

Citations: 23
Extending the Generalised Pareto Distribution for Novelty Detection in High-Dimensional Spaces.
IF 1.8; Zone 4, Computer Science; Q2 Mathematics; Pub Date: 2014-01-01; Epub Date: 2013-08-16; DOI: 10.1007/s11265-013-0835-2
David A Clifton, Lei Clifton, Samuel Hugueny, Lionel Tarassenko

Novelty detection involves the construction of a "model of normality", and then classifies test data as being either "normal" or "abnormal" with respect to that model. For this reason, it is often termed one-class classification. The approach is suitable for cases in which examples of "normal" behaviour are commonly available, but in which cases of "abnormal" data are comparatively rare. When performing novelty detection, we are typically most interested in the tails of the normal model, because it is in these tails that a decision boundary between "normal" and "abnormal" areas of data space usually lies. Extreme value statistics provides an appropriate theoretical framework for modelling the tails of univariate (or low-dimensional) distributions, using the generalised Pareto distribution (GPD), which can be demonstrated to be the limiting distribution for data occurring within the tails of most practically-encountered probability distributions. This paper provides an extension of the GPD, allowing the modelling of probability distributions of arbitrarily high dimension, such as occurs when using complex, multimodal, multivariate distributions for performing novelty detection in most real-life cases. We demonstrate our extension to the GPD using examples from patient physiological monitoring, in which we have acquired data from hospital patients in large clinical studies of high-acuity wards, and in which we wish to determine "abnormal" patient data, such that early warning of patient physiological deterioration may be provided.
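
For orientation, the classical univariate use of the GPD (the starting point that the paper extends) can be sketched as a peaks-over-threshold tail estimate. This sketch covers only the one-dimensional case and does not reproduce the paper's high-dimensional extension. The method-of-moments fit, which is valid only when the shape parameter is below 1/2, and the function names are assumptions chosen to keep the example short; maximum-likelihood fitting is more common in practice.

```python
import numpy as np

def gpd_sf(y, xi, sigma):
    """Survival function P(Y > y) of a GPD with shape xi and scale sigma, for y >= 0."""
    y = np.asarray(y, dtype=float)
    if abs(xi) < 1e-12:                       # xi -> 0 limit is the exponential tail
        return np.exp(-y / sigma)
    base = np.maximum(1.0 + xi * y / sigma, 0.0)   # zero beyond the upper end point (xi < 0)
    return base ** (-1.0 / xi)

def fit_gpd_moments(exceedances):
    """Method-of-moments estimates (xi, sigma); assumes xi < 1/2 so the variance exists."""
    m = np.mean(exceedances)
    v = np.var(exceedances)
    r = m * m / v
    return 0.5 * (1.0 - r), 0.5 * m * (1.0 + r)

def tail_probability(x, data, threshold):
    """Estimate P(X > x) for x at or above the threshold via peaks-over-threshold."""
    data = np.asarray(data, dtype=float)
    exceed = data[data > threshold] - threshold
    xi, sigma = fit_gpd_moments(exceed)
    return exceed.size / data.size * gpd_sf(x - threshold, xi, sigma)
```

A test point would then be flagged "abnormal" when `tail_probability` falls below a chosen novelty threshold. The choice of that threshold, and of the tail threshold itself, is application-specific.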

Citations: 0
A Fully Implantable, Programmable and Multimodal Neuroprocessor for Wireless, Cortically Controlled Brain-Machine Interface Applications.
IF 1.8 Zone 4 Computer Science Q2 Mathematics Pub Date: 2012-12-01 Epub Date: 2011-06-15 DOI: 10.1007/s11265-012-0670-x
Fei Zhang, Mehdi Aghagolzadeh, Karim Oweiss

Reliability, scalability and clinical viability are of utmost importance in the design of wireless Brain Machine Interface systems (BMIs). This paper reports on the design and implementation of a neuroprocessor for conditioning raw extracellular neural signals recorded through microelectrode arrays chronically implanted in the brains of awake, behaving rats. The neuroprocessor design exploits a sparse representation of the neural signals to cope with the limited wireless telemetry bandwidth. We demonstrate a multimodal processing capability (monitoring, compression, and spike sorting) inherent in the neuroprocessor that supports a wide range of scenarios in real experimental conditions. A wireless transmission link with a rate-dependent compression strategy is shown to preserve information fidelity in the neural data. At 32 channels, the neuroprocessor has been fully implemented on a 5 mm × 5 mm nano-FPGA, and the prototype consumed 5.19 mW of power, bringing its performance within the power-size constraints for clinical use. The optimal design for compression and sorting performance was evaluated across multiple sampling frequencies, wavelet basis choices and power consumption levels.
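To make the idea of a sparse wavelet representation concrete, here is a minimal sketch that transforms a short spike-like signal with an orthonormal Haar wavelet, keeps only the largest coefficients, and reconstructs the signal. The Haar basis, the function names, the toy signal and the choice of eight retained coefficients are assumptions for illustration; the abstract does not say which wavelet basis or thresholding rule the neuroprocessor uses.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Full orthonormal Haar decomposition; len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:n] = approx + detail  # details stay put, approximations recurse
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding from the coarsest level upwards."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / SQRT2)
            merged.append((a - d) / SQRT2)
        out[:2 * n] = merged
        n *= 2
    return out


def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


# Hypothetical 64-sample trace: flat baseline with one short biphasic spike.
trace = [0.0] * 64
trace[20:24] = [1.0, 3.0, -2.0, -0.5]

coeffs = haar_forward(trace)
sparse = keep_largest(coeffs, 8)  # only these 8 values would be transmitted
approx_trace = haar_inverse(sparse)
error = sum((a - b) ** 2 for a, b in zip(trace, approx_trace))
print(f"nonzero coefficients: {sum(c != 0.0 for c in sparse)}, squared error: {error:.6f}")
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, which is what makes a fixed coefficient budget a simple knob for trading bandwidth against fidelity.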

Citations: 22