
Latest publications: 2011 IEEE 9th Symposium on Application Specific Processors (SASP)

USHA: Unified software and hardware architecture for video decoding
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941074
Adarsha Rao, S. Nandy, Hristo Nikolov, E. Deprettere
Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The lightweight processor tiles and the reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process-network-oriented compilation flow achieves realization-agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi-hardware and full-hardware implementations of a video decoder. An application quality-of-service-aware scheduler monitors and controls the operation of the entire system. We prove the concept through a prototype of the architecture on an off-the-shelf FPGA. The FPGA prototype shows performance scaling from QCIF to 1080p resolutions in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to another without any frame loss.
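To make the process-network idea concrete, the sketch below models a decoder as stages that exchange tokens only through FIFO channels; the stage names, token contents and scheduler loop are hypothetical stand-ins, not USHA's actual tool flow, but this separation is what lets a stage be re-bound from a processor tile to a hardware tile without touching the others.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>
#include <vector>

// Toy process-network view of a decoder. Stages communicate only through
// FIFO channels, so a stage could later be bound to a processor tile or to a
// reconfigurable hardware tile without the other stages changing. Stage
// names, token contents and the scheduler loop are invented for illustration.
using Token = std::vector<uint8_t>;
using Fifo  = std::queue<Token>;

void parse(Fifo& in, Fifo& out) {                 // stand-in for bitstream parsing
    while (!in.empty()) { out.push(in.front()); in.pop(); }
}

void reconstruct(Fifo& in, Fifo& out) {           // stand-in for inverse transform + prediction
    while (!in.empty()) {
        Token t = in.front(); in.pop();
        for (uint8_t& b : t) b = static_cast<uint8_t>(255 - b);
        out.push(t);
    }
}

void display(Fifo& in) {                          // sink
    while (!in.empty()) { std::printf("frame of %zu bytes\n", in.front().size()); in.pop(); }
}

int main() {
    Fifo bitstream, symbols, frames;
    for (int i = 0; i < 3; ++i) bitstream.push(Token(64, static_cast<uint8_t>(i)));
    // A naive round-robin loop stands in for the QoS-aware scheduler.
    while (!bitstream.empty() || !symbols.empty() || !frames.empty()) {
        parse(bitstream, symbols);
        reconstruct(symbols, frames);
        display(frames);
    }
    return 0;
}
```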
Citations: 6
Integrating formal verification and high-level processor pipeline synthesis
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941073
E. Nurvitadhi, J. Hoe, T. Kam, Shih-Lien Lu
When a processor implementation is synthesized from a specification using an automatic framework, the implementation should still be verified against its specification to ensure the automatic framework introduced no errors. This paper presents our effort in integrating fully automated formal verification with a high-level processor pipeline synthesis framework. As an integral part of the pipeline synthesis, our framework also emits SMV models for checking the functional equivalence between the output pipelined processor implementation and its input non-pipelined specification. Well-known compositional model checking techniques are applied automatically to curtail state explosion during model checking. The paper reports case studies of applying this integrated framework to synthesize and formally verify pipelined RISC and CISC processors.
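The correctness question being checked can be pictured, purely as a toy (the paper uses SMV models and compositional model checking, not simulation), by exhaustively comparing a small non-pipelined datapath specification against a 2-stage pipelined version of it, aligning outputs by the pipeline latency; the multiply-add datapath below is invented for the example.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

// Non-pipelined specification: one result per input, computed combinationally.
static uint8_t spec(uint8_t a, uint8_t b, uint8_t c) {
    return static_cast<uint8_t>((a * b + c) & 0xFF);
}

// A 2-stage pipelined "implementation": multiply in stage 1, add in stage 2.
struct Pipelined {
    uint16_t s1_prod = 0;
    uint8_t  s1_c = 0;      // stage-1 registers
    uint8_t  s2_out = 0;    // stage-2 register
    // One clock edge; returns the output visible this cycle (2-cycle latency).
    uint8_t clock(uint8_t a, uint8_t b, uint8_t c) {
        uint8_t out = s2_out;
        s2_out  = static_cast<uint8_t>((s1_prod + s1_c) & 0xFF);
        s1_prod = static_cast<uint16_t>(a) * b;
        s1_c    = c;
        return out;
    }
};

int main() {
    const size_t LATENCY = 2;
    std::vector<std::array<uint8_t, 3>> inputs;
    for (int a = 0; a < 16; ++a)            // exhaustive over a tiny 4-bit input space
        for (int b = 0; b < 16; ++b)
            for (int c = 0; c < 16; ++c)
                inputs.push_back({static_cast<uint8_t>(a), static_cast<uint8_t>(b),
                                  static_cast<uint8_t>(c)});
    Pipelined dut;
    bool equivalent = true;
    for (size_t i = 0; i < inputs.size() + LATENCY; ++i) {
        uint8_t a = 0, b = 0, c = 0;        // feed bubbles while draining the pipeline
        if (i < inputs.size()) { a = inputs[i][0]; b = inputs[i][1]; c = inputs[i][2]; }
        uint8_t out = dut.clock(a, b, c);
        if (i >= LATENCY) {
            const auto& in = inputs[i - LATENCY];
            if (out != spec(in[0], in[1], in[2])) { equivalent = false; break; }
        }
    }
    std::printf("pipelined matches spec on all inputs: %s\n", equivalent ? "yes" : "NO");
    return 0;
}
```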
Citations: 1
Frameworks for GPU Accelerators: A comprehensive evaluation using 2D/3D image registration
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941083
Richard Membarth, Frank Hannig, J. Teich, M. Körner, Wieland Eckert
In the last decade, there has been dramatic growth in research and development of massively parallel many-core architectures such as graphics hardware, both in academia and industry. This has also changed the way programs are written in order to leverage the processing power of a multitude of cores on the same hardware. In the beginning, programmers had to use special graphics programming interfaces to express general-purpose computations on graphics hardware. Today, several frameworks exist to relieve the programmer of such tasks. In this paper, we present five frameworks for parallelization on GPU accelerators, namely RapidMind, PGI Accelerator, HMPP Workbench, OpenCL, and CUDA. To evaluate these frameworks, a real-world application from medical imaging, 2D/3D image registration, is investigated.
Citations: 20
A massively parallel implementation of QC-LDPC decoder on GPU
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941084
Guohui Wang, Michael Wu, Yang Sun, Joseph R. Cavallaro
The graphics processing unit (GPU) is able to provide a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map real-world applications to the GPU and fully utilize its computational power. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper shows the effort we made to map the algorithm onto the massively parallel architecture of the GPU and to fully utilize the GPU's computational resources to significantly boost performance. Moreover, several efficient data structures have been proposed to reduce the memory access latency and the memory bandwidth requirement. Experimental results show that the proposed GPU-based LDPC decoding accelerator can take advantage of the multi-core computational power provided by the GPU and achieve high throughput of up to 100.3 Mbps.
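LDPC decoding iterates message passing between variable and check nodes, and the check-node update dominates the work. The sketch below shows the common min-sum form of that update using the two-minima trick that GPU and hardware decoders typically rely on; the paper does not spell out its exact update rule or data layout, so treat this purely as background.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Min-sum check-node update: for each connected variable node, the outgoing
// message is the product of the signs and the minimum of the magnitudes of
// all *other* incoming messages. Tracking the two smallest magnitudes lets
// every output be formed without re-scanning the inputs.
std::vector<float> check_node_update(const std::vector<float>& in) {
    float min1 = 1e30f, min2 = 1e30f;   // smallest and second-smallest |msg|
    int   min_idx = -1;
    int   sign_prod = 1;
    for (int i = 0; i < (int)in.size(); ++i) {
        float mag = std::fabs(in[i]);
        if (in[i] < 0) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; min_idx = i; }
        else if (mag < min2) { min2 = mag; }
    }
    std::vector<float> out(in.size());
    for (int i = 0; i < (int)in.size(); ++i) {
        int   sign = (in[i] < 0) ? -sign_prod : sign_prod;  // exclude this edge's own sign
        float mag  = (i == min_idx) ? min2 : min1;          // exclude this edge's own magnitude
        out[i] = sign * mag;
    }
    return out;
}

int main() {
    std::vector<float> msgs = {0.9f, -0.4f, 1.7f, -2.2f};   // LLRs arriving from variable nodes
    for (float m : check_node_update(msgs)) std::printf("%+.2f ", m);
    std::printf("\n");
    return 0;
}
```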
Citations: 60
FPGA based parallel architecture implementation of Stacked Error Diffusion algorithm
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941080
R. Venugopal, J. Heath, D. Lau
Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is used since printers have a limited availability of inks and cannot reproduce all the color intensities in a continuous image. Error diffusion is a halftoning algorithm that iteratively quantizes pixels in a neighborhood-dependent fashion. This manuscript focuses on the development, design, and Hardware Description Language (HDL) functional and performance simulation validation of a parallel, scalable hardware architecture for high-performance implementation of a high-quality Stacked Error Diffusion algorithm. A CMYK printer utilizing the high-quality error diffusion algorithm would be required to execute error diffusion 16 times per pixel, resulting in a potentially high computational cost. The algorithm, originally described in ‘C’, requires significant processing time when implemented on a conventional computer system based on a single Central Processing Unit (CPU). Thus, a new scalable, high-performance parallel hardware processor architecture is developed to implement the algorithm, and it is implemented and tested on a single Programmable Logic Device (PLD) based Field Programmable Gate Array (FPGA) chip. There is a significant decrease in the run time of the algorithm when run on the newly proposed parallel architecture implemented in FPGA technology compared to execution on a single-CPU system.
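For readers unfamiliar with error diffusion, the sketch below is the classic Floyd-Steinberg form of the technique, not the paper's Stacked Error Diffusion (whose weights and scan structure differ); it shows the neighborhood-dependent quantization and the loop-carried error propagation that make a parallel FPGA implementation non-trivial.

```cpp
#include <cstdio>
#include <vector>

// Classic Floyd-Steinberg error diffusion on a grayscale image stored
// row-major with values in [0,255]: each pixel is thresholded to 0 or 255
// and the quantization error is pushed onto not-yet-visited neighbours.
void error_diffuse(std::vector<float>& img, int w, int h) {
    auto at = [&](int x, int y) -> float& { return img[y * w + x]; };
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float old_val = at(x, y);
            float new_val = old_val < 128.0f ? 0.0f : 255.0f;
            float err = old_val - new_val;            // error carried to later pixels
            at(x, y) = new_val;
            if (x + 1 < w)              at(x + 1, y)     += err * 7.0f / 16.0f;
            if (x > 0 && y + 1 < h)     at(x - 1, y + 1) += err * 3.0f / 16.0f;
            if (y + 1 < h)              at(x,     y + 1) += err * 5.0f / 16.0f;
            if (x + 1 < w && y + 1 < h) at(x + 1, y + 1) += err * 1.0f / 16.0f;
        }
    }
}

int main() {
    const int w = 16, h = 4;
    std::vector<float> img(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) img[y * w + x] = 255.0f * x / (w - 1);  // horizontal ramp
    error_diffuse(img, w, h);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) std::printf("%c", img[y * w + x] > 0.0f ? '#' : '.');
        std::printf("\n");
    }
    return 0;
}
```

The dependence of each pixel on errors diffused from earlier pixels in the scan order is exactly what the proposed parallel architecture has to break up or pipeline around.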
Citations: 3
ISIS: An accelerator for Sphinx speech recognition
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941078
A. Chun, Jenny X. Chang, Zhen Fang, Ravishankar R. Iyer, M. Deisher
The ability to interact naturally with devices is becoming increasingly important. Speech recognition is one well-known solution that provides easy, hands-free user-device interaction. However, speech recognition has significant computation and memory bandwidth requirements, making it challenging to deliver high-performance, real-time recognition at ultra-low power on handheld devices. In this paper, we present a speech recognition accelerator called ISIS. We show the overall execution flow of the accelerated speech recognition solution along with optimizations and the key metrics of performance, area and power.
Citations: 8
Memory-efficient volume ray tracing on GPU for radiotherapy
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941076
Bo Zhou, X. Hu, D. Chen
Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many radiation dose calculation methods in radiotherapy. Recent advances in graphics processing units (GPUs) have made real-time dose calculation a reachable goal. However, the performance of known GPU methods for volume ray tracing is bounded by memory throughput, which leads to inefficient usage of the GPU's computational capacity. This paper introduces a simple yet effective ray tracing technique aiming to improve the memory bandwidth utilization of the GPU when processing a massive number of rays. The idea is to exploit the coherent relationship between the rays and match the ray tracing behavior with the underlying characteristics of the GPU memory system. The proposed method has been evaluated on four phantom setups using randomly generated rays. The collapsed-cone convolution/superposition (CCCS) dose calculation method is also implemented with and without the proposed approach to verify the feasibility of our method. Compared with the direct GPU implementation of the popular 3DDDA algorithm, the new method provides a speedup in the range of 1.8–2.7X for the given phantom settings. Major performance factors such as ray origins, phantom sizes, and pyramid sizes are also analyzed. The proposed technique is also shown to lead to a speedup of 1.3–1.6X over the original GPU implementation of the CCCS algorithm.
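The 3DDDA baseline steps a ray from voxel to voxel of a uniform grid. The minimal sketch below is a textbook incremental grid traversal (unit-sized voxels, ray origin assumed to lie inside the grid) showing the per-ray work; the paper's contribution lies in how many such rays are grouped and ordered against the GPU memory system, which this sketch does not attempt to reproduce.

```cpp
#include <cmath>
#include <cstdio>

// Minimal incremental 3D grid traversal: visit the cells of an nx*ny*nz grid
// of unit voxels pierced by a ray whose origin is already inside the grid.
void traverse(float ox, float oy, float oz, float dx, float dy, float dz,
              int nx, int ny, int nz) {
    int ix = (int)ox, iy = (int)oy, iz = (int)oz;   // current voxel
    int step_x = dx >= 0 ? 1 : -1, step_y = dy >= 0 ? 1 : -1, step_z = dz >= 0 ? 1 : -1;
    // Ray parameter t at which the ray first crosses the next voxel boundary on each axis.
    auto first_t = [](float o, float d, int i, int s) {
        if (d == 0.0f) return 1e30f;
        float boundary = (float)(s > 0 ? i + 1 : i);
        return (boundary - o) / d;
    };
    float t_max_x = first_t(ox, dx, ix, step_x);
    float t_max_y = first_t(oy, dy, iy, step_y);
    float t_max_z = first_t(oz, dz, iz, step_z);
    float t_dt_x = dx != 0.0f ? std::fabs(1.0f / dx) : 1e30f;   // t to cross one voxel in x
    float t_dt_y = dy != 0.0f ? std::fabs(1.0f / dy) : 1e30f;
    float t_dt_z = dz != 0.0f ? std::fabs(1.0f / dz) : 1e30f;

    while (ix >= 0 && ix < nx && iy >= 0 && iy < ny && iz >= 0 && iz < nz) {
        std::printf("visit voxel (%d,%d,%d)\n", ix, iy, iz);    // accumulate attenuation/dose here
        if (t_max_x <= t_max_y && t_max_x <= t_max_z) { ix += step_x; t_max_x += t_dt_x; }
        else if (t_max_y <= t_max_z)                  { iy += step_y; t_max_y += t_dt_y; }
        else                                          { iz += step_z; t_max_z += t_dt_z; }
    }
}

int main() {
    traverse(0.5f, 0.5f, 0.5f, 1.0f, 0.7f, 0.3f, 8, 8, 8);
    return 0;
}
```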
Citations: 5
A parallel accelerator for semantic search
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941090
Abhinandan Majumdar, S. Cadambi, S. Chakradhar, H. Graf
Semantic text analysis is a technique used in advertisement placement, cognitive databases and search engines. With increasing amounts of data and stringent response-time requirements, improving the underlying implementation of semantic analysis becomes critical. To this end, we look at Supervised Semantic Indexing (SSI), a recently proposed algorithm for semantic analysis. SSI ranks a large number of documents based on their semantic similarity to a text query. For each query, it computes millions of dot products on unstructured data, generates a large intermediate result, and then performs ranking. SSI underperforms on both state-of-the-art multi-cores and GPUs. Its performance scalability on multi-cores is hampered by their limited support for fine-grained data parallelism. GPUs, though they beat multi-cores by running thousands of threads, cannot handle the large intermediate data because of their small on-chip memory. Motivated by this, we present an FPGA-based hardware accelerator for semantic analysis. As a key feature, the accelerator combines hundreds of simple processing elements together with in-memory processing to simultaneously generate and process (consume) the large intermediate data. It also supports “dynamic parallelism”, a feature that configures the PEs differently for full utilization of the available processing logic after the FPGA is programmed. Our FPGA prototype is 10–13x faster than a 2.5 GHz quad-core Xeon, and 1.5–5x faster than a 240-core 1.3 GHz Tesla GPU, despite operating at a modest frequency of 125 MHz.
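The score-then-rank pattern described above can be pictured with the short sketch below; the dense float vectors and top-k partial sort are assumptions made for illustration, not the SSI code itself. The scoring loop is the massively parallel part the processing elements stream through, and the score array is the large intermediate result that the in-memory ranking consumes.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Scored { int doc; float score; };

// Dot every document embedding with the query embedding, then keep the best top_k.
std::vector<Scored> rank(const std::vector<std::vector<float>>& docs,
                         const std::vector<float>& query, int top_k) {
    std::vector<Scored> scores;
    scores.reserve(docs.size());
    for (int d = 0; d < (int)docs.size(); ++d) {
        float s = 0.0f;
        for (size_t i = 0; i < query.size(); ++i) s += docs[d][i] * query[i];
        scores.push_back({d, s});                 // the "large intermediate result"
    }
    int k = std::min(top_k, (int)scores.size());
    // Partial sort avoids fully ordering the whole intermediate array.
    std::partial_sort(scores.begin(), scores.begin() + k, scores.end(),
                      [](const Scored& a, const Scored& b) { return a.score > b.score; });
    scores.resize(k);
    return scores;
}

int main() {
    std::vector<std::vector<float>> docs = {{1, 0, 2}, {0, 1, 1}, {2, 2, 0}};
    std::vector<float> query = {1, 1, 0};
    for (const auto& r : rank(docs, query, 2)) std::printf("doc %d score %.1f\n", r.doc, r.score);
    return 0;
}
```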
Citations: 2
3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941081
J. Cong, Muhuan Huang, Yi Zou
A GPU device typically has higher off-chip bandwidth than FPGA-based systems, so a GPU should typically perform better for bandwidth-bounded massively parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic computation (2 MADDs per dimension). While this application is clearly bandwidth bounded, the differences in the memory subsystems translate to different bandwidth optimization techniques. Our implementations on the GPU and FPGA platforms show 26X and 33X speedups respectively over optimized single-thread code on the CPU.
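A recursive (IIR) Gaussian approximation filters each axis with a fixed, tiny amount of arithmetic regardless of the Gaussian's width, which is why the whole computation is bandwidth-bounded: every pass touches the full volume but does almost no math per sample. The sketch below shows a first-order forward-plus-backward pass along one line purely to illustrate that recursive structure; the filter order and coefficients actually used in the paper are not given in the abstract.

```cpp
#include <cstdio>
#include <vector>

// One 1-D recursive smoothing pass (forward then backward), the building block
// of a separable recursive "Gaussian-like" IIR: applied along x, y and z in
// turn, it filters a 3-D volume with one multiply-add per sample per direction.
void recursive_pass(std::vector<float>& line, float a) {
    // forward: y[i] = y[i-1] + a * (x[i] - y[i-1]), computed in place
    for (size_t i = 1; i < line.size(); ++i)
        line[i] = line[i - 1] + a * (line[i] - line[i - 1]);
    // backward: the same recursion swept in the opposite direction
    for (int i = (int)line.size() - 2; i >= 0; --i)
        line[i] = line[i + 1] + a * (line[i] - line[i + 1]);
}

int main() {
    std::vector<float> line = {0, 0, 0, 10, 0, 0, 0};   // an impulse to be smeared
    recursive_pass(line, 0.5f);
    for (float v : line) std::printf("%.2f ", v);
    std::printf("\n");
    return 0;
}
```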
Citations: 11
Scalable object detection accelerators on FPGAs using custom design space exploration
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941089
Chen-Chun Huang, F. Vahid
We discuss FPGA implementations of object (such as face) detectors in video streams using the accurate Haar-feature-based algorithm. Rather than creating one implementation for one FPGA, we develop a method to generate a series of implementations that have different size and performance to target different FPGA devices. The automatic generation is enabled by custom design space exploration of a particular design problem relating to the communication architecture used to support different numbers of image classifiers. The exploration algorithm uses content information in each feature set to optimize and generate a scalable communication architecture. We generated fully working implementations for Xilinx Virtex5 LX50T, LX110T, and LX155T FPGA devices, using various amounts of available device capacity, leading to speedups ranging from 0.6x to 25x compared to a 3.0 GHz Pentium 4 desktop machine. Automated generators that include custom design space exploration may become more necessary when creating hardware accelerators intended for use across a wide range of existing and future FPGA devices.
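As background on what each image classifier evaluates, the sketch below computes an integral image and one two-rectangle Haar edge feature in constant time per rectangle; this is generic Viola-Jones-style arithmetic shown for illustration, whereas the paper's focus is the communication architecture that feeds many such classifiers on the FPGA.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Integral image plus a two-rectangle Haar feature, the basic operation a
// Haar-based classifier cascade evaluates many times per detection window.
struct Integral {
    int w, h;                              // padded dimensions (image size + 1)
    std::vector<uint32_t> s;               // s[y*w + x] = sum of pixels in [0,x) x [0,y)
    Integral(const std::vector<uint8_t>& img, int w_, int h_)
        : w(w_ + 1), h(h_ + 1), s(w * h, 0) {
        for (int y = 1; y < h; ++y)
            for (int x = 1; x < w; ++x)
                s[y * w + x] = img[(y - 1) * w_ + (x - 1)]
                             + s[y * w + (x - 1)] + s[(y - 1) * w + x]
                             - s[(y - 1) * w + (x - 1)];
    }
    // Sum of the rectangle with top-left (x,y) and size rw x rh, in O(1).
    uint32_t rect(int x, int y, int rw, int rh) const {
        return s[(y + rh) * w + (x + rw)] - s[y * w + (x + rw)]
             - s[(y + rh) * w + x] + s[y * w + x];
    }
};

int main() {
    std::vector<uint8_t> img(16 * 16, 50);                    // flat 16x16 test image
    for (int y = 0; y < 16; ++y)
        for (int x = 8; x < 16; ++x) img[y * 16 + x] = 200;   // bright right half
    Integral ii(img, 16, 16);
    // Two-rectangle edge feature: right half minus left half of the window.
    int32_t feature = (int32_t)ii.rect(8, 0, 8, 16) - (int32_t)ii.rect(0, 0, 8, 16);
    std::printf("edge feature response: %d\n", (int)feature); // large value: vertical edge present
    return 0;
}
```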
Citations: 12