Parallel Processing on FPGAs: The Effect of Profiling on Performance

2006 6th International Workshop on System on Chip for Real Time Applications. Pub Date: 2006-12-01. DOI: 10.1109/IWSOC.2006.348232

Xiaoguang Li, S. Areibi, R. Dony


Citations: 1

Abstract

The processing elements, logic resources, and on-chip block RAMs of modern FPGAs can be used not only for prototyping custom hardware modules but also for parallel processing, by implementing multiple processors for a single task. This paper compares the performance of a single-processor implementation with two types of dual-processor implementations of the widely used radix-2 n-point FFT algorithm (Cooley and Tukey, 1965) in terms of processing speed and FPGA resource utilization. In the first dual-processor implementation, the partitioning is based on the computational complexity, O(n log n), of the radix-2 FFT algorithm. In the second, the partitioning is based on a detailed profiling procedure applied to each line of code in the single-processor implementation. The results show that the first dual-processor implementation is on average 1.3 times faster than the single-processor implementation, whereas the second is about 1.9 times faster, which is very close to the expected speedup. This shows that detailed profiling, which takes all factors into consideration, is crucial for identifying the bottlenecks of an algorithm, so that the algorithm can be mapped efficiently onto a multiprocessor system on the basis of a correct partitioning decision.
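To make the partitioning argument concrete, the following is a minimal sketch, not the authors' implementation. It shows a textbook iterative radix-2 FFT (bit-reversal pass followed by butterfly stages) and a hypothetical helper, `two_processor_speedup`, that models the speedup of a two-processor split as total cost divided by the load of the busier processor. The function names and the per-segment costs in the usage note are illustrative assumptions; they are not measurements from the paper.

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT.

    len(x) must be a power of two. Returns a new list of complex values.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    # Bit-reversal permutation: reorder inputs so butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # log2(n) butterfly stages, each touching all n points.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        size *= 2
    return a


def two_processor_speedup(costs, assignment):
    """Idealized speedup of a two-processor split (hypothetical model).

    costs:      measured cost of each code segment (any consistent unit)
    assignment: 0 or 1 per segment, naming the processor that runs it
    Ignores communication and synchronization overhead.
    """
    loads = [0.0, 0.0]
    for cost, proc in zip(costs, assignment):
        loads[proc] += cost
    return sum(costs) / max(loads)
```

As a usage example with made-up costs `[5, 5, 5, 5, 10, 10]`, splitting by segment count (`[0, 0, 0, 1, 1, 1]`) leaves one processor with 25 of the 40 units, for a modeled speedup of 1.6, while a split informed by the measured costs (`[0, 0, 1, 1, 0, 1]`) balances the loads at 20 each and reaches 2.0. This is the general effect the abstract attributes to profiling: a partition that looks even on paper can still be unbalanced once the real per-line costs are known.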



