Parallel Processing on FPGAs: The Effect of Profiling on Performance

Xiaoguang Li, S. Areibi, R. Dony
{"title":"Parallel Processing on FPGAs: The Effect of Profiling on Performance","authors":"Xiaoguang Li, S. Areibi, R. Dony","doi":"10.1109/IWSOC.2006.348232","DOIUrl":null,"url":null,"abstract":"The processing elements, logic resources, and on-chip block RAMs of modern FPGAs can not only be used for prototyping custom hardware modules, but also for parallel processing purposes by implementing multiple processors for a single task. This paper compares the performance of a single-processor implementation with two types of dual-processor implementations for a widely used radix-2 n-point FFT algorithm (Kooley and Tuckey, 1965) in terms of processing speed and FPGA resource utilization. In the first dual-processor implementation, the partitioning is performed based on the computation complexity - O(nlog(n)) of the radix-2 FFT algorithm. In the second implementation, the partitioning is based on a detailed profiling procedure applied to each line of the code in the single-processor implementation. Results obtained show that the speedup of the first dual-processor implementation is on average 1.3times faster than the single-processor implementation, whereas the second dual-processor implementation is about 1.9times faster which is very close to the expected speedup. This result shows that detailed profiling is crucial in identifying the bottlenecks of an algorithm (i.e., all the factors are taken into consideration) and consequently the algorithm can be efficiently mapped on a multiprocessor system based on the correct decision","PeriodicalId":134742,"journal":{"name":"2006 6th International Workshop on System on Chip for Real Time Applications","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 6th International Workshop on System on Chip for Real Time Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IWSOC.2006.348232","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The processing elements, logic resources, and on-chip block RAMs of modern FPGAs can not only be used for prototyping custom hardware modules, but also for parallel processing purposes by implementing multiple processors for a single task. This paper compares the performance of a single-processor implementation with two types of dual-processor implementations for a widely used radix-2 n-point FFT algorithm (Kooley and Tuckey, 1965) in terms of processing speed and FPGA resource utilization. In the first dual-processor implementation, the partitioning is performed based on the computation complexity - O(nlog(n)) of the radix-2 FFT algorithm. In the second implementation, the partitioning is based on a detailed profiling procedure applied to each line of the code in the single-processor implementation. Results obtained show that the speedup of the first dual-processor implementation is on average 1.3times faster than the single-processor implementation, whereas the second dual-processor implementation is about 1.9times faster which is very close to the expected speedup. This result shows that detailed profiling is crucial in identifying the bottlenecks of an algorithm (i.e., all the factors are taken into consideration) and consequently the algorithm can be efficiently mapped on a multiprocessor system based on the correct decision
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
fpga的并行处理:分析对性能的影响
现代fpga的处理元件、逻辑资源和片上块ram不仅可以用于定制硬件模块的原型设计,还可以通过为单个任务实现多个处理器来实现并行处理目的。本文在处理速度和FPGA资源利用率方面比较了广泛使用的基数-2 n点FFT算法(Kooley和Tuckey, 1965)的单处理器实现与两种类型的双处理器实现的性能。在第一个双处理器实现中,分区是基于计算复杂度——基数-2 FFT算法的0 (nlog(n))来执行的。在第二个实现中,分区基于应用于单处理器实现中的每一行代码的详细分析过程。结果表明,第一种双处理器实现的平均加速速度是单处理器实现的1.3倍,而第二种双处理器实现的平均加速速度约为1.9倍,与预期的加速速度非常接近。该结果表明,详细的分析对于识别算法的瓶颈(即考虑到所有因素)至关重要,因此可以根据正确的决策将算法有效地映射到多处理器系统上
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
FPGA-Based Low-level CAN Protocol Testing A generic method for fault injection in circuits SoC Design Quality, Cycletime, and Yield Improvement Through DfM A Benchmark Approach for Compilers in Reconfigurable Hardware Fragmentation Aware Placement in Reconfigurable Devices
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1