Approaching overhead-free execution on FPGA soft-processors
Charles Eric LaForest, J. Anderson, J. Gregory Steffan
2014 International Conference on Field-Programmable Technology (FPT), pp. 99-106, 2014-12-01
DOI: 10.1109/FPT.2014.7082760 (https://doi.org/10.1109/FPT.2014.7082760)
Citations: 3
Abstract
Implementing systems on FPGA soft-processors, rather than as custom hardware, eases and accelerates development, but at the cost of a large reduction in performance. Orthogonal to limitations in parallelism or clock frequency, this reduction originates primarily in the intrinsic addressing and flow-control overheads of scalar microprocessors, which spend a considerable number of cycles interleaving address calculations and branch decisions with the actual useful work. We present an improved FPGA soft-processor architecture that statically overlaps these "overhead" computations and executes them in parallel with the "useful" computations. This significantly reduces the number of processor cycles needed to execute sequential programs, while reducing the maximum clock frequency to 0.939x of its original value. In addition to eliminating almost all overhead computations, the proposed soft-processor can operate at 500 MHz on the Altera Stratix IV FPGA, which is 0.909x of the absolute maximum rating. Combined, the high speed and execution efficiency widen the range of FPGA designs amenable to soft-processors rather than custom hardware. We evaluate our cycle-count improvements with multiple benchmarks, achieving speedups ranging from 1.07x for control-heavy code to 1.92x for looping code, never performing worse than the original sequential code, and always performing better than a totally unrolled loop.
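The abstract reports cycle-count speedups and a clock-frequency ratio separately. As a rough sketch of how the two might combine, the Python below multiplies each cycle-count speedup by the 0.939x frequency ratio, and also derives the absolute maximum rating implied by "500 MHz is 0.909x of the maximum". It assumes that both the original and the improved processor run at their respective maximum clock frequencies. That assumption, the function names, and the resulting net figures are illustrative and are not stated in the abstract.

```python
# Figures quoted in the abstract.
FREQ_RATIO = 0.939           # improved Fmax relative to the original Fmax
CLOCK_MHZ = 500.0            # reported operating frequency on Stratix IV
FRACTION_OF_ABS_MAX = 0.909  # reported fraction of the absolute maximum rating
CYCLE_SPEEDUPS = {"control-heavy": 1.07, "looping": 1.92}


def net_speedup(cycle_speedup, freq_ratio=FREQ_RATIO):
    """Wall-clock speedup under the ASSUMPTION that both designs run at Fmax.

    Execution time = cycles / frequency, so the time ratio is the
    cycle-count speedup multiplied by the frequency ratio.
    """
    return cycle_speedup * freq_ratio


def implied_absolute_max_mhz(clock_mhz=CLOCK_MHZ, fraction=FRACTION_OF_ABS_MAX):
    """Absolute maximum rating implied by the two figures in the abstract."""
    return clock_mhz / fraction


if __name__ == "__main__":
    for kind, speedup in CYCLE_SPEEDUPS.items():
        print(f"{kind}: cycle speedup {speedup:.2f}x, "
              f"net (assumed) {net_speedup(speedup):.3f}x")
    print(f"implied absolute maximum: {implied_absolute_max_mhz():.1f} MHz")
```

Under this assumption the control-heavy case roughly breaks even in wall-clock time, while the looping case keeps most of its gain. The implied absolute maximum works out to about 550 MHz.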