Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082822
J. Olivito, A. Delmas, J. Resano
This article presents the hardware design of a specific processor for the Blokus Duo game. The design is an evolution of our previous work, presented at the ICFPT'13 Design Competition. To improve performance, we designed parallel hardware blocks that speed up the most time-consuming tasks and added techniques that reduce the search space. As a result, we can process a board six times faster than in our previous version, and we prune the game tree much more efficiently.
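The abstract does not say which pruning technique the processor uses. As a familiar illustration of what pruning a game tree means, the sketch below applies alpha-beta pruning to a toy tree; the tree, the scores, and the `stats` counter are all hypothetical and are not taken from the paper.

```python
def alphabeta(node, alpha, beta, maximizing, stats):
    """Minimax with alpha-beta pruning over a nested-list game tree.

    A node is either an int (leaf score) or a list of child nodes.
    `stats` counts how many leaves were actually evaluated.
    """
    if isinstance(node, int):
        stats["leaves"] += 1
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False, stats))
            alpha = max(alpha, best)
            if alpha >= beta:  # the opponent will never allow this branch
                break
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True, stats))
        beta = min(beta, best)
        if alpha >= beta:  # the maximizer already has a better option
            break
    return best


# Hypothetical two-ply tree: three moves, each with two replies.
toy_tree = [[3, 5], [2, 9], [0, 1]]
stats = {"leaves": 0}
value = alphabeta(toy_tree, float("-inf"), float("inf"), True, stats)
print(value, stats["leaves"])  # 3 4
```

Only four of the six leaves are evaluated here, which is the kind of saving a more efficient pruning scheme aims to increase.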
Title: An improved FPGA-based specific processor for Blokus Duo. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 366-369.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082784
Mrinal J. Sarmah, Syed Azeemuddin
Channel bonding is a mechanism for synchronizing serial communication channels in applications that require high data rates and bandwidth. Applications that demand very high bandwidth, such as 400G Ethernet, cannot reach such rates over a single high-speed serial I/O channel; aggregating multiple communication links into a single communication channel makes these ultra-high speeds realizable. One challenge in aggregating communication links is eliminating the serial data skew introduced by non-identical trace lengths of the serial links. Various techniques exist to de-skew lanes on the receive side of a high-speed serial transceiver. This paper presents a novel approach to channel bonding that optimizes area, power, and initialization time and yields better performance. The idea is based on a delay-based model and explores the possibility of performing channel bonding in a centralized rather than a distributed way.
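To make the de-skew problem concrete, here is a minimal software model, not the circuit from the paper. It assumes each lane carries a hypothetical alignment marker `A`, and a single function (standing in for a centralized controller) measures where the marker arrives on every lane and delays the early lanes to match the latest one. The marker symbol, the idle symbol `.`, and the lane contents are invented for illustration.

```python
MARKER = "A"  # hypothetical alignment character sent on every lane
IDLE = "."    # hypothetical idle symbol used to model a delay line


def deskew(lanes):
    """Align lanes on their first MARKER.

    Returns the aligned lanes (trimmed to a common length) and the
    per-lane delay, in symbols, that was applied.
    """
    positions = [lane.index(MARKER) for lane in lanes]
    latest = max(positions)
    # Each lane is delayed by its distance from the slowest lane.
    delays = [latest - p for p in positions]
    delayed = [IDLE * d + lane for d, lane in zip(delays, lanes)]
    length = min(len(lane) for lane in delayed)
    return [lane[:length] for lane in delayed], delays


# Three lanes whose markers arrive 2, 0 and 1 symbols late.
lanes = ["..A0123", "A4567..", ".A89ab."]
aligned, delays = deskew(lanes)
print(delays)   # [0, 2, 1]
print(aligned)  # ['..A0123', '..A4567', '..A89ab']
```

After alignment, the marker sits in the same column on every lane, so the payload symbols that follow it can be read across lanes as one wide word.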
Title: A circuit to synchronize high speed serial communication channel. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 239-242.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082809
Anthony Milton, D. Kearney, S. Wong, S. Lemmo
The FPGA platform is increasingly faced with a multitude of competitor parallel computing architectures such as GPUs and various multicore variants. These competitor parallel platforms are attractive because they involve a software-based development flow, resulting in greater developer productivity. While it has been argued that FPGA applications written in traditional hardware description languages (HDLs) may require nearly an order of magnitude more development time than corresponding parallel software development (PSD) for multi-core CPU or GPU, modern approaches to hardware design that drastically increase development productivity are beginning to gain traction. The approach adopted in this work is the use of the high-level HDL Bluespec. This paper compares Bluespec FPGA development with PSD for multi-core CPU and GPU by detailing the experiences of a project that involved developing various components of a complex multi-object visual tracking algorithm for each of these platforms. We found that the development time using Bluespec was competitive with the combined development time for the CPU and GPU versions, but that limitations of the Bluespec development chain (such as the lack of native floating-point support) and component integration issues with the FPGA design were areas of significant weakness for the FPGA platform. Finally, we present performance results for the various implementations of the visual tracking algorithm developed in this work, and show that the FPGA platform has the potential to exceed the performance of the CPU and GPU platforms when implementation issues can be overcome for this application.
Title: Development productivity in implementing a complex heterogeneous computing application. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 322-325.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082795
Jian Yan, Junqi Yuan, Y. Wang, P. Leong, Lingli Wang
This paper presents a parameterized system-level design framework that enables rapid and powerful research on hybrid multicore architecture exploration and hardware/software co-design. The framework comprises a component-based hardware design and an application compiler, which make it easy for a designer to build stream-oriented applications on FPGA-based hybrid multicore architectures. The high modularity and parameterization of the framework support fast multicore architecture exploration across different topologies, routing schemes, processor types, customized hardware processing units, and memory system organizations. The compiler tool chain is used to map C/C++ based applications onto the soft processing units. Experimental results targeting the JPEG encoding application demonstrate the feasibility and performance improvement of this framework.
Title: Design space exploration for FPGA-based hybrid multicore architecture. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 280-281.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082768
Jiliang Zhang, G. Qu
This survey reviews the security and trust issues related to FPGA-based systems from the market perspective. For each party involved in FPGA supply and demand, we show the security and trust problems they need to be aware of and the solutions that are available.
Title: A survey on security and trust of FPGA-based systems. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 147-152.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082802
Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, Shakaib A. Gursal
In this work, we propose an efficient scheduler and intelligent memory manager known as AMMC (Advanced Multi-Core Memory Controller), which proficiently handles data movement and computational tasks. The proposed AMMC system improves performance by managing complex data transfers at run time and scheduling multiple cores without the intervention of a control processor or an operating system. AMMC has been coupled with a heterogeneous system that provides both general-purpose cores and application-specific accelerators. The AMMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor-based system that has been integrated with the Xilkernel operating system. Results show that the AMMC-based multi-core system consumes 48% fewer hardware resources and 27.9% less on-chip power and achieves a 6.8x speed-up compared to the MicroBlaze-based multi-core system.
Title: AMMC: Advanced Multi-Core Memory Controller. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 292-295.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082811
Guohao Dai, Yi Shan, Fei Chen, Yu Wang, Kun Wang, Huazhong Yang
The popularization and application of Cloud Computing have provided a new approach for users to obtain computing resources in recent years. Meanwhile, owing to advantages including programmability and power efficiency, FPGAs have been applied to custom computing in many domains. Previous work has made FPGA resources available in the cloud environment. However, the effective usage of FPGAs in the cloud requires efficient online task scheduling: properly assigning as many tasks from different tenants as possible to the FPGAs. In this paper, we propose a benefit-based scheduling metric to evaluate the task assignment. Based on the metric, we accelerate task execution according to our benefit-based scheduling algorithms. By applying our benefit-based scheduling metric to a real OpenStack-based cloud environment, 60.32% of computing resources are saved compared with the conventional throughput-based metric. Furthermore, a Replacement-Considering algorithm, which considers task replacement, is proposed to take the characteristics of the cloud into account. The results show that our FPGA-accelerated cloud system is 1.386 times faster than with the previous algorithm.
Title: Online scheduling for FPGA computation in the Cloud. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 330-333.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082823
Ehsan Qasemi, Amir Samadi, Mohammad H. Shadmehr, Bardia Azizian, Sajjad Mozaffari, Amir Shirian, B. Alizadeh
In this paper we present our hardware architecture for a highly scalable, shared-memory, Monte-Carlo Tree Search (MCTS) based Blokus Duo solver. In the proposed architecture, each MCTS solver module contains a centralized MCTS controller, which can also be implemented using soft-cores, with true dual-port access to a shared memory called main memory, and a multitude of MCTS engines, each containing several simulation cores. Consequently, this highly flexible architecture guarantees optimized performance of the solver regardless of the actual FPGA platform used. Our design is inspired by parallel MCTS algorithms and is potentially capable of extracting the maximum possible parallelism from the MCTS algorithm. In addition, we combine MCTS with pruning heuristics to increase both memory and LE utilization. The results show that our architecture can run at up to 50 MHz on the DE2-115 platform, where each simulation core requires 11K LEs and the MCTS controller requires 10K LEs.
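For readers unfamiliar with MCTS, the selection step is commonly driven by the UCB1 rule, which balances a child's average reward against how rarely it has been tried. The abstract does not state which selection policy the solver uses, so the sketch below is a generic software illustration with hypothetical reward and visit counts, not the paper's hardware logic.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of one child; unvisited children are tried first."""
    if visits == 0:
        return float("inf")
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children):
    """Return the index of the child with the highest UCB1 score.

    `children` is a list of (total_reward, visits) pairs.
    """
    parent_visits = sum(visits for _, visits in children)
    scores = [ucb1(r, v, parent_visits) for r, v in children]
    return scores.index(max(scores))


# Hypothetical statistics gathered by earlier simulations.
print(select_child([(6.0, 10), (3.0, 4), (0.0, 0)]))  # 2: never visited
print(select_child([(6.0, 10), (3.0, 4)]))            # 1: fewer visits, higher mean
```

In a parallel design such as the one described, many simulations feed these statistics at once; how the paper's controller merges them is not covered by this sketch.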
Title: Highly scalable, shared-memory, Monte-Carlo tree search based Blokus Duo Solver on FPGA. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 370-373.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082806
Young-Hwan Park, K. Prasad, Yeonbok Lee, Kitaek Bae, Ho Yang
In this paper, we propose a scalable radio processor architecture targeting an OFDM-based wireless modem. The architecture is based on the coarse-grained reconfigurable array (CGRA), which provides programmable and flexible accelerators by reconfiguring hardware resources at run time. In addition, the architecture maximizes data parallelism by implementing 32-way SIMD operations. Other features considered in the current implementation include a mini-core structure, dedicated vector memory, and a simplified datapath. The proposed architecture is compared to the preceding 4×4 CGRA processor and evaluated with several communication kernels in terms of cycles, area, and power. The implementation results show that the proposed architecture achieves 3.6 times better cycle performance with 2 times better scheduling, at the cost of double the area, resulting in 1495 cycles for a complex 2K-FFT, which, to the best of our knowledge, is the best DSP cycle count reported to date. The synthesis results with a 32nm library also show that the proposed architecture is operational at 800 MHz and is capable of running wireless applications at a maximum of 128 GOPS.
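The reported figures can be related with simple arithmetic. The sketch below assumes that the 1495-cycle 2K-FFT runs at the stated 800 MHz clock and that the 128 GOPS peak is reached at that same clock; both are assumptions made for illustration, since the abstract does not tie these numbers together explicitly.

```python
CLOCK_HZ = 800e6   # operating frequency stated in the abstract
FFT_CYCLES = 1495  # cycles for a complex 2K-FFT, from the abstract
PEAK_GOPS = 128    # peak throughput, from the abstract

# Time for one 2K-FFT, assuming it runs at the full 800 MHz clock.
fft_time_us = FFT_CYCLES / CLOCK_HZ * 1e6

# Operations that must complete per cycle to reach the peak rate.
ops_per_cycle = PEAK_GOPS * 1e9 / CLOCK_HZ

print(f"2K-FFT time: {fft_time_us:.2f} us")          # 2K-FFT time: 1.87 us
print(f"Peak ops per cycle: {ops_per_cycle:.0f}")    # Peak ops per cycle: 160
```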
Title: Scalable radio processor architecture for modern wireless communications. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 310-313.
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082760
Charles Eric LaForest, J. Anderson, J. Gregory Steffan
Implementing systems on FPGA soft-processors, rather than as custom hardware, eases and accelerates the development process, but at the cost of a great reduction in performance. Orthogonal to limitations in parallelism or clock frequency, this reduction in performance primarily originates in the intrinsic addressing and flow-control overheads of scalar microprocessors, which expend a considerable number of cycles interleaving address calculations and branch decisions within the actual useful work. We present an improved FPGA soft-processor architecture which statically overlaps "overhead" computations and executes them in parallel with the "useful" computations, significantly reducing the number of processor cycles needed to execute sequential programs, while reducing maximum clock frequency to 0.939x of its original value. In addition to eliminating almost all overhead computations, the proposed soft-processor can operate at 500 MHz on the Altera Stratix IV FPGA - 0.909x of the absolute maximum rating. Combined, the high speed and execution efficiency increase the range of FPGA designs amenable to soft-processors rather than custom hardware. We evaluate our cycle count improvements with multiple benchmarks, achieving speedups ranging from 1.07x for control-heavy code, to 1.92x for looping code, never performing worse than the original sequential code, and always performing better than a totally unrolled loop.
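Because the cycle-count speedups come with a lower maximum clock frequency, the two effects can be combined into a rough wall-clock estimate. The sketch below assumes the baseline runs at its original maximum frequency and the improved design at 0.939x of it, and that the reported speedups are purely in cycle counts; the function name and this framing are illustrative and not taken from the paper.

```python
FREQ_RATIO = 0.939  # new maximum clock relative to the original, from the abstract


def net_speedup(cycle_speedup, freq_ratio=FREQ_RATIO):
    """Wall-clock speedup when cycles shrink but the clock slows down.

    time = cycles / frequency, so the speedups multiply.
    """
    return cycle_speedup * freq_ratio


for label, cycles in [("control-heavy", 1.07), ("looping", 1.92)]:
    print(f"{label}: {net_speedup(cycles):.3f}x")
# control-heavy: 1.005x
# looping: 1.803x
```

Under these assumptions, control-heavy code roughly breaks even in wall-clock time, while looping code keeps most of its gain.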
Title: Approaching overhead-free execution on FPGA soft-processors. Published in: 2014 International Conference on Field-Programmable Technology (FPT), pp. 99-106.