
2014 International Conference on Field-Programmable Technology (FPT): Latest Articles

An improved FPGA-based specific processor for Blokus Duo
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082822
J. Olivito, A. Delmas, J. Resano
This article presents the hardware design of a specific processor for the Blokus Duo game. The design is an evolution of our previous work, presented in the ICFPT'13 Design Competition. To improve performance, we designed parallel hardware blocks that speed up the most time-consuming tasks and added techniques that reduce the search space. As a result, we can process a board six times faster than in our previous version, and we prune the game tree much more efficiently.
Pages: 366-369
Citations: 0
A circuit to synchronize high speed serial communication channel
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082784
Mrinal J. Sarmah, Syed Azeemuddin
Channel bonding is a mechanism for synchronizing serial communication channels in applications that need high data rates and bandwidth. For applications that demand very high bandwidth, such as 400G Ethernet, a single high-speed serial I/O channel cannot reach the required rate; aggregating multiple communication links into a single communication channel makes such ultra-high speeds realizable. One challenge in aggregating communication links is eliminating the serial data skew introduced by non-identical trace lengths of the serial links. Various techniques exist to de-skew lanes on the receive side of a high-speed serial transceiver. This paper presents a novel approach to channel bonding that optimizes area, power, and initialization time and yields better performance. The idea is based on a delay-based model and explores the possibility of performing channel bonding in a centralized rather than a distributed way.
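As a rough illustration of the de-skew problem described in this abstract, the sketch below aligns several lanes in software by locating a shared alignment marker in each lane and delaying the early lanes to match the latest one. It is not the authors' circuit: the marker value, the idle word, and the function names `find_marker` and `deskew` are hypothetical, and a single function that sees all lanes stands in for the "centralized" idea only loosely.

```python
from typing import List, Sequence


def find_marker(lane: Sequence[int], marker: int) -> int:
    """Return the index at which the alignment marker first appears in a lane."""
    for index, word in enumerate(lane):
        if word == marker:
            return index
    raise ValueError("alignment marker not found in lane")


def deskew(lanes: List[List[int]], marker: int, idle: int = 0) -> List[List[int]]:
    """Delay each lane so that the marker sits at the same index in every lane.

    Assumption: every lane carries the same marker word exactly once near the
    start of the stream, and skew is a whole number of words.
    """
    offsets = [find_marker(lane, marker) for lane in lanes]
    latest = max(offsets)  # the slowest lane sets the common reference point
    # Pad each earlier lane with idle words equal to its distance from the slowest lane.
    return [[idle] * (latest - offset) + list(lane)
            for lane, offset in zip(lanes, offsets)]


if __name__ == "__main__":
    MARKER = 0xBC  # hypothetical marker word chosen for this example
    skewed = [
        [MARKER, 1, 2],
        [0, MARKER, 1, 2],
        [0, 0, MARKER, 1, 2],
    ]
    for lane in deskew(skewed, MARKER):
        print(lane)
```

Computing every delay from one shared view of the offsets, rather than having each lane negotiate with its neighbours, is the design choice this sketch is meant to hint at.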
Pages: 239-242
Citations: 2
Development productivity in implementing a complex heterogeneous computing application
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082809
Anthony Milton, D. Kearney, S. Wong, S. Lemmo
The FPGA platform increasingly faces a multitude of competing parallel computing architectures, such as GPUs and various multicore variants. These competing platforms are attractive because they use a software-based development flow, which results in greater developer productivity. While it has been argued that FPGA applications written in traditional hardware description languages (HDLs) may require nearly an order of magnitude more development time than corresponding parallel software development (PSD) for multi-core CPU or GPU, modern approaches to hardware design that drastically increase development productivity are beginning to gain traction. One such approach, adopted in this work, is the high-level HDL Bluespec. This paper compares Bluespec FPGA development with PSD for multi-core CPU and GPU by detailing the experiences of a project that developed various components of a complex multi-object visual tracking algorithm for each of these platforms. We found that the development time using Bluespec was competitive with the combined development time for the CPU and GPU versions, but that limitations of the Bluespec development chain (such as the lack of native floating-point support) and component integration issues with the FPGA design were areas of significant weakness for the FPGA platform. Finally, we present performance results for the various implementations of the visual tracking algorithm developed in this work, and show that the FPGA platform has the potential to exceed the performance of the CPU and GPU platforms when implementation issues can be overcome for this application.
Pages: 322-325
Citations: 0
Design space exploration for FPGA-based hybrid multicore architecture
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082795
Jian Yan, Junqi Yuan, Y. Wang, P. Leong, Lingli Wang
This paper presents a parameterized system-level design framework that enables rapid exploration of hybrid multicore architectures and hardware/software co-design. The framework comprises a component-based hardware design and an application compiler, which make it easy for a designer to build stream-oriented applications on FPGA-based hybrid multicore architectures. The high modularity and parameterization of the framework support fast multicore architecture exploration across different topologies, routing schemes, processor types, customized hardware processing units, and memory system organizations. The compiler tool chain maps C/C++ based applications onto the soft processing units. Experimental results targeting a JPEG encoding application demonstrate the feasibility and performance improvement of this framework.
Pages: 280-281
Citations: 0
A survey on security and trust of FPGA-based systems
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082768
Jiliang Zhang, G. Qu
This survey reviews the security and trust issues related to FPGA-based systems from the market perspective. For each party involved in FPGA supply and demand, we show the security and trust problems they need to be aware of and the solutions that are available.
Pages: 147-152
Citations: 27
AMMC: Advanced Multi-Core Memory Controller
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082802
Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, Shakaib A. Gursal
In this work, we propose an efficient scheduler and intelligent memory manager known as AMMC (Advanced Multi-Core Memory Controller), which handles both data movement and computational tasks. The proposed AMMC system improves performance by managing complex data transfers at run time and scheduling multiple cores without the intervention of a control processor or an operating system. AMMC has been coupled with a heterogeneous system that provides both general-purpose cores and application-specific accelerators. The AMMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. Its performance is compared with that of a microprocessor-based system integrated with the Xilkernel operating system. Results show that the AMMC-based multi-core system consumes 48% fewer hardware resources and 27.9% less on-chip power, and achieves a 6.8x speed-up compared to the MicroBlaze-based multi-core system.
Pages: 292-295
Citations: 7
Online scheduling for FPGA computation in the Cloud
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082811
Guohao Dai, Yi Shan, Fei Chen, Yu Wang, Kun Wang, Huazhong Yang
In recent years, the popularization and application of Cloud Computing have given users a new way to obtain computing resources. Meanwhile, thanks to advantages that include programmability and power efficiency, FPGAs have been applied to custom computing in many domains. Previous work has made FPGA resources available in the cloud environment. However, the effective usage of FPGAs in the cloud requires efficient online task scheduling: properly assigning as many tasks from different tenants as possible to the FPGAs. In this paper, we propose a benefit-based scheduling metric to evaluate the task assignment. Based on the metric, we accelerate task execution according to our benefit-based scheduling algorithms. By applying our benefit-based scheduling metric to a real OpenStack-based cloud environment, 60.32% of computing resources are saved compared with the conventional throughput-based metric. Furthermore, a Replacement-Considering algorithm, which considers task replacement, is proposed to take the characteristics of the cloud into account. The results show that our FPGA-accelerated cloud system is 1.386 times faster than with the previous algorithm.
Pages: 330-333
Citations: 20
Highly scalable, shared-memory, Monte-Carlo tree search based Blokus Duo Solver on FPGA
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082823
Ehsan Qasemi, Amir Samadi, Mohammad H. Shadmehr, Bardia Azizian, Sajjad Mozaffari, Amir Shirian, B. Alizadeh
In this paper we present our hardware architecture for a highly scalable, shared-memory, Monte-Carlo Tree Search (MCTS) based Blokus Duo solver. In the proposed architecture, each MCTS solver module contains a centralized MCTS controller, which can also be implemented using soft-cores, with true dual-port access to a shared memory called main memory, and a multitude of MCTS engines, each containing several simulation cores. Consequently, this highly flexible architecture guarantees the optimized performance of the solver regardless of the actual FPGA platform used. Our design is inspired by parallel MCTS algorithms and is potentially capable of obtaining the maximum possible parallelism from the MCTS algorithm. In addition, we combine MCTS with pruning heuristics to increase both memory and LE utilization. The results show that our architecture can run at up to 50MHz on the DE2-115 platform, where each simulation core requires 11K LEs and the MCTS controller requires 10K LEs.
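For readers unfamiliar with MCTS, the sketch below shows the selection and backpropagation steps that a controller of this kind would typically coordinate, using the standard UCB1 rule. The abstract does not say which selection policy the authors use, so UCB1, the `Node` class, and the exploration constant are assumptions made for illustration; the simulation (playout) step that the hardware simulation cores would perform is left out.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One game-tree node; only the statistics needed for UCB1 are kept."""
    visits: int = 0
    wins: float = 0.0
    children: List["Node"] = field(default_factory=list)


def ucb1(parent_visits: int, child: Node, c: float = math.sqrt(2)) -> float:
    """Standard UCB1 score: average reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try an unvisited child first
    exploit = child.wins / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(node: Node) -> Node:
    """Pick the child with the highest UCB1 score."""
    return max(node.children, key=lambda child: ucb1(node.visits, child))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one playout result to every node on the path from root to leaf.

    Simplification: the same reward is credited at every level; a two-player
    solver would flip the reward between alternating levels.
    """
    for node in path:
        node.visits += 1
        node.wins += reward


if __name__ == "__main__":
    strong, weak = Node(visits=5, wins=4.0), Node(visits=5, wins=1.0)
    root = Node(visits=10, wins=5.0, children=[strong, weak])
    chosen = select_child(root)
    backpropagate([root, chosen], reward=1.0)
    print(chosen is strong, root.visits, chosen.visits)
```

In a shared-memory design, these per-node statistics are the data that parallel engines would read and update, which is why the memory access path matters for scalability.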
Pages: 370-373
Citations: 4
Scalable radio processor architecture for modern wireless communications
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082806
Young-Hwan Park, K. Prasad, Yeonbok Lee, Kitaek Bae, Ho Yang
In this paper, we propose a scalable radio processor architecture targeting an OFDM-based wireless modem. The architecture is based on the coarse-grained reconfigurable array (CGRA), which provides programmable and flexible accelerators by reconfiguring hardware resources at run time. In addition, the architecture maximizes data parallelism by implementing 32-way SIMD operations. Other features of the current implementation include a mini-core structure, dedicated vector memory, and a simplified datapath. The proposed architecture is compared to the previous 4×4 CGRA processor and evaluated with several communication kernels in terms of cycles, area, and power. The implementation results show that the proposed architecture achieves 3.6 times better cycle performance with 2 times better scheduling, at the cost of double the area, resulting in 1495 cycles for a complex 2K-FFT; to the best of our knowledge, this is the best DSP cycle count reported to date. Synthesis results with a 32nm library also show that the proposed architecture operates at 800MHz and can deliver a maximum of 128 GOPS for wireless applications.
Pages: 310-313
Citations: 1
Approaching overhead-free execution on FPGA soft-processors
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082760
Charles Eric LaForest, J. Anderson, J. Gregory Steffan
Implementing systems on FPGA soft-processors, rather than as custom hardware, eases and accelerates the development process, but at the cost of a great reduction in performance. Orthogonal to limitations in parallelism or clock frequency, this reduction in performance primarily originates in the intrinsic addressing and flow-control overheads of scalar microprocessors, which expend a considerable number of cycles interleaving address calculations and branch decisions within the actual useful work. We present an improved FPGA soft-processor architecture which statically overlaps "overhead" computations and executes them in parallel with the "useful" computations, significantly reducing the number of processor cycles needed to execute sequential programs, while reducing maximum clock frequency to 0.939x of its original value. In addition to eliminating almost all overhead computations, the proposed soft-processor can operate at 500 MHz on the Altera Stratix IV FPGA - 0.909x of the absolute maximum rating. Combined, the high speed and execution efficiency increase the range of FPGA designs amenable to soft-processors rather than custom hardware. We evaluate our cycle count improvements with multiple benchmarks, achieving speedups ranging from 1.07x for control-heavy code, to 1.92x for looping code, never performing worse than the original sequential code, and always performing better than a totally unrolled loop.
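The abstract reports cycle-count speedups and a clock-frequency reduction separately. A quick way to relate the two is to multiply them, which gives an estimated wall-clock speedup. The sketch below does that arithmetic; it assumes that both the original and the improved processor run at their respective maximum clock frequencies, which the abstract does not state explicitly, and the function name `net_speedup` is made up for this example.

```python
def net_speedup(cycle_speedup: float, freq_ratio: float) -> float:
    """Estimate wall-clock speedup from a cycle-count speedup and a clock ratio.

    time = cycles / frequency, so
    old_time / new_time = (old_cycles / new_cycles) * (new_freq / old_freq).
    """
    return cycle_speedup * freq_ratio


# Figures quoted in the abstract.
FREQ_RATIO = 0.939  # new maximum clock frequency relative to the original
CYCLE_SPEEDUPS = {"control-heavy code": 1.07, "looping code": 1.92}

if __name__ == "__main__":
    for label, speedup in CYCLE_SPEEDUPS.items():
        print(f"{label}: {net_speedup(speedup, FREQ_RATIO):.3f}x")
```

Under that assumption the estimates come to about 1.005x for control-heavy code and 1.803x for looping code, so the lower clock frequency would nearly cancel the gain at the low end while leaving most of it at the high end.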
在FPGA软处理器上实现系统,而不是作为定制硬件,简化并加速了开发过程,但代价是性能大大降低。与并行性或时钟频率的限制无关,这种性能的降低主要源于标量微处理器的固有寻址和流量控制开销,在实际有用的工作中,它们在交叉地址计算和分支决策中花费了相当多的周期。我们提出了一种改进的FPGA软处理器架构,它静态地重叠“开销”计算,并与“有用”计算并行执行,显著减少执行顺序程序所需的处理器周期数,同时将最大时钟频率降低到原始值的0.939x。除了消除几乎所有的开销计算外,所提出的软处理器可以在Altera Stratix IV FPGA上以500 MHz的频率工作-绝对最大额定的0.909倍。结合起来,高速度和执行效率增加了适合软处理器而不是定制硬件的FPGA设计范围。我们用多个基准测试来评估我们的循环计数改进,实现了从重控制代码的1.07倍到循环代码的1.92倍的加速,性能从来没有比原始顺序代码差,并且总是比完全展开的循环表现得更好。
Charles Eric LaForest, J. Anderson, J. Gregory Steffan. "Approaching overhead-free execution on FPGA soft-processors." 2014 International Conference on Field-Programmable Technology (FPT), pp. 99-106. Published 2014-12-01. DOI: 10.1109/FPT.2014.7082760 (https://doi.org/10.1109/FPT.2014.7082760)
Citations: 3
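The abstract for this entry quotes cycle-count speedups (1.07x to 1.92x) alongside a clock-frequency reduction (to 0.939x of the original). A rough way to see how the two combine is sketched below. It assumes, and the abstract does not state this explicitly, that the speedups are measured in cycles against the original design and that wall-clock time is simply cycles divided by clock frequency. The function names `net_speedup` and `implied_abs_max_mhz` are illustrative only and do not come from the paper.

```python
# Figures quoted in the abstract; everything else here is an assumption.
CLOCK_RATIO = 0.939            # new maximum clock / original maximum clock
CYCLE_SPEEDUPS = {             # cycle-count speedups reported per code type
    "control-heavy": 1.07,
    "looping": 1.92,
}
FMAX_MHZ = 500.0               # reported operating frequency on Stratix IV
FRACTION_OF_ABS_MAX = 0.909    # reported fraction of the absolute maximum rating


def net_speedup(cycle_speedup: float, clock_ratio: float = CLOCK_RATIO) -> float:
    """Wall-clock speedup under the assumed model time = cycles / frequency.

    old_time / new_time = (old_cycles / new_cycles) * (new_freq / old_freq)
    """
    return cycle_speedup * clock_ratio


def implied_abs_max_mhz(fmax_mhz: float = FMAX_MHZ,
                        fraction: float = FRACTION_OF_ABS_MAX) -> float:
    """Absolute maximum rating implied by the two quoted numbers."""
    return fmax_mhz / fraction


if __name__ == "__main__":
    for kind, speedup in CYCLE_SPEEDUPS.items():
        print(f"{kind}: net speedup ~ {net_speedup(speedup):.3f}x")
    print(f"implied absolute maximum ~ {implied_abs_max_mhz():.0f} MHz")
```

Under these assumptions, the control-heavy case roughly breaks even in wall-clock terms, while the looping case keeps most of its cycle-count gain.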