
Latest publications: 2021 International Conference on Field-Programmable Technology (ICFPT)

A streaming hardware architecture for real-time SIFT feature extraction
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609932
Hector A. Li Sanchez, A. George
The Scale-Invariant Feature Transform (SIFT) is a feature extractor that serves as a key step in many computer-vision pipelines. Real-time operation based on a software-only approach is often infeasible, but FPGAs can be employed to parallelize execution and accelerate the application to meet latency requirements. In this study, we present a stream-based hardware acceleration architecture for SIFT feature extraction. Using a novel strategy to store the pixels required for descriptor computation, the execution time needed to generate SIFT descriptors is greatly reduced relative to previous designs. This strategy also enables further reduction of the execution time by introducing multiple processing elements (PEs) that compute several SIFT descriptors in parallel. Additionally, the proposed architecture supports keypoint detection at an arbitrary number of octaves and allows for runtime configuration of various parameters. An FPGA implementation targeting the Xilinx Zynq-7045 system-on-chip (SoC) device is deployed to demonstrate the efficiency of the proposed architecture. On the target hardware, the resulting system is capable of processing images with a resolution of 1280 × 720 pixels at up to 150 FPS while maintaining modest resource utilization.
Citations: 1
FastCGRA: A Modeling, Evaluation, and Exploration Platform for Large-Scale Coarse-Grained Reconfigurable Arrays
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609928
Su Zheng, Kaisen Zhang, Yaoguang Tian, Wenbo Yin, Lingli Wang, Xuegong Zhou
Coarse-Grained Reconfigurable Arrays (CGRAs) combine high hardware efficiency with sufficient flexibility for domain-specific applications, which makes them suitable for fast-evolving fields such as neural network acceleration and edge computing. To keep pace with this fast evolution, we propose FastCGRA, a modeling, mapping, and exploration platform for large-scale CGRAs. FastCGRA supports hierarchical architecture description and automatic switch-module generation. Connectivity-aware packing and graph-partitioning algorithms are designed to reduce the complexity of placement and routing. The graph-homomorphism placement algorithm in FastCGRA enables efficient placement on large-scale CGRAs. The packing and placement algorithms cooperate with a negotiation-based routing algorithm to form an integrated mapping procedure. FastCGRA supports the modeling and mapping of large-scale CGRAs with significantly higher placement and routing efficiency than existing platforms, and the automatic switch-module generation method reduces the complexity of CGRA interconnection design. With these features, FastCGRA can boost the exploration of large-scale CGRAs.
Citations: 3
In-Storage Computation of Histograms with differential privacy
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609899
Andrei Tosa, A. Hangan, G. Sebestyen, Z. István
Network-attached Smart Storage is becoming increasingly common in data analytics applications. It relies on processing elements, such as FPGAs, close to the storage medium to offload compute-intensive operations, reducing data movement across distributed nodes in the system. As a result, it can offer outstanding performance and energy efficiency. Modern data analytics systems are not only becoming more distributed, they are also increasingly focused on privacy-policy compliance. This means that, in the future, Smart Storage will have to offload more and more privacy-related processing. In this work, we explore how the computation of differentially private (DP) histograms, a basic building block of privacy-preserving analytics, can be offloaded to FPGAs. By performing DP aggregation on the storage side, untrusted clients can be allowed to query the data in aggregate form without risking the leakage of personally identifiable information. We prototype our idea by extending an FPGA-based distributed key-value store with three new components. First, a histogram module that processes values at 100 Gbps line rate. Second, a random noise generator that adds noise to the final histogram according to the rules dictated by DP. Third, a mechanism that limits the rate at which key-value pairs can be used in histograms, to stay within the DP privacy budget.
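The DP aggregation the abstract describes follows the standard Laplace mechanism; a minimal host-side sketch in Python is shown below (the paper's noise generator runs in FPGA logic; the `dp_histogram` function and its parameters are illustrative, not the authors' interface):

```python
import numpy as np

def dp_histogram(values, bins, epsilon, rng=None):
    """Differentially private histogram via the Laplace mechanism.

    Assuming each individual contributes one value, the L1 sensitivity
    of the bin counts is 1, so Laplace noise with scale 1/epsilon
    provides epsilon-differential privacy.
    """
    rng = np.random.default_rng(rng)
    counts, edges = np.histogram(values, bins=bins)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise, edges

# Smaller epsilon => larger noise => stronger privacy, lower accuracy.
noisy_counts, bin_edges = dp_histogram([0.1, 0.2, 0.7, 0.8],
                                       bins=2, epsilon=1.0, rng=0)
```

The rate-limiting component mentioned in the abstract corresponds to tracking how often each record is touched, so that the cumulative privacy loss stays within the budget.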
Citations: 1
Scalable and Flexible High-Performance In-Network Processing of Hash Joins in Distributed Databases
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609804
John M. Wirth, Jaco A. Hofmann, Lasse Thostrup, Carsten Binnig, Andreas Koch
Programmable switches allow offloading specific processing tasks into the network and promise multi-Tbit/s throughput. One major goal when moving computation into the network is typically to reduce the volume of network traffic and thus improve overall performance. Accordingly, programmable switches are increasingly used, in research as well as in industry, for various scenarios including statistics gathering, in-network consensus protocols, and more. However, currently available programmable switches suffer from several practical limitations. One important restriction is the limited amount of available memory, making them unsuitable for stateful operations such as hash joins in distributed databases. In previous work, an FPGA-based in-network hash join accelerator was presented, initially using DDR-DRAM to hold the state. In a later iteration, the hash table was moved to on-chip HBM-DRAM to improve the performance even further. However, while very fast, the size of the joins in this setup was limited by the relatively small amount of available HBM. In this work, we heterogeneously combine DDR-DRAM and HBM memories to support larger joins while benefiting from the far faster and more parallel HBM accesses. In this manner, we improve performance by a factor of 3x compared to the previous HBM-based work. We also introduce additional configuration parameters, supporting a more flexible adaptation of the underlying hardware architecture to the different join operations required by a concrete use case.
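For background, the build/probe structure of the hash join that such an accelerator implements in hardware can be sketched in a few lines of Python (function and parameter names are illustrative; the accelerator's stateful hash table is what consumes the DRAM/HBM capacity discussed above):

```python
def hash_join(build, probe, build_key, probe_key):
    """Classic two-phase hash join over sequences of tuples.

    Build phase: hash the (typically smaller) build relation on its key.
    Probe phase: stream the probe relation and emit matching row pairs.
    """
    table = {}
    for row in build:
        # One hash-table entry per key; duplicates share a bucket list.
        table.setdefault(row[build_key], []).append(row)
    for row in probe:
        for match in table.get(row[probe_key], []):
            yield match, row

# Join on the second column of each relation.
pairs = list(hash_join([("a", 1), ("b", 2)],
                       [("x", 1), ("y", 1), ("z", 3)],
                       build_key=1, probe_key=1))
```

The probe phase streams, which is why the join maps well onto a network path: only the build-side state must reside in accelerator memory.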
Citations: 3
Efficient Physical Page Migrations in Shared Virtual Memory Reconfigurable Computing Systems
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609831
Torben Kalkhof, Andreas Koch
Shared Virtual Memory (SVM) can considerably simplify application development for FPGA-accelerated computers, as it allows the seamless passing of virtually addressed pointers across the hardware/software boundary. Applications operating on complex pointer-based data structures profit especially from this approach, as SVM can often avoid copying the entire data to FPGA memory and performing pointer relocations in the process. Many FPGA-accelerated computers, especially in a data-center setting, employ PCIe-attached boards that have FPGA-local memory in the form of on-chip HBM or on-board DRAM. Accesses to this local memory are much faster than going to host memory via PCIe. Thus, even in the presence of SVM, it is desirable to be able to move the physical memory pages holding frequently accessed data closer to the compute unit operating on them. This capability is called physical page migration. The main contribution of this work is an open-source framework that adds physical page migration capabilities to SVM on PCIe-attached FPGA cards. We benchmark both fully automatic on-demand and user-managed explicit migration modes, and show that for suitable use cases the performance of migrations not only matches that of conventional DMA copy-based accelerator operation, but may even exceed it by overlapping computation and migration.
Citations: 2
Resource-saving FPGA Implementation of the Satisfiability Problem Solver: AmoebaSATslim
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609882
Yi Yan, H. Amano, M. Aono, Kaori Ohkoda, Shingo Fukuda, Kenta Saito, S. Kasai
The Boolean satisfiability problem (SAT) is an NP-complete combinatorial optimization problem, and fast SAT solvers are useful for various smart-society applications. Since these edge-oriented applications require time-critical control, a high-speed SAT solver on an FPGA is a promising approach. Here the authors propose a novel FPGA implementation, on a Zynq board, of a bio-inspired stochastic local search algorithm called ‘AmoebaSAT’. Previous studies on FPGA-AmoebaSATs tackled relatively small 3-SAT instances with a few hundred variables and found solutions within several milliseconds. These implementations, however, adopted an instance-specific approach, which requires re-synthesis of the FPGA configuration whenever the targeted instance is altered. In this paper, a slimmed version of AmoebaSAT named ‘AmoebaSATslim,’ which omits the most resource-consuming part of the interactions among variables, is proposed. The FPGA-AmoebaSATslim can tackle significantly larger 3-SAT instances, accepting 30,000 variables with 130,800 clauses. It achieves up to approximately 24 times faster execution than the software AmoebaSATslim implemented on a CPU of an x86 server.
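AmoebaSAT's update rules are not given in this abstract; as background on the algorithm family it belongs to, a generic WalkSAT-style stochastic local search over CNF clauses can be sketched as follows (all names are illustrative, and this is not the authors' algorithm):

```python
import random

def local_search_sat(clauses, n_vars, max_flips=10000, seed=0):
    """WalkSAT-style stochastic local search for CNF SAT.

    clauses: list of clauses; each clause is a list of non-zero ints,
    where literal k means variable |k| with polarity sign(k).
    Returns a satisfying assignment (dict var -> bool) or None.
    """
    rng = random.Random(seed)
    assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}

    def satisfied(clause):
        return any((lit > 0) == assign[abs(lit)] for lit in clause)

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign
        clause = rng.choice(unsat)
        # Pure random walk: flip a random variable of an unsatisfied
        # clause; full WalkSAT sometimes picks the greedily best
        # variable instead, and AmoebaSAT uses bio-inspired dynamics.
        var = abs(rng.choice(clause))
        assign[var] = not assign[var]
    return None
```

An FPGA implementation parallelizes exactly the per-variable update step that such a loop serializes, which is why pruning the variable-interaction logic (as AmoebaSATslim does) directly saves hardware resources.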
Citations: 2
Profiling-Based Control-Flow Reduction in High-Level Synthesis
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609816
Austin Liolli, Omar Ragheb, J. Anderson
Control flow in a program can be represented as a directed graph, called the control-flow graph (CFG). Nodes in the graph represent straight-line segments of code (basic blocks), and directed edges between nodes correspond to transfers of control. We present a methodology to selectively reduce control flow by collapsing basic blocks into their parent blocks, revealing increased instruction-level parallelism to a high-level synthesis (HLS) scheduler and thereby raising circuit performance. We evaluate our approach within an HLS tool that allows a C-language software program to be automatically synthesized into a hardware circuit, using the CHStone benchmark suite [1] and targeting an Intel Cyclone V FPGA. For individual benchmark circuits we observe cycle-count reductions of up to 20.7%, and wall-clock time reductions of up to 22.6% and 6% on average.
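The collapsing of basic blocks into their parents can be illustrated with a minimal CFG contraction pass; this is a simplified sketch of the classic transformation, not the authors' profiling-guided implementation, and the `cfg` representation is an assumption:

```python
def merge_blocks(cfg):
    """Collapse a block into its parent whenever the parent has a
    single successor and that successor has a single predecessor,
    i.e. the edge between them carries no branching.

    cfg: dict mapping block name -> (instruction list, successor list)
    """
    changed = True
    while changed:
        changed = False
        preds = {}
        for b, (_, succs) in cfg.items():
            for s in succs:
                preds.setdefault(s, []).append(b)
        for b in list(cfg):
            succs = cfg[b][1]
            if len(succs) == 1:
                s = succs[0]
                if s != b and len(preds.get(s, [])) == 1:
                    # Inline s into b: concatenate the straight-line
                    # code and inherit s's successors.
                    cfg[b] = (cfg[b][0] + cfg[s][0], cfg[s][1])
                    del cfg[s]
                    changed = True
                    break
    return cfg
```

The paper's contribution goes further: profiling identifies which conditional blocks are worth speculatively collapsing, whereas the sketch above only merges unconditional chains.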
Citations: 0
An area-efficient multiply-accumulation architecture and implementations for time-domain neural processing
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609809
Ichiro Kawashima, Yuichi Katori, T. Morie, H. Tamukoh
In our work, a new area-efficient multiply-accumulation scheme for time-domain neural processing, named differential multiply-accumulation, is proposed. The new scheme reduces the hardware resource utilization of multiply-accumulation while suppressing the increase in computational time caused by time-multiplexing. As a result, 2,048 neurons of fully connected CBM and RC-CBM networks were synthesized on a single field-programmable gate array (FPGA).
Citations: 2
Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609907
Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong
The Winograd algorithm can effectively reduce the computational complexity of the convolution operation, and exploiting its parallelism can improve the performance of accelerator architectures on FPGAs. The stride is the number of elements by which the window slides as the filter scans the input feature map. Previous implementations of the Winograd algorithm with a stride of 2 divided the input feature maps into multiple groups of Winograd computations, incurring additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with a stride of 2. This method uses unified Winograd transformation matrices instead of the grouping method to complete the calculation. Therefore, the proposed method can realize 2D and 3D Winograd convolution by nesting 1D Winograd convolutions, just like the stride-1 Winograd algorithm. Winograd transformation matrices for kernel sizes of 3, 5, and 7 are provided. In particular, for convolution with a kernel size of 3, this method reduces the addition operations of the Winograd algorithm by 30.0%-31.5% and removes unnecessary shift operations completely. In addition, we implement the stride-2 Winograd convolution algorithm through template design, realizing pipelining and data reuse. Compared to the state-of-the-art implementation, the proposed method achieves a speedup of 1.24 and reduces resource usage.
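For background, the standard stride-1 F(2,3) Winograd transform that the paper generalizes to stride 2 uses the well-known matrices B^T, G, and A^T; the stride-2 unified matrices themselves are not reproduced in this abstract. A minimal 1D sketch:

```python
import numpy as np

# Standard F(2,3) Winograd transformation matrices (stride 1):
# input transform B^T, filter transform G, output transform A^T.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 1D stride-1 convolution (correlation) of a
    4-tap input d with a 3-tap filter g, using 4 multiplications in
    the transform domain instead of the direct method's 6."""
    return AT @ ((G @ g) * (BT @ d))
```

Nesting this 1D transform along each spatial axis yields the 2D F(2x2,3x3) variant; the paper's contribution is a single set of matrices playing the same role when the window advances by 2.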
{"title":"Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA","authors":"Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong","doi":"10.1109/ICFPT52863.2021.9609907","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609907","url":null,"abstract":"Winograd algorithm can effectively reduce the computational complexity of convolution operation. Effectively using the parallelism of Winograd convolution algorithm can effectively improve the performance of accelerator architectures on FPGA. The stride represents the number of elements that the window slides when filter is scanned on the input feature map. The Winograd algorithm with the stride of 2 implemented in previous studies divided the input feature maps into multiple groups of Winograd algorithms to complete the operations, resulting in additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with the stride of 2. This method uses the unified Winograd transformation matrices instead of the grouping method to complete the calculation. Therefore, the method proposed in this paper can realize 2D Winograd convolution and 3D Winograd convolution by nested 1D Winograd convolution, just like the Winograd convolution algorithm with the stride of 1. In this paper, Winograd transformation matrices with kernel size of 3, 5, and 7 are provided. In particular, for convolution with the kernel of 3, this method reduces the addition operations of Winograd algorithm by 30.0%-31.5% and removes unnecessary shift operations completely. In addition, we implement Winograd convolution algorithm with the stride of 2 through template design, and realize pipeline and data reuse. 
Compared to the state-of-the-art implementation, the proposed method results in a speedup of 1.24 and reduces resource usage.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134072165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
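The stride-1 building block this abstract builds on — nested 1D Winograd transforms — can be illustrated with the classic F(2,3) matrices. The paper's unified stride-2 matrices are not reproduced here; this is only the standard stride-1 baseline, checked against direct correlation:

```python
import numpy as np

# Standard Winograd F(2,3) transformation matrices (stride-1 baseline;
# the unified stride-2 matrices from the paper are not reproduced here).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Compute 2 outputs of a 1D 3-tap correlation from a 4-element
    input tile, using 4 multiplications instead of 6."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([0.5, 1.0, 1.5])        # filter taps
direct = np.array([d[i]*g[0] + d[i+1]*g[1] + d[i+2]*g[2] for i in range(2)])
assert np.allclose(winograd_f23(d, g), direct)
```

Nesting this transform along rows and columns yields the 2D (and, by extension, 3D) case; the paper's contribution is a single set of matrices that plays the same role for stride 2 without splitting the input into groups.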
AMAH-Flex: A Modular and Highly Flexible Tool for Generating Relocatable Systems on FPGAs AMAH-Flex:用于在fpga上生成可重新定位系统的模块化和高度灵活的工具
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609948
Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer
In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when using dynamic function exchange to accommodate changing workloads, partial bitstreams are typically not relocatable. So the runtime environment needs to store all reconfigurable partition/reconfigurable module combinations as separate bitstreams. We present a modular and highly flexible tool (AMAH-Flex) that converts any static and reconfigurable system into a 2 dimensional dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain and supports both FPGA architectures, the 7-Series and the UltraScale+. In addition, AMAH-Flex can be ported to any Xilinx FPGA family, starting with the 7-Series. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of partial bitstreams.
{"title":"AMAH-Flex: A Modular and Highly Flexible Tool for Generating Relocatable Systems on FPGAs","authors":"Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer","doi":"10.1109/ICFPT52863.2021.9609948","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609948","url":null,"abstract":"In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when using dynamic function exchange to accommodate changing workloads, partial bitstreams are typically not relocatable. So the runtime environment needs to store all reconfigurable partition/reconfigurable module combinations as separate bitstreams. We present a modular and highly flexible tool (AMAH-Flex) that converts any static and reconfigurable system into a 2 dimensional dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain and supports both FPGA architectures, the 7-Series and the UltraScale+. In addition, AMAH-Flex can be ported to any Xilinx FPGA family, starting with the 7-Series. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of partial bitstreams.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133913788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
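The "up to 80% of partial bitstreams" figure follows from a simple counting argument: without relocation, every reconfigurable-module/reconfigurable-partition pair needs its own stored bitstream, while with relocation one bitstream per module suffices. A back-of-the-envelope sketch (the module and partition counts are illustrative, not taken from the paper):

```python
def bitstream_count(num_modules, num_partitions, relocatable):
    # Non-relocatable flow: one partial bitstream per (module, partition) pair.
    # Relocatable flow: a single bitstream per module, placeable anywhere.
    return num_modules if relocatable else num_modules * num_partitions

modules, partitions = 4, 5
fixed = bitstream_count(modules, partitions, relocatable=False)  # 20 bitstreams
reloc = bitstream_count(modules, partitions, relocatable=True)   # 4 bitstreams
savings = 1 - reloc / fixed  # 0.8 -> 80% fewer stored bitstreams
```

In general the saving is 1 - 1/num_partitions, so the reported figure is consistent with designs having on the order of five compatible reconfigurable partitions.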
Journal
2021 International Conference on Field-Programmable Technology (ICFPT)