Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609932
Hector A. Li Sanchez, A. George
The Scale-Invariant Feature Transform (SIFT) is a feature extractor that serves as a key step in many computer-vision pipelines. Real-time operation based on a software-only approach is often infeasible, but FPGAs can be employed to parallelize execution and accelerate the application to meet latency requirements. In this study, we present a stream-based hardware acceleration architecture for SIFT feature extraction. A novel strategy for storing the pixels required for descriptor computation greatly reduces the execution time needed to generate SIFT descriptors relative to previous designs. This strategy also enables a further reduction of execution time by introducing multiple processing elements (PEs) that compute several SIFT descriptors in parallel. Additionally, the proposed architecture supports keypoint detection at an arbitrary number of octaves and allows runtime configuration of various parameters. An FPGA implementation targeting the Xilinx Zynq-7045 system-on-chip (SoC) device demonstrates the efficiency of the proposed architecture: on the target hardware, the resulting system processes images with a resolution of 1280 × 720 pixels at up to 150 FPS while maintaining modest resource utilization.
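To make the descriptor stage concrete, here is a minimal software sketch of SIFT-style descriptor binning: gradient orientations over a 16×16 keypoint patch are accumulated into 4×4 spatial cells of 8-bin histograms, giving the familiar 128-dimensional vector. This is a simplified illustration (no Gaussian weighting, trilinear interpolation, orientation normalization, or clamping), not the paper's hardware design; `sift_descriptor` is a hypothetical helper name.

```python
import math

def sift_descriptor(patch):
    """Simplified SIFT-style descriptor: a 16x16 patch is split into 4x4
    cells; each cell contributes an 8-bin gradient-orientation histogram,
    yielding 4*4*8 = 128 values."""
    assert len(patch) == 16 and all(len(row) == 16 for row in patch)
    desc = [0.0] * 128
    for y in range(1, 15):                # skip the border for central differences
        for x in range(1, 15):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            ang = math.atan2(dy, dx) % (2 * math.pi)
            o = int(ang / (2 * math.pi) * 8) % 8   # orientation bin 0..7
            cell = (y // 4) * 4 + (x // 4)          # which 4x4 spatial cell
            desc[cell * 8 + o] += mag
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]                 # unit-length descriptor
```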
Title: A streaming hardware architecture for real-time SIFT feature extraction
Published in: 2021 International Conference on Field-Programmable Technology (ICFPT)
Coarse-Grained Reconfigurable Arrays (CGRAs) provide sufficient flexibility for domain-specific applications with high hardware efficiency, which makes CGRAs suitable for fast-evolving fields such as neural network acceleration and edge computing. To keep pace with this rapid evolution, we propose FastCGRA, a modeling, mapping, and exploration platform for large-scale CGRAs. FastCGRA supports hierarchical architecture description and automatic switch-module generation. Connectivity-aware packing and graph-partitioning algorithms are designed to reduce the complexity of placement and routing. The graph-homomorphism placement algorithm in FastCGRA enables efficient placement on large-scale CGRAs. The packing and placement algorithms cooperate with a negotiation-based routing algorithm to form an integral mapping procedure. FastCGRA supports the modeling and mapping of large-scale CGRAs with significantly higher placement and routing efficiency than existing platforms, and its automatic switch-module generation reduces the complexity of CGRA interconnection design. With these features, FastCGRA can boost the exploration of large-scale CGRAs.
Title: FastCGRA: A Modeling, Evaluation, and Exploration Platform for Large-Scale Coarse-Grained Reconfigurable Arrays
Authors: Su Zheng, Kaisen Zhang, Yaoguang Tian, Wenbo Yin, Lingli Wang, Xuegong Zhou
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609928
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609899
Andrei Tosa, A. Hangan, G. Sebestyen, Z. István
Network-attached Smart Storage is becoming increasingly common in data analytics applications. It relies on processing elements, such as FPGAs, close to the storage medium to offload compute-intensive operations, reducing data movement across distributed nodes in the system. As a result, it can offer outstanding performance and energy efficiency. Modern data analytics systems are not only becoming more distributed, they are also increasingly focused on privacy-policy compliance. This means that, in the future, Smart Storage will have to offload more and more privacy-related processing. In this work, we explore how the computation of differentially private (DP) histograms, a basic building block of privacy-preserving analytics, can be offloaded to FPGAs. By performing DP aggregation on the storage side, untrusted clients can be allowed to query the data in aggregate form without risking the leakage of personally identifiable information. We prototype our idea by extending an FPGA-based distributed key-value store with three new components: first, a histogram module that processes values at 100 Gbps line rate; second, a random-noise generator that adds noise to the final histogram according to the rules dictated by DP; and third, a mechanism that limits the rate at which key-value pairs can be used in histograms, to stay within the DP privacy budget.
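As a software reference for the DP aggregation being offloaded, the following sketch builds a histogram and perturbs each bucket count with Laplace(1/ε) noise: adding or removing one record changes exactly one bucket by 1, so the L1 sensitivity is 1. The function names and the fixed-width bucketing are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(values, lo, hi, nbins, epsilon, rng=None):
    """Count values into fixed-width buckets over [lo, hi), then add
    Laplace(1/epsilon) noise to each count (sensitivity 1 for histograms)."""
    rng = rng or random.Random(0)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
```

With a large ε the noise is negligible and the true counts shine through; with a small ε each bucket is heavily perturbed, which is exactly the privacy/utility trade-off the rate-limiting mechanism budgets for.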
Title: In-Storage Computation of Histograms with differential privacy
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609804
John M. Wirth, Jaco A. Hofmann, Lasse Thostrup, Carsten Binnig, Andreas Koch
Programmable switches make it possible to offload specific processing tasks into the network and promise multi-Tbit/s throughput. One major goal when moving computation to the network is typically to reduce the volume of network traffic and thus improve overall performance. Accordingly, programmable switches are increasingly used, in both research and industry, for various scenarios, including statistics gathering, in-network consensus protocols, and more. However, currently available programmable switches suffer from several practical limitations. One important restriction is the limited amount of available memory, making them unsuitable for stateful operations such as Hash Joins in distributed databases. In previous work, an FPGA-based In-Network Hash Join accelerator was presented, initially using DDR-DRAM to hold the state. In a later iteration, the hash table was moved to HBM-DRAM to improve performance even further. However, while very fast, the size of the joins in this setup was limited by the relatively small amount of available HBM. In this work, we heterogeneously combine DDR-DRAM and HBM to support larger joins while still benefiting from the far faster and more parallel HBM accesses. In this manner, we improve performance by a factor of 3x compared to the previous HBM-based work. We also introduce additional configuration parameters, supporting a more flexible adaptation of the underlying hardware architecture to the different join operations required by a concrete use-case.
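For reference, the stateful operation being offloaded is the classic two-phase hash join: build a hash table on one relation, then stream the other relation through it. The sketch below is plain Python and abstracts away the paper's HBM/DDR-DRAM placement of the table; `hash_join` and its row format are illustrative.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Two-phase hash join: build a hash table on the (ideally smaller)
    build relation, then probe it with each row of the probe relation.
    Rows are dicts; matching rows are merged into one output dict."""
    table = {}
    for row in build_rows:                      # build phase: key -> rows
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                      # probe phase (streaming)
        for match in table.get(row[probe_key], ()):
            out.append({**match, **row})
    return out
```

The memory pressure the paper addresses comes from `table`: it must hold the whole build relation as state, which is what exceeds a programmable switch's memory and motivates the FPGA's combined HBM+DDR storage.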
Title: Scalable and Flexible High-Performance In-Network Processing of Hash Joins in Distributed Databases
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609831
Torben Kalkhof, Andreas Koch
Shared Virtual Memory (SVM) can considerably simplify application development for FPGA-accelerated computers, as it allows the seamless passing of virtually addressed pointers across the hardware/software boundary. Applications operating on complex pointer-based data structures especially profit from this approach, as SVM can often avoid copying the entire data set to FPGA memory and relocating pointers in the process. Many FPGA-accelerated computers, especially in data-center settings, employ PCIe-attached boards that have FPGA-local memory in the form of on-chip HBM or on-board DRAM. Accesses to this local memory are much faster than going to host memory via PCIe. Thus, even in the presence of SVM, it is desirable to move the physical memory pages holding frequently accessed data closest to the compute unit operating on them. This capability is called physical page migration. The main contribution of this work is an open-source framework that provides SVM with physical-page-migration capabilities to PCIe-attached FPGA cards. We benchmark both fully automatic on-demand and user-managed explicit migration modes, and show that for suitable use-cases the performance of migrations can not only match that of conventional DMA copy-based accelerator operations, but may even exceed it by overlapping computations and migrations.
Title: Efficient Physical Page Migrations in Shared Virtual Memory Reconfigurable Computing Systems
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609882
Yi Yan, H. Amano, M. Aono, Kaori Ohkoda, Shingo Fukuda, Kenta Saito, S. Kasai
The Boolean satisfiability problem (SAT) is an NP-complete combinatorial optimization problem, and fast SAT solvers are useful for various smart-society applications. Since these edge-oriented applications require time-critical control, a high-speed SAT solver on an FPGA is a promising approach. Here, the authors propose a novel FPGA implementation of a bio-inspired stochastic local-search algorithm called ‘AmoebaSAT’ on a Zynq board. Previous studies on FPGA-AmoebaSATs tackled relatively small 3-SAT instances with a few hundred variables and found solutions in several milliseconds. These implementations, however, adopted an instance-specific approach, which requires re-synthesizing the FPGA configuration every time the targeted instance is altered. In this paper, a slimmed version of AmoebaSAT named ‘AmoebaSATslim,’ which omits the most resource-consuming part of the interactions among variables, is proposed. FPGA-AmoebaSATslim can tackle significantly larger 3-SAT instances, accepting 30,000 variables with 130,800 clauses. It achieves up to approximately 24 times faster execution than software AmoebaSATslim running on an x86 server CPU.
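To illustrate the class of algorithm involved, here is a generic WalkSAT-style stochastic local search for 3-SAT — the same family AmoebaSAT belongs to, though it does not reproduce AmoebaSAT's bio-inspired interaction dynamics, which the paper's slimmed version partially omits. Clauses are lists of nonzero integers, with a negative literal denoting a negated variable.

```python
import random

def walksat(clauses, nvars, max_flips=100000, p=0.5, seed=0):
    """WalkSAT-style local search: start from a random assignment, repeatedly
    pick an unsatisfied clause and flip one of its variables, either at
    random (probability p) or greedily (the flip breaking the fewest clauses).
    Returns a satisfying assignment (index 1..nvars) or None on timeout."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(nvars + 1)]  # index 0 unused

    def sat(lit):
        return assign[abs(lit)] == (lit > 0)

    for _ in range(max_flips):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return assign
        clause = rng.choice(unsat)
        if rng.random() < p:                     # random-walk move
            var = abs(rng.choice(clause))
        else:                                    # greedy move
            def breaks(v):
                assign[v] = not assign[v]
                broken = sum(not any(sat(l) for l in c) for c in clauses)
                assign[v] = not assign[v]
                return broken
            var = min((abs(l) for l in clause), key=breaks)
        assign[var] = not assign[var]
    return None
```

The per-flip work here is inherently sequential on a CPU; the FPGA implementations evaluate the clause/variable update logic for many variables in parallel every cycle, which is where the reported speedup comes from.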
Title: Resource-saving FPGA Implementation of the Satisfiability Problem Solver: AmoebaSATslim
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609816
Austin Liolli, Omar Ragheb, J. Anderson
Control flow in a program can be represented as a directed graph, called the control flow graph (CFG). Nodes in the graph represent straight-line segments of code, called basic blocks, and directed edges between nodes correspond to transfers of control. We present a methodology to selectively reduce control flow by collapsing basic blocks into their parent blocks, revealing increased instruction-level parallelism to a high-level synthesis (HLS) scheduler and thereby raising circuit performance. We evaluate our approach within an HLS tool that automatically synthesizes a C-language software program into a hardware circuit, using the CHStone benchmark suite [1] and targeting an Intel Cyclone V FPGA. For individual benchmark circuits we observe cycle-count reductions of up to 20.7% and wall-clock time reductions of up to 22.6% (6% on average).
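A minimal sketch of the collapsing idea: a block whose single predecessor unconditionally falls through to it can be folded into that predecessor, yielding a longer straight-line block for the scheduler. This shows only the trivial merge and omits the profiling-guided selection the paper relies on; the `cfg` dictionary format is an assumption made for illustration.

```python
def collapse_blocks(cfg):
    """Fold each basic block with exactly one predecessor, whose only
    successor is that block, into its parent. `cfg` maps block name ->
    (list of instructions, list of successor names); modified in place."""
    preds = {b: [] for b in cfg}
    for b, (_, succs) in cfg.items():
        for s in succs:
            preds[s].append(b)
    changed = True
    while changed:
        changed = False
        for b in list(cfg):
            ps = preds.get(b, [])
            if len(ps) == 1 and cfg[ps[0]][1] == [b]:
                parent = ps[0]
                insns, succs = cfg.pop(b)        # remove the child block...
                cfg[parent] = (cfg[parent][0] + insns, succs)  # ...merge up
                for s in succs:                  # repoint successors' preds
                    preds[s] = [parent if p == b else p for p in preds[s]]
                del preds[b]
                changed = True
                break
    return cfg
```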
Title: Profiling-Based Control-Flow Reduction in High-Level Synthesis
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609809
Ichiro Kawashima, Yuichi Katori, T. Morie, H. Tamukoh
In this work, a new area-efficient multiply-accumulation scheme for time-domain neural processing, named differential multiply-accumulation, is proposed. The new scheme reduces the hardware resource utilization of multiply-accumulation while suppressing the increase in computational time caused by time-multiplexing. As a result, 2,048 neurons of fully connected CBM and RC-CBM networks were synthesized on a single field-programmable gate array (FPGA).
Title: An area-efficient multiply-accumulation architecture and implementations for time-domain neural processing
The Winograd algorithm can effectively reduce the computational complexity of convolution operations, and exploiting its parallelism can improve the performance of accelerator architectures on FPGAs. The stride is the number of elements by which the window slides as the filter is scanned across the input feature map. Stride-2 Winograd implementations in previous studies divided the input feature maps into multiple groups of Winograd computations, incurring additional precomputation and hardware-resource overhead. In this paper, we propose a new Winograd convolution algorithm with a stride of 2 that uses unified Winograd transformation matrices instead of the grouping method. As a result, the proposed method can realize 2D and 3D Winograd convolution by nesting the 1D Winograd convolution, just like the stride-1 Winograd algorithm. Winograd transformation matrices for kernel sizes of 3, 5, and 7 are provided. In particular, for a kernel size of 3, the method reduces the addition operations of the Winograd algorithm by 30.0%-31.5% and removes unnecessary shift operations completely. In addition, we implement the stride-2 Winograd convolution through template design, realizing pipelining and data reuse.
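For background, the standard stride-1 1D Winograd F(2,3) that the paper's unified stride-2 matrices generalize can be written out directly: it produces two outputs of a 3-tap convolution over a 4-element input tile using 4 multiplications instead of 6. The stride-2 transformation matrices themselves are the paper's contribution and are not reproduced here.

```python
def winograd_f23(d, g):
    """1D Winograd F(2,3), stride 1: y = A^T [(G g) * (B^T d)], with the
    classic transforms written out element-wise. d has 4 input elements,
    g has 3 filter taps; returns 2 convolution outputs with 4 multiplies."""
    # input transform B^T d
    bt_d = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # filter transform G g
    g_g = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # element-wise (Hadamard) product -- the only multiplications
    m = [bt_d[i] * g_g[i] for i in range(4)]
    # output transform A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]
```

2D tiles are handled by nesting this 1D form over rows and columns, which is exactly the nesting property the paper preserves for its stride-2 variant.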
Compared to the state-of-the-art implementation, the proposed method achieves a speedup of 1.24x and reduces resource usage.
Title: Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA
Authors: Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609907
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609948
Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer
In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when using dynamic function exchange to accommodate changing workloads, partial bitstreams are typically not relocatable, so the runtime environment needs to store every reconfigurable-partition/reconfigurable-module combination as a separate bitstream. We present AMAH-Flex, a modular and highly flexible tool that converts any static and reconfigurable system into a two-dimensional, dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain, supports both the 7-Series and UltraScale+ FPGA architectures, and can be ported to any Xilinx FPGA family starting with the 7-Series. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of the partial bitstreams.
Title: AMAH-Flex: A Modular and Highly Flexible Tool for Generating Relocatable Systems on FPGAs