Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609932
Hector A. Li Sanchez, A. George
The Scale-Invariant Feature Transform (SIFT) is a feature extractor that serves as a key step in many computer-vision pipelines. Real-time operation based on a software-only approach is often infeasible, but FPGAs can be employed to parallelize execution and accelerate the application to meet latency requirements. In this study, we present a stream-based hardware acceleration architecture for SIFT feature extraction. A novel strategy for storing the pixels required for descriptor computation greatly reduces the execution time needed to generate SIFT descriptors relative to previous designs. This strategy also enables a further reduction of execution time by introducing multiple processing elements (PEs) that compute several SIFT descriptors in parallel. Additionally, the proposed architecture supports keypoint detection at an arbitrary number of octaves and allows runtime configuration of various parameters. An FPGA implementation targeting the Xilinx Zynq-7045 system-on-chip (SoC) device demonstrates the efficiency of the proposed architecture: on the target hardware, the resulting system processes images with a resolution of 1280 × 720 pixels at up to 150 FPS while maintaining modest resource utilization.
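To make the descriptor stage concrete, here is a minimal software sketch of SIFT-style descriptor binning: gradient orientations over a 16×16 keypoint patch are accumulated into 4×4 spatial cells of 8-bin histograms, giving the familiar 128-dimensional vector. This is a simplified illustration (no Gaussian weighting, trilinear interpolation, orientation normalization, or clamping), not the paper's hardware design; `sift_descriptor` is a hypothetical helper name.

```python
import math

def sift_descriptor(patch):
    """Simplified SIFT-style descriptor: a 16x16 patch is split into 4x4
    cells; each cell contributes an 8-bin gradient-orientation histogram,
    yielding 4*4*8 = 128 values."""
    assert len(patch) == 16 and all(len(row) == 16 for row in patch)
    desc = [0.0] * 128
    for y in range(1, 15):                # skip the border for central differences
        for x in range(1, 15):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            ang = math.atan2(dy, dx) % (2 * math.pi)
            o = int(ang / (2 * math.pi) * 8) % 8   # orientation bin 0..7
            cell = (y // 4) * 4 + (x // 4)          # which 4x4 spatial cell
            desc[cell * 8 + o] += mag
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]                 # unit-length descriptor
```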
Title: A streaming hardware architecture for real-time SIFT feature extraction
Published in: 2021 International Conference on Field-Programmable Technology (ICFPT)
Coarse-Grained Reconfigurable Arrays (CGRAs) provide sufficient flexibility for domain-specific applications with high hardware efficiency, which makes CGRAs suitable for fast-evolving fields such as neural network acceleration and edge computing. To keep pace with this rapid evolution, we propose FastCGRA, a modeling, mapping, and exploration platform for large-scale CGRAs. FastCGRA supports hierarchical architecture description and automatic switch-module generation. Connectivity-aware packing and graph-partitioning algorithms are designed to reduce the complexity of placement and routing. The graph-homomorphism placement algorithm in FastCGRA enables efficient placement on large-scale CGRAs. The packing and placement algorithms cooperate with a negotiation-based routing algorithm to form an integral mapping procedure. FastCGRA supports the modeling and mapping of large-scale CGRAs with significantly higher placement and routing efficiency than existing platforms, and its automatic switch-module generation reduces the complexity of CGRA interconnection design. With these features, FastCGRA can boost the exploration of large-scale CGRAs.
Title: FastCGRA: A Modeling, Evaluation, and Exploration Platform for Large-Scale Coarse-Grained Reconfigurable Arrays
Authors: Su Zheng, Kaisen Zhang, Yaoguang Tian, Wenbo Yin, Lingli Wang, Xuegong Zhou
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609928
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609899
Andrei Tosa, A. Hangan, G. Sebestyen, Z. István
Network-attached Smart Storage is becoming increasingly common in data analytics applications. It relies on processing elements, such as FPGAs, close to the storage medium to offload compute-intensive operations, reducing data movement across distributed nodes in the system. As a result, it can offer outstanding performance and energy efficiency. Modern data analytics systems are not only becoming more distributed, they are also increasingly focused on privacy-policy compliance. This means that, in the future, Smart Storage will have to offload more and more privacy-related processing. In this work, we explore how the computation of differentially private (DP) histograms, a basic building block of privacy-preserving analytics, can be offloaded to FPGAs. By performing DP aggregation on the storage side, untrusted clients can be allowed to query the data in aggregate form without risking the leakage of personally identifiable information. We prototype our idea by extending an FPGA-based distributed key-value store with three new components: first, a histogram module that processes values at 100 Gbps line rate; second, a random-noise generator that adds noise to the final histogram according to the rules dictated by DP; and third, a mechanism that limits the rate at which key-value pairs can be used in histograms, to stay within the DP privacy budget.
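As a software reference for the DP aggregation being offloaded, the following sketch builds a histogram and perturbs each bucket count with Laplace(1/ε) noise: adding or removing one record changes exactly one bucket by 1, so the L1 sensitivity is 1. The function names and the fixed-width bucketing are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(values, lo, hi, nbins, epsilon, rng=None):
    """Count values into fixed-width buckets over [lo, hi), then add
    Laplace(1/epsilon) noise to each count (sensitivity 1 for histograms)."""
    rng = rng or random.Random(0)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
```

With a large ε the noise is negligible and the true counts shine through; with a small ε each bucket is heavily perturbed, which is exactly the privacy/utility trade-off the rate-limiting mechanism budgets for.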
Title: In-Storage Computation of Histograms with differential privacy
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609804
John M. Wirth, Jaco A. Hofmann, Lasse Thostrup, Carsten Binnig, Andreas Koch
Programmable switches make it possible to offload specific processing tasks into the network and promise multi-Tbit/s throughput. One major goal when moving computation to the network is typically to reduce the volume of network traffic and thus improve overall performance. Accordingly, programmable switches are increasingly used, in both research and industry, for various scenarios, including statistics gathering, in-network consensus protocols, and more. However, currently available programmable switches suffer from several practical limitations. One important restriction is the limited amount of available memory, making them unsuitable for stateful operations such as Hash Joins in distributed databases. In previous work, an FPGA-based In-Network Hash Join accelerator was presented, initially using DDR-DRAM to hold the state. In a later iteration, the hash table was moved to HBM-DRAM to improve performance even further. However, while very fast, the size of the joins in this setup was limited by the relatively small amount of available HBM. In this work, we heterogeneously combine DDR-DRAM and HBM to support larger joins while still benefiting from the far faster and more parallel HBM accesses. In this manner, we improve performance by a factor of 3x compared to the previous HBM-based work. We also introduce additional configuration parameters, supporting a more flexible adaptation of the underlying hardware architecture to the different join operations required by a concrete use-case.
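For reference, the stateful operation being offloaded is the classic two-phase hash join: build a hash table on one relation, then stream the other relation through it. The sketch below is plain Python and abstracts away the paper's HBM/DDR-DRAM placement of the table; `hash_join` and its row format are illustrative.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Two-phase hash join: build a hash table on the (ideally smaller)
    build relation, then probe it with each row of the probe relation.
    Rows are dicts; matching rows are merged into one output dict."""
    table = {}
    for row in build_rows:                      # build phase: key -> rows
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                      # probe phase (streaming)
        for match in table.get(row[probe_key], ()):
            out.append({**match, **row})
    return out
```

The memory pressure the paper addresses comes from `table`: it must hold the whole build relation as state, which is what exceeds a programmable switch's memory and motivates the FPGA's combined HBM+DDR storage.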
Title: Scalable and Flexible High-Performance In-Network Processing of Hash Joins in Distributed Databases
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609831
Torben Kalkhof, Andreas Koch
Shared Virtual Memory (SVM) can considerably simplify application development for FPGA-accelerated computers, as it allows the seamless passing of virtually addressed pointers across the hardware/software boundary. Applications operating on complex pointer-based data structures especially profit from this approach, as SVM can often avoid copying the entire data set to FPGA memory and relocating pointers in the process. Many FPGA-accelerated computers, especially in data-center settings, employ PCIe-attached boards that have FPGA-local memory in the form of on-chip HBM or on-board DRAM. Accesses to this local memory are much faster than going to host memory via PCIe. Thus, even in the presence of SVM, it is desirable to move the physical memory pages holding frequently accessed data closest to the compute unit operating on them. This capability is called physical page migration. The main contribution of this work is an open-source framework that provides SVM with physical-page-migration capabilities to PCIe-attached FPGA cards. We benchmark both fully automatic on-demand and user-managed explicit migration modes, and show that for suitable use-cases the performance of migrations can not only match that of conventional DMA copy-based accelerator operations, but may even exceed it by overlapping computations and migrations.
Title: Efficient Physical Page Migrations in Shared Virtual Memory Reconfigurable Computing Systems
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609882
Yi Yan, H. Amano, M. Aono, Kaori Ohkoda, Shingo Fukuda, Kenta Saito, S. Kasai
The Boolean satisfiability problem (SAT) is an NP-complete combinatorial optimization problem, and fast SAT solvers are useful for various smart-society applications. Since these edge-oriented applications require time-critical control, a high-speed SAT solver on an FPGA is a promising approach. Here, the authors propose a novel FPGA implementation of a bio-inspired stochastic local-search algorithm called ‘AmoebaSAT’ on a Zynq board. Previous studies on FPGA-AmoebaSATs tackled relatively small 3-SAT instances with a few hundred variables and found solutions in several milliseconds. These implementations, however, adopted an instance-specific approach, which requires re-synthesizing the FPGA configuration every time the targeted instance is altered. In this paper, a slimmed version of AmoebaSAT named ‘AmoebaSATslim,’ which omits the most resource-consuming part of the interactions among variables, is proposed. FPGA-AmoebaSATslim can tackle significantly larger 3-SAT instances, accepting 30,000 variables with 130,800 clauses. It achieves up to approximately 24 times faster execution than software AmoebaSATslim running on an x86 server CPU.
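To illustrate the class of algorithm involved, here is a generic WalkSAT-style stochastic local search for 3-SAT — the same family AmoebaSAT belongs to, though it does not reproduce AmoebaSAT's bio-inspired interaction dynamics, which the paper's slimmed version partially omits. Clauses are lists of nonzero integers, with a negative literal denoting a negated variable.

```python
import random

def walksat(clauses, nvars, max_flips=100000, p=0.5, seed=0):
    """WalkSAT-style local search: start from a random assignment, repeatedly
    pick an unsatisfied clause and flip one of its variables, either at
    random (probability p) or greedily (the flip breaking the fewest clauses).
    Returns a satisfying assignment (index 1..nvars) or None on timeout."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(nvars + 1)]  # index 0 unused

    def sat(lit):
        return assign[abs(lit)] == (lit > 0)

    for _ in range(max_flips):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return assign
        clause = rng.choice(unsat)
        if rng.random() < p:                     # random-walk move
            var = abs(rng.choice(clause))
        else:                                    # greedy move
            def breaks(v):
                assign[v] = not assign[v]
                broken = sum(not any(sat(l) for l in c) for c in clauses)
                assign[v] = not assign[v]
                return broken
            var = min((abs(l) for l in clause), key=breaks)
        assign[var] = not assign[var]
    return None
```

The per-flip work here is inherently sequential on a CPU; the FPGA implementations evaluate the clause/variable update logic for many variables in parallel every cycle, which is where the reported speedup comes from.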
Title: Resource-saving FPGA Implementation of the Satisfiability Problem Solver: AmoebaSATslim
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609816
Austin Liolli, Omar Ragheb, J. Anderson
Control flow in a program can be represented as a directed graph, called the control flow graph (CFG). Nodes in the graph represent straight-line segments of code, called basic blocks, and directed edges between nodes correspond to transfers of control. We present a methodology to selectively reduce control flow by collapsing basic blocks into their parent blocks, revealing increased instruction-level parallelism to a high-level synthesis (HLS) scheduler and thereby raising circuit performance. We evaluate our approach within an HLS tool that automatically synthesizes a C-language software program into a hardware circuit, using the CHStone benchmark suite [1] and targeting an Intel Cyclone V FPGA. For individual benchmark circuits we observe cycle-count reductions of up to 20.7% and wall-clock time reductions of up to 22.6% (6% on average).
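A minimal sketch of the collapsing idea: a block whose single predecessor unconditionally falls through to it can be folded into that predecessor, yielding a longer straight-line block for the scheduler. This shows only the trivial merge and omits the profiling-guided selection the paper relies on; the `cfg` dictionary format is an assumption made for illustration.

```python
def collapse_blocks(cfg):
    """Fold each basic block with exactly one predecessor, whose only
    successor is that block, into its parent. `cfg` maps block name ->
    (list of instructions, list of successor names); modified in place."""
    preds = {b: [] for b in cfg}
    for b, (_, succs) in cfg.items():
        for s in succs:
            preds[s].append(b)
    changed = True
    while changed:
        changed = False
        for b in list(cfg):
            ps = preds.get(b, [])
            if len(ps) == 1 and cfg[ps[0]][1] == [b]:
                parent = ps[0]
                insns, succs = cfg.pop(b)        # remove the child block...
                cfg[parent] = (cfg[parent][0] + insns, succs)  # ...merge up
                for s in succs:                  # repoint successors' preds
                    preds[s] = [parent if p == b else p for p in preds[s]]
                del preds[b]
                changed = True
                break
    return cfg
```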
Title: Profiling-Based Control-Flow Reduction in High-Level Synthesis
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609809
Ichiro Kawashima, Yuichi Katori, T. Morie, H. Tamukoh
In this work, a new area-efficient multiply-accumulation scheme for time-domain neural processing, named differential multiply-accumulation, is proposed. The new scheme reduces the hardware resource utilization of multiply-accumulation while suppressing the increase in computational time caused by time-multiplexing. As a result, 2,048 neurons of fully connected CBM and RC-CBM networks were synthesized on a single field-programmable gate array (FPGA).
Title: An area-efficient multiply-accumulation architecture and implementations for time-domain neural processing
The Winograd algorithm can effectively reduce the computational complexity of convolution operations, and exploiting its parallelism can improve the performance of accelerator architectures on FPGAs. The stride is the number of elements by which the window slides as the filter is scanned across the input feature map. Stride-2 Winograd implementations in previous studies divided the input feature maps into multiple groups of Winograd computations, incurring additional precomputation and hardware-resource overhead. In this paper, we propose a new Winograd convolution algorithm with a stride of 2 that uses unified Winograd transformation matrices instead of the grouping method. As a result, the proposed method can realize 2D and 3D Winograd convolution by nesting the 1D Winograd convolution, just like the stride-1 Winograd algorithm. Winograd transformation matrices for kernel sizes of 3, 5, and 7 are provided. In particular, for a kernel size of 3, the method reduces the addition operations of the Winograd algorithm by 30.0%-31.5% and removes unnecessary shift operations completely. In addition, we implement the stride-2 Winograd convolution through template design, realizing pipelining and data reuse.
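For background, the standard stride-1 1D Winograd F(2,3) that the paper's unified stride-2 matrices generalize can be written out directly: it produces two outputs of a 3-tap convolution over a 4-element input tile using 4 multiplications instead of 6. The stride-2 transformation matrices themselves are the paper's contribution and are not reproduced here.

```python
def winograd_f23(d, g):
    """1D Winograd F(2,3), stride 1: y = A^T [(G g) * (B^T d)], with the
    classic transforms written out element-wise. d has 4 input elements,
    g has 3 filter taps; returns 2 convolution outputs with 4 multiplies."""
    # input transform B^T d
    bt_d = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # filter transform G g
    g_g = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # element-wise (Hadamard) product -- the only multiplications
    m = [bt_d[i] * g_g[i] for i in range(4)]
    # output transform A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]
```

2D tiles are handled by nesting this 1D form over rows and columns, which is exactly the nesting property the paper preserves for its stride-2 variant.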
Compared to the state-of-the-art implementation, the proposed method achieves a speedup of 1.24x and reduces resource usage.
Title: Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA
Authors: Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609907
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609948
Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer
In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when using dynamic function exchange to accommodate changing workloads, partial bitstreams are typically not relocatable, so the runtime environment needs to store every reconfigurable-partition/reconfigurable-module combination as a separate bitstream. We present AMAH-Flex, a modular and highly flexible tool that converts any static and reconfigurable system into a two-dimensional, dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain, supports both the 7-Series and UltraScale+ FPGA architectures, and can be ported to any Xilinx FPGA family starting with the 7-Series. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of the partial bitstreams.
Title: AMAH-Flex: A Modular and Highly Flexible Tool for Generating Relocatable Systems on FPGAs