Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857168
Tobias Kalb, D. Göhringer
In recent years, dynamic partial reconfiguration (DPR) has become a well-established technique for systems featuring a field-programmable gate array (FPGA). Systems-on-Chip (SoCs) with an ARM processor ease the use of DPR and motivate its adoption for its advantages, such as reduced area and power and faster reconfiguration of the FPGA. Nonetheless, developing SoCs is still a complex and time-consuming task, especially for designs using DPR. Xilinx counters this complexity with its new high-level tools, the SDx Development Environment. The SDSoC Development Environment accelerates the development of designs running on Zynq 7000 devices by using only C/C++ applications as input. Unfortunately, this high-level workflow does not incorporate DPR. This paper presents an approach to using DPR in Xilinx SDSoC, so that an application-specific design can benefit from both the high-level workflow and the advantages of DPR. We show that our approach to DPR in SDSoC shortens the overall design time and creates a more efficient embedded application. In our use case, the dynamic partial reconfiguration of hardware accelerators takes 10 ms, and the hardware-related section of our embedded application is accelerated by a factor of 14 due to DPR.
Title : Enabling dynamic and partial reconfiguration in Xilinx SDSoC
Published in : 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857162
Siavash Rezaei, César-Alejandro Hernández-Calderón, S. Mirzamohammadi, E. Bozorgzadeh, A. Veidenbaum, A. Nicolau, M. Prather
In heterogeneous architectures, FPGAs are expected to provide not only higher performance but also a more energy-efficient solution for computationally intensive tasks. While parallelism and pipelining enhance performance on FPGA platforms, the data transfer rate to and from off-chip memory can degrade performance. We propose an automated high-level synthesis framework for FPGA-based acceleration of nested loops on large multidimensional input data sets. Given the high level of parallelism in such applications, our proposed data prefetching algorithm determines the data rate for each parallel datapath. Empirical results from a case study in scientific computing show that mapping such nested loops to an FPGA accelerates the application compared to a traditional mapping on multicores. The FPGA-accelerated computation achieves a 3x speedup in runtime and 27x savings in energy-delay product compared to multicore computation.
Title : Data-rate-aware FPGA-based acceleration framework for streaming applications
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857151
J. Ferreira, Jose Fonseca
Our work proposes a hardware architecture for a Long Short-Term Memory (LSTM) neural network that aims to outperform software implementations by exploiting the network's inherent parallelism. The main design decisions are presented, along with the proposed network architecture and a description of its main building blocks. The network is synthesized for various sizes and platforms, and the performance results are presented and analyzed. Our synthesized network achieves a 251-fold speed-up over a custom-built software network running on an i7-3770K desktop computer, demonstrating the benefits of parallel computation for this kind of network.
Title : An FPGA implementation of a long short-term memory neural network
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857169
Kentaro Orimo, Kota Ando, Kodai Ueyoshi, M. Ikebe, T. Asai, M. Motomura
Deep learning is widely used in various applications, and diverse neural networks have been proposed. One such network, the novel feed-forward sequential memory network (FSMN), aims to forecast future data by extracting time-series features. FSMN is a standard feed-forward neural network equipped with time-domain filters, and it can forecast without recurrent feedback. In this paper, we propose a field-programmable gate array (FPGA) architecture for this model and show that its resource usage does not increase exponentially as the network scale increases.
Title : FPGA architecture for feed-forward sequential memory network targeting long-term time-series forecasting
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857170
Habib ul Hasan Khan, D. Göhringer
This paper presents an FPGA debugging methodology based on a device start and stop (DSAS) approach. Using this approach, the design starts and stops a device under test (DUT) and saves the data to external memory without human interaction. The presented debugging circuit saves data in a trace buffer; once the trace buffer is full, it stops the DUT, saves the data to external memory through Ethernet, and then starts the DUT again. Hence the quantity of debug data is not limited. The contents stored on the external devices can subsequently be viewed with open-source waveform viewers or HDL simulators. The main benefits of the technique are an unlimited debug window, reduced use of scarce FPGA resources, and no loss of debugging data. Neither an external emulation system nor user intervention is required to save the recorded data once the BRAMs are full.
Title : FPGA debugging by a device start and stop approach
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857188
Wen Wang, Jakub Szefer, R. Niederhagen
This paper presents an efficient systolic line architecture for solving large systems of linear equations using Gaussian elimination on the coefficient matrix. Our architecture can also be used for solving matrix inversion problems and for computing the systematic form of matrices. These are common and important computational problems that appear in areas such as cryptography and cryptanalysis. Our architecture solves these problems efficiently for any large-sized matrix over GF(2), regardless of matrix size, shape or density. We implemented and synthesized our design for Altera and Xilinx FPGAs to obtain evaluation data. The results show sub-μs performance for the Gaussian elimination of medium-sized matrices and performance on the order of tens to hundreds of ms for large matrices. In addition, this is one of the first works addressing large-sized matrices of up to 4,000 × 8,000 elements and therefore is suitable for post-quantum cryptographic schemes that require handling such large matrices.
Title : Solving large systems of linear equations over GF(2) on FPGAs
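For readers unfamiliar with the underlying operation, Gaussian elimination over GF(2) can be sketched in software as follows. This is a plain sequential illustration, not the systolic line architecture the abstract describes; the function name `gf2_row_reduce` and the bit-packed row layout are assumptions made for this example. Over GF(2), adding one row to another is a bitwise XOR, which is why each row is stored as an integer.

```python
def gf2_row_reduce(rows, ncols):
    """Bring a GF(2) matrix to reduced row echelon form.

    rows  -- list of ints; bit (ncols - 1 - c) of each int holds column c
    ncols -- number of columns (including an augmented column, if any)
    Returns (reduced_rows, rank).
    """
    rows = list(rows)  # work on a copy
    rank = 0
    for col in range(ncols):
        mask = 1 << (ncols - 1 - col)
        # Look for a row at or below `rank` with a 1 in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r] & mask), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Clear this column in every other row; row addition over GF(2) is XOR.
        for r in range(len(rows)):
            if r != rank and rows[r] & mask:
                rows[r] ^= rows[rank]
        rank += 1
        if rank == len(rows):
            break
    return rows, rank


# Augmented system: x + y + z = 1, y + z = 0, z = 1
reduced, rank = gf2_row_reduce([0b1111, 0b0110, 0b0011], 4)
print([format(r, "04b") for r in reduced], rank)
```

In the reduced augmented matrix, the last bit of each row is the value of the corresponding unknown, so the example yields x = y = z = 1.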
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857153
Hotaka Kusano, M. Ikebe, T. Asai, M. Motomura
The demand for lightweight and high-speed super resolution (SR) techniques is growing because super high-resolution displays, such as 4K/8K ultra high definition televisions (UHDTVs), have become common. Here we propose an SR method using over up-sampling and anti-aliasing that, unlike conventional SR methods, requires no iteration process. Our method is able to attenuate jaggies at the edges of an enlarged image and does not need to store the entire enlarged image. Therefore, this method is suitable for hardware implementation, and the architecture requires only five line buffers in the memory section. We implemented the proposed method on a field-programmable gate array (FPGA) and demonstrated HDTV-to-4K and HDTV-to-8K SR processing in real time (60 frames per second).
Title : An FPGA-optimized architecture of anti-aliasing based super resolution for real-time HDTV to 4K- and 8K-UHD conversions
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857163
Atil U. Ay, Erdinç Öztürk, F. Rodríguez-Henríquez, E. Savaş
In this paper we present a scalar multiplication hardware architecture that computes a constant-time variable-base point multiplication over the Galbraith-Lin-Scott (GLS) family of binary elliptic curves. Our hardware design is especially tailored for the quadratic extension field $\mathbb{F}_{2^{2n}}$, with n = 127, which allows us to attain a security level close to 128 bits. We extensively explore the use of digit-based and Karatsuba multipliers for performing the quadratic field arithmetic associated with GLS elliptic curves and report the area and time performance obtained by these two types of multipliers. Targeting a Xilinx Kintex-7 FPGA device, we report a hardware implementation of our design that achieves a delay of just 3.98 μs for computing one scalar multiplication. This allows us to claim the current speed record for this operation at or around the 128-bit security level for any hardware or software realization reported in the literature.
Title : Design and implementation of a constant-time FPGA accelerator for fast elliptic curve cryptography
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857167
Thorbjörn Posewsky, Daniel Ziener
Deep neural networks are an extremely successful and widely used technique for various pattern recognition and machine learning tasks. Due to power and resource constraints, these computationally intensive networks are difficult to implement in embedded systems. Yet the number of applications that can benefit from them is rising rapidly. In this paper, we propose a novel architecture for processing previously learned, arbitrary deep neural networks on FPGA-based SoCs that is able to overcome these limitations. A key contribution of our approach, which we refer to as batch processing, reduces the required weight matrix transfers from external memory by reusing weights across multiple input samples. This technique, combined with sophisticated pipelining and the use of high-performance interfaces, accelerates data processing by one order of magnitude compared to existing approaches on the same FPGA device. Furthermore, we achieve data throughput comparable to that of a fully featured x86-based system at only a fraction of its energy consumption.
Title : Efficient deep neural network acceleration through FPGA-based batch processing
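The weight-reuse idea behind batch processing can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names, the single fully connected layer, and the `fetches` counter (standing in for transfers of a weight row from external memory) are all hypothetical and are not taken from the paper's architecture.

```python
def forward_per_sample(weights, batch):
    """Process samples one at a time: every weight row is fetched per sample."""
    fetches = 0
    outputs = []
    for x in batch:
        out = []
        for row in weights:
            fetches += 1  # hypothetical cost: one external-memory transfer
            out.append(sum(w * xi for w, xi in zip(row, x)))
        outputs.append(out)
    return outputs, fetches


def forward_batched(weights, batch):
    """Process the whole batch per weight row: each row is fetched only once."""
    fetches = 0
    outputs = [[0] * len(weights) for _ in batch]
    for j, row in enumerate(weights):
        fetches += 1  # fetched once, then reused for every sample in the batch
        for s, x in enumerate(batch):
            outputs[s][j] = sum(w * xi for w, xi in zip(row, x))
    return outputs, fetches


weights = [[1, 2], [3, 4]]
batch = [[1, 0], [0, 1], [1, 1]]
print(forward_per_sample(weights, batch))
print(forward_batched(weights, batch))
```

Both functions compute the same outputs, but in this model the batched version needs one fetch per weight row rather than one per row and sample, so the transfer count drops by a factor equal to the batch size.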
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857190
B. Buhrow, William J. Goetzinger, B. Gilbert
As network data rates advance toward 1 Tb/s, hardware-based implementations of anti-replay offer desirable tradeoffs over software. However, internal logic busses in FPGAs are becoming wider (512+ bits) and segmented (more than one packet per clock cycle) to accommodate increased network data rates. Such busses are challenging for applications such as anti-replay that require read-modify-write operations on a coherent database at each packet arrival. In this paper we present an FPGA-targeted pipelined anti-replay design capable of accommodating 1024 IPsec tunnels at a 1 Tb/s data rate. The novel design is enabled by fast on-chip block RAMs in an XCVU190 Virtex UltraScale FPGA, which are used to construct a 20-port RAM operating at 400 MHz with over 5 Tb/s of peak bandwidth. We describe custom single-clock write-combining techniques that accommodate multiple concurrent updates to the same database address. We also investigate the limits of capacity and concurrency for the anti-replay application.
Title : 1 Tb/s anti-replay protection with 20-port on-chip RAM memory in FPGAs
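The read-modify-write nature of anti-replay checking can be seen in the conventional sliding-window scheme used with IPsec, sketched below in software. The abstract does not spell out the paper's exact algorithm, so the class name `AntiReplayWindow`, the 64-entry default window, and the bitmap layout are assumptions for illustration only. Each packet reads the per-tunnel state, tests it, and writes it back, which is the operation that becomes difficult when several packets arrive in one clock cycle.

```python
class AntiReplayWindow:
    """Sliding-window replay check for one tunnel (illustrative sketch)."""

    def __init__(self, size=64):
        self.size = size   # window width in sequence numbers (assumed default)
        self.top = 0       # highest sequence number accepted so far
        self.bitmap = 0    # bit i set => sequence number (top - i) was seen

    def check_and_update(self, seq):
        """Return True and record seq if it is fresh, else return False."""
        if seq == 0:
            return False  # sequence numbers start at 1 in this sketch
        if seq > self.top:
            # New highest number: slide the window forward and mark it seen.
            shift = seq - self.top
            window_mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & window_mask
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.size:
            return False  # older than the window: reject
        if self.bitmap & (1 << offset):
            return False  # already seen: replay
        self.bitmap |= 1 << offset  # inside the window and new: accept
        return True


window = AntiReplayWindow(size=4)
for seq in (1, 1, 3, 2, 2, 10, 6, 7):
    print(seq, window.check_and_update(seq))
```

A design handling many tunnels would keep one such `(top, bitmap)` record per tunnel in memory, indexed by tunnel identifier.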