Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082780
O. Sander, S. Bähr, Enno Lübbers, T. Sandmann, Viet Vu Duy, J. Becker
Especially in complex system-of-systems scenarios, where multiple high-performance or real-time processing functions need to co-exist and interact, reconfigurable devices together with virtualization techniques show considerable promise to increase efficiency, ease integration and maintain functional and non-functional properties of the individual functions. In this paper, we propose a flexible interface architecture with low overhead for coupling reconfigurable coprocessors to high-performance general-purpose processors, allowing customized yet efficient construction of heterogeneous processing systems. Our implementation is based on PCI Express (PCIe) and optimized for virtualized systems, taking advantage of the SR-IOV capabilities in modern PCIe implementations. We describe the interface architecture and its fundamental technologies, detail the services provided to individual coprocessors and accelerator modules, and quantify key corner performance indicators relevant for virtualized applications.
{"title":"A flexible interface architecture for reconfigurable coprocessors in embedded multicore systems using PCIe Single-root I/O virtualization","authors":"O. Sander, S. Bähr, Enno Lübbers, T. Sandmann, Viet Vu Duy, J. Becker","doi":"10.1109/FPT.2014.7082780","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082780","url":null,"abstract":"Especially in complex system-of-systems scenarios, where multiple high-performance or real-time processing functions need to co-exist and interact, reconfigurable devices together with virtualization techniques show considerable promise to increase efficiency, ease integration and maintain functional and non-functional properties of the individual functions. In this paper, we propose a flexible interface architecture with low overhead for coupling reconfigurable coprocessors to high-performance general-purpose processors, allowing customized yet efficient construction of heterogeneous processing systems. Our implementation is based on PCI Express (PCIe) and optimized for virtualized systems, taking advantage of the SR-IOV capabilities in modern PCIe implementations. We describe the interface architecture and its fundamental technologies, detail the services provided to individual coprocessors and accelerator modules, and quantify key corner performance indicators relevant for virtualized applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"25 1","pages":"223-226"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79645248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082797
Yidi Liu, Benjamin Carrión Schäfer
This works presents a fast and efficient method to map multiple computationally intensive kernels onto the same FPGA given the FPGA area and communication bandwidth constraint. FPGAs have grown to a size where multiple applications can now be mapped onto a single device. It is therefore important to develop methods than can efficiently decide which kernels of all of the applications under consideration should be mapped onto the FPGA in order to maximize the total system acceleration. Our method shows very good results compared to a standard genetic algorithm, which is often used for multi-objective optimization problems and against the optimal solution obtained using an exhaustive search method. Experimental results show that our method is very scalable and extremely fast.
{"title":"HW acceleration of multiple applications on a single FPGA","authors":"Yidi Liu, Benjamin Carrión Schäfer","doi":"10.1109/FPT.2014.7082797","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082797","url":null,"abstract":"This works presents a fast and efficient method to map multiple computationally intensive kernels onto the same FPGA given the FPGA area and communication bandwidth constraint. FPGAs have grown to a size where multiple applications can now be mapped onto a single device. It is therefore important to develop methods than can efficiently decide which kernels of all of the applications under consideration should be mapped onto the FPGA in order to maximize the total system acceleration. Our method shows very good results compared to a standard genetic algorithm, which is often used for multi-objective optimization problems and against the optimal solution obtained using an exhaustive search method. Experimental results show that our method is very scalable and extremely fast.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"28 12 1","pages":"284-285"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87667772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082754
Shengjia Shao, Ce Guo, W. Luk, Stephen Weston
Transfer entropy is a measure of information transfer between two time series. It is an asymmetric measure based on entropy change which only takes into account the statistical dependency originating in the source series, but excludes dependency on a common external factor. Transfer entropy is able to capture system dynamics that traditional measures cannot, and has been successfully applied to various areas such as neuroscience, bioinformatics, data mining and finance. When time series becomes longer and resolution becomes higher, computing transfer entropy is demanding. This paper presents the first reconfigurable computing solution to accelerate transfer entropy computation. The novel aspects of our approach include a new technique based on Laplace's Rule of Succession for probability estimation; a novel architecture with optimised memory allocation, bit-width narrowing and mixed-precision optimisation; and its implementation targeting a Xilinx Virtex-6 SX475T FPGA. In our experiments, the proposed FPGA-based solution is up to 111.47 times faster than one Xeon CPU core, and 18.69 times faster than a 6-core Xeon CPU.
{"title":"Accelerating transfer entropy computation","authors":"Shengjia Shao, Ce Guo, W. Luk, Stephen Weston","doi":"10.1109/FPT.2014.7082754","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082754","url":null,"abstract":"Transfer entropy is a measure of information transfer between two time series. It is an asymmetric measure based on entropy change which only takes into account the statistical dependency originating in the source series, but excludes dependency on a common external factor. Transfer entropy is able to capture system dynamics that traditional measures cannot, and has been successfully applied to various areas such as neuroscience, bioinformatics, data mining and finance. When time series becomes longer and resolution becomes higher, computing transfer entropy is demanding. This paper presents the first reconfigurable computing solution to accelerate transfer entropy computation. The novel aspects of our approach include a new technique based on Laplace's Rule of Succession for probability estimation; a novel architecture with optimised memory allocation, bit-width narrowing and mixed-precision optimisation; and its implementation targeting a Xilinx Virtex-6 SX475T FPGA. In our experiments, the proposed FPGA-based solution is up to 111.47 times faster than one Xeon CPU core, and 18.69 times faster than a 6-core Xeon CPU.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"60-67"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89142750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082742
M. Butts
Throughout its twenty-five year history, logic emulation architectures have been governed by Rent's Rule. This empirical observation, first used to build 1960s mainframes, predicts the average number of cut nets that result when a digital module is arbitrarily partitioned into multiple parts, such as the FPGAs of a logic emulator. A fundamental advantage of emulation is that, unlike most devices, FPGAs always grow in capacity according to Moore's Law, just as the designs to be emulated have grown. Unfortunately packaging technology advances at a far slower pace, leaving emulators short on the pins demanded by Rent's Rule. Many cut nets are now sent through each package pin, which costs speed, power and area. At today's system-on-chip level of design, the number of system-level modules is growing, while their sizes are remaining constant. In the meantime, FPGAs have grown from a handful of logic lookup tables (LUTs) at the beginning to over a million LUTs today. At this scale, an entire system-level module such as an advanced 64-bit CPU can fit inside a single FPGA. Fewer module-internal nets need be cut, so Rent's Rule constraints are relaxing. Fewer and higher-level cut nets means logic emulation with megaLUT FPGAs is becoming faster, cooler, smaller, cheaper, and more reliable. FPGA's Moore's Law scaling is escaping from Rent's Rule.
{"title":"Logic emulation in the megaLUT era - Moore's Law beats Rent's Rule","authors":"M. Butts","doi":"10.1109/FPT.2014.7082742","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082742","url":null,"abstract":"Throughout its twenty-five year history, logic emulation architectures have been governed by Rent's Rule. This empirical observation, first used to build 1960s mainframes, predicts the average number of cut nets that result when a digital module is arbitrarily partitioned into multiple parts, such as the FPGAs of a logic emulator. A fundamental advantage of emulation is that, unlike most devices, FPGAs always grow in capacity according to Moore's Law, just as the designs to be emulated have grown. Unfortunately packaging technology advances at a far slower pace, leaving emulators short on the pins demanded by Rent's Rule. Many cut nets are now sent through each package pin, which costs speed, power and area. At today's system-on-chip level of design, the number of system-level modules is growing, while their sizes are remaining constant. In the meantime, FPGAs have grown from a handful of logic lookup tables (LUTs) at the beginning to over a million LUTs today. At this scale, an entire system-level module such as an advanced 64-bit CPU can fit inside a single FPGA. Fewer module-internal nets need be cut, so Rent's Rule constraints are relaxing. Fewer and higher-level cut nets means logic emulation with megaLUT FPGAs is becoming faster, cooler, smaller, cheaper, and more reliable. FPGA's Moore's Law scaling is escaping from Rent's Rule.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"36 1","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81317231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082786
Xian-wei Yang, R. Cheung
In this paper, we introduce a novel FPGA-based design for true random number generator (TRNG). It is able to harvest the timing difference caused by the nonuniformity of the Integrated Circuits (ICs) and use it to generate the randomness. Compared with the previous related work, this design uses a complementary scheme that leads to a doubled data rated output. The proposed complementary design has improved entropy and achieved higher throughput. The prototype design has been implemented and verified on a Xilinx Virtex-6 ML605 evaluation board. As a result, the generated random number stream is able to pass the statistical NIST and DIEHARD test suites showing a reliable performance. Meanwhile, it can approach the maximum data rate as 50 Mbps stably.
{"title":"A complementary architecture for high-speed true random number generator","authors":"Xian-wei Yang, R. Cheung","doi":"10.1109/FPT.2014.7082786","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082786","url":null,"abstract":"In this paper, we introduce a novel FPGA-based design for true random number generator (TRNG). It is able to harvest the timing difference caused by the nonuniformity of the Integrated Circuits (ICs) and use it to generate the randomness. Compared with the previous related work, this design uses a complementary scheme that leads to a doubled data rated output. The proposed complementary design has improved entropy and achieved higher throughput. The prototype design has been implemented and verified on a Xilinx Virtex-6 ML605 evaluation board. As a result, the generated random number stream is able to pass the statistical NIST and DIEHARD test suites showing a reliable performance. Meanwhile, it can approach the maximum data rate as 50 Mbps stably.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"248-251"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76412274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082746
Marcel Gort, J. Anderson
High-level synthesis (HLS) raises the level of abstraction for hardware design through the use of software methodologies. An impediment to productivity in HLS flows, however, is the run-time of the back-end toolflow - synthesis, packing, placement and routing - which can take hours or days for the largest designs. We propose a new back-end flow for HLS that makes use of pre-synthesized and placed "macros" for portions of the design, thereby reducing the amount of work to be done by the back-end tools, lowering run-time. A key aspect of our work is an analytical placement algorithm capable of handling large macros whose internal blocks have fixed relative placements, in conjunction with placing the surrounding individual logic blocks. In an experimental study, we consider the impact on run-time and quality-of-results of using macros: 1) in synthesis alone, and 2) in synthesis, packing and placement. Results show that the proposed approach reduces run-time by ~3x, on average, with a negative performance impact of ~5%.
{"title":"Design re-use for compile time reduction in FPGA high-level synthesis flows","authors":"Marcel Gort, J. Anderson","doi":"10.1109/FPT.2014.7082746","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082746","url":null,"abstract":"High-level synthesis (HLS) raises the level of abstraction for hardware design through the use of software methodologies. An impediment to productivity in HLS flows, however, is the run-time of the back-end toolflow - synthesis, packing, placement and routing - which can take hours or days for the largest designs. We propose a new back-end flow for HLS that makes use of pre-synthesized and placed \"macros\" for portions of the design, thereby reducing the amount of work to be done by the back-end tools, lowering run-time. A key aspect of our work is an analytical placement algorithm capable of handling large macros whose internal blocks have fixed relative placements, in conjunction with placing the surrounding individual logic blocks. In an experimental study, we consider the impact on run-time and quality-of-results of using macros: 1) in synthesis alone, and 2) in synthesis, packing and placement. Results show that the proposed approach reduces run-time by ~3x, on average, with a negative performance impact of ~5%.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"20 1","pages":"4-11"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81440057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082776
O. A. D. L. Junior, V. Fresse, F. Rousseau
The Networks-on-Chip (NoCs) are currently the most appropriate communication structure for many-core embedded systems. An FPGA-based emulation platform can drastically reduce the time needed to evaluate a NoC, even if it is composed by tens or hundreds of distributed components. These components should be timely managed in order to execute an evaluation traffic scenario. There is a lack of standard protocols to drive FPGA-based NoC emulators. Such protocols could ease the integration of emulation components developed by different designers. In this paper, we evaluate a light version of SNMP (Simple Network Management Protocol) to manage an FPGA-based NoC emulation platform. The SNMP protocol and its related components are adapted to a hardware implementation. This facilitates the configuration of the emulation nodes without FPGA-resynthesis, as well as the extraction of emulation results. Some experiments highlight that this protocol is quite simple to implement and very efficient for a light resources overhead.
{"title":"Evaluation of SNMP-like protocol to manage a NoC emulation platform","authors":"O. A. D. L. Junior, V. Fresse, F. Rousseau","doi":"10.1109/FPT.2014.7082776","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082776","url":null,"abstract":"The Networks-on-Chip (NoCs) are currently the most appropriate communication structure for many-core embedded systems. An FPGA-based emulation platform can drastically reduce the time needed to evaluate a NoC, even if it is composed by tens or hundreds of distributed components. These components should be timely managed in order to execute an evaluation traffic scenario. There is a lack of standard protocols to drive FPGA-based NoC emulators. Such protocols could ease the integration of emulation components developed by different designers. In this paper, we evaluate a light version of SNMP (Simple Network Management Protocol) to manage an FPGA-based NoC emulation platform. The SNMP protocol and its related components are adapted to a hardware implementation. This facilitates the configuration of the emulation nodes without FPGA-resynthesis, as well as the extraction of emulation results. Some experiments highlight that this protocol is quite simple to implement and very efficient for a light resources overhead.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"8 1","pages":"199-206"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81493685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082750
Chang Xu, Wentai Zhang, Guojie Luo
In this paper we propose a quantitative approach to analyze the impact of heterogeneous blocks (H-blocks) on the FPGA placement quality. The basic idea is to construct synthetic heterogeneous placement benchmarks with known optimal wire-length to facilitate the quantitative analysis. To the best of our knowledge, this is the first work that enables the construction of wirelength-optimal heterogeneous placement examples. Besides analyzing the quality of existing placers, we further decompose the impacts of H-blocks from the architectural aspect and netlist aspect. Our analysis shows that a heterogeneous design hides the wirelength degradation by a more compact netlist than its homogeneous version; however, the heterogeneity results in a optimality gap of 52% in wirelength, where 25% is from architectural heterogeneity and 27% is from netlist heterogeneity. Therefore, new heterogeneous placement algorithms are needed to bridge the optimality gap and improve design quality.
{"title":"Analyzing the impact of heterogeneous blocks on FPGA placement quality","authors":"Chang Xu, Wentai Zhang, Guojie Luo","doi":"10.1109/FPT.2014.7082750","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082750","url":null,"abstract":"In this paper we propose a quantitative approach to analyze the impact of heterogeneous blocks (H-blocks) on the FPGA placement quality. The basic idea is to construct synthetic heterogeneous placement benchmarks with known optimal wire-length to facilitate the quantitative analysis. To the best of our knowledge, this is the first work that enables the construction of wirelength-optimal heterogeneous placement examples. Besides analyzing the quality of existing placers, we further decompose the impacts of H-blocks from the architectural aspect and netlist aspect. Our analysis shows that a heterogeneous design hides the wirelength degradation by a more compact netlist than its homogeneous version; however, the heterogeneity results in a optimality gap of 52% in wirelength, where 25% is from architectural heterogeneity and 27% is from netlist heterogeneity. Therefore, new heterogeneous placement algorithms are needed to bridge the optimality gap and improve design quality.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"222 1","pages":"36-43"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74947045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082759
Hongyuan Ding, Miaoqing Huang
Hardware accelerators are capable of achieving significant performance improvement for many applications. In this work we demonstrate that it is critical to provide sufficient memory access bandwidth for accelerators to improve the performance and reduce energy consumption. We use the scale-invariant feature transform (SIFT) algorithm as a case study in which three bottleneck stages are accelerated on hardware logic. Based on different memory access patterns of SIFT algorithms, two different approaches are designed to accelerate different functions in SIFT on the Xilinx Zynq-7045 device. In the first approach, convolution is accelerated by designing fully customized hardware accelerator. On top of it, three interfacing methods are analyzed. In the second approach, a distributed multi-processor hardware system with its programming model is built to handle inconsecutive memory accesses. Furthermore, the last level cache (LLC) on the host processor is shared by all slaves to achieve better performance. Experiment results on the Zynq-7045 device show that the hybrid design in which two approaches are combined can achieve ~10 times and better improvement for both performance improvement and energy reduction compared with the pure software implementation for the convolution stage and the SIFT algorithm, respectively.
{"title":"Improve memory access for achieving both performance and energy efficiencies on heterogeneous systems","authors":"Hongyuan Ding, Miaoqing Huang","doi":"10.1109/FPT.2014.7082759","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082759","url":null,"abstract":"Hardware accelerators are capable of achieving significant performance improvement for many applications. In this work we demonstrate that it is critical to provide sufficient memory access bandwidth for accelerators to improve the performance and reduce energy consumption. We use the scale-invariant feature transform (SIFT) algorithm as a case study in which three bottleneck stages are accelerated on hardware logic. Based on different memory access patterns of SIFT algorithms, two different approaches are designed to accelerate different functions in SIFT on the Xilinx Zynq-7045 device. In the first approach, convolution is accelerated by designing fully customized hardware accelerator. On top of it, three interfacing methods are analyzed. In the second approach, a distributed multi-processor hardware system with its programming model is built to handle inconsecutive memory accesses. Furthermore, the last level cache (LLC) on the host processor is shared by all slaves to achieve better performance. Experiment results on the Zynq-7045 device show that the hybrid design in which two approaches are combined can achieve ~10 times and better improvement for both performance improvement and energy reduction compared with the pure software implementation for the convolution stage and the SIFT algorithm, respectively.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"30 1","pages":"91-98"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75534152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2014-12-01DOI: 10.1109/FPT.2014.7082791
Yi (Estelle) Wang, Akash Kumar, Yajun Ha
The key issue to improve the performance for secure large-scale Storage Area Network (SAN) applications lies in the speed of its encryption/decryption module. Software-based encryption/decryption cannot meet throughput requirements. To solve this problem, we propose a FPGA-based XTS-AES encryption/decryption to suit the needs for secure SAN applications with high throughput requirements. Besides throughput, area optimization is also considered in this proposed design. First, we reuse the same AES encryption to produce the tweak value and unify the operations of AES encryption/decryption in XTS-AES encryption/decryption. Second, we transfer the computations of AES encryption/decryption from GF(28) to GF(24)2, which enables us move the map and the inverse map functions outside the AES round. Third, we propose to support the SubBytes and the inverse SubBytes by the same hardware component. Finally, pipelined registers have been inserted into the proposed unrolled architecture for XTS-AES encryption/decryption. The experiments show that the proposed design achieves 36.2 Gbits/s throughput using 6784 slices on XC6VLX240T FPGA.
{"title":"FPGA-based high throughput XTS-AES encryption/decryption for storage area network","authors":"Yi (Estelle) Wang, Akash Kumar, Yajun Ha","doi":"10.1109/FPT.2014.7082791","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082791","url":null,"abstract":"The key issue to improve the performance for secure large-scale Storage Area Network (SAN) applications lies in the speed of its encryption/decryption module. Software-based encryption/decryption cannot meet throughput requirements. To solve this problem, we propose a FPGA-based XTS-AES encryption/decryption to suit the needs for secure SAN applications with high throughput requirements. Besides throughput, area optimization is also considered in this proposed design. First, we reuse the same AES encryption to produce the tweak value and unify the operations of AES encryption/decryption in XTS-AES encryption/decryption. Second, we transfer the computations of AES encryption/decryption from GF(28) to GF(24)2, which enables us move the map and the inverse map functions outside the AES round. Third, we propose to support the SubBytes and the inverse SubBytes by the same hardware component. Finally, pipelined registers have been inserted into the proposed unrolled architecture for XTS-AES encryption/decryption. The experiments show that the proposed design achieves 36.2 Gbits/s throughput using 6784 slices on XC6VLX240T FPGA.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"322 1","pages":"268-271"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76293414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}