Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082763
Kan Shi, D. Boland, G. Constantinides
Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations in digit-serial fashion with the most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit-parallel online operators can fail more gracefully than conventional arithmetic operators when operating beyond the deterministic clocking region. Unfortunately, the use of online arithmetic operators has historically required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that, in comparison to direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2x and 1.5x, for adders and multipliers respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives are reduced from 8.41x and 8.11x to 1.88x and 1.84x, respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate a method to create an online multiplier with a reduced-precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.
{"title":"Efficient FPGA implementation of digit parallel online arithmetic operators","authors":"Kan Shi, D. Boland, G. Constantinides","doi":"10.1109/FPT.2014.7082763","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082763","url":null,"abstract":"Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations in digit serial with most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit parallel online operators can fail more gracefully when operating beyond the deterministic clocking region in comparison to operators with conventional arithmetic. Unfortunately, the utilization of online arithmetic operators in the past has required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that in comparison to a direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2x and 1.5x for adders and multipliers, respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives is reduced from 8.41 x and 8.11 x to 1.88x and 1.84x respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate the method to create an online multiplier with a reduced precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"10 1","pages":"115-122"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79937125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082813
Bin Tang, Yaping Lin, Jiliang Zhang
Physical unclonable function (PUF) is a promising hardware security primitive that can be applied to various security-related areas. The ring oscillator (RO) PUF is one of the most popular PUFs; it generates a volatile key by comparing the frequencies of ROs. Previous RO PUFs incur unacceptable hardware overheads to improve reliability and eliminate the effect of environmental factors. In this paper, we propose a frequency offset algorithm (FOA) to enhance reliability and lower the hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that our proposed FOA method achieves better reliability and lower hardware overhead than the temperature-aware cooperative (TAC) approach. In particular, our proposed method achieves 100% utilization of the ROs.
{"title":"Improving the reliability of RO PUF using frequency offset","authors":"Bin Tang, Yaping Lin, Jiliang Zhang","doi":"10.1109/FPT.2014.7082813","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082813","url":null,"abstract":"Physical unclonable function (PUF) is a promising hardware security primitive that can be applied to various security related areas. The ring oscillator (RO) PUF is one of the most popular PUFs that can generate the volatile key by comparing the frequency between ROs. Previous RO PUFs incur unacceptable hardware overheads to improve the reliability in order to eliminate the effect of environment factors. In this paper, we propose a frequency offset algorithm (FOA) to enhance the reliability and low the hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that our proposed FOA method has the better reliability and lower hardware overhead than the temperature-aware cooperative (TAC). Especially, our proposed method can achieve the 100% utilization of ROs.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"60 1","pages":"338-341"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81084593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082800
Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang
Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format, in which the additional indexing data structures are abandoned and each non-zero element of the matrix is instead represented as a named variable multiplied by the corresponding element of the vector. To ensure the fastest possible access and parallel access without data hazards, on-chip registers are used exclusively, in place of BRAM or off-chip DRAM/SRAM, to hold all the SpMxV data. On-chip DSP resources are fully utilized so that the maximum number of multipliers work concurrently.
{"title":"No zero padded sparse matrix-vector multiplication on FPGAs","authors":"Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang","doi":"10.1109/FPT.2014.7082800","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082800","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format, in which the additional data structures for indexing are abandoned, and each location associated with the non-zero element of the matrix is now indicated by the name of a variable multiplied by the corresponding element of the vector. To ensure fastest access and parallel access without data hazards, on-chip registers are used exclusively to replace the BRAM or off-chip DRAM/SRAM to hold all the SpMxV data. On-chip DSP resources are fully utilized so as to ensure a maximum number of multipliers concurrently working.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"28 4 1","pages":"290-291"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78570045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082778
K. Tatsumura, Masato Oda, S. Yasuda
Multi-context configuration memory stores multiple sets of configuration data and changes the entire configuration of an FPGA quickly, enabling improved hardware utilization in dynamic reconfiguration architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, and small-footprint multi-context configuration memory. The multi-context memory comprises multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device containing a lookup table and minimal interconnect, backed by a 16-bit 8-context configuration memory, was fabricated in a 0.18 um CMOS process and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.
{"title":"A pure-CMOS nonvolatile multi-context configuration memory for dynamically reconfigurable FPGAs","authors":"K. Tatsumura, Masato Oda, S. Yasuda","doi":"10.1109/FPT.2014.7082778","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082778","url":null,"abstract":"Multi-context configuration memory stores multiple sets of configuration data and changes the entire configuration of FPGA quickly, enabling enhancement of hardware utilization with dynamic reconfiguration architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, and small-footprint multi-context configuration memory. The multi-context memory includes multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device having a lookup table and minimum interconnect backed by 16-bit 8-context configuration memory was fabricated by a 0.18 um CMOS process and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling achievement of greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"41 1","pages":"215-222"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88066363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082825
A. Kojima
Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of a Blokus Duo player for the ICFPT2014 design contest, a revised version of our previous design for the ICFPT2013 design contest. Our design consists of a hardware logic part and a software part running on a soft IP processor. The hardware logic part calculates the evaluation value of the board state, which is a heavy task for the software part. Our implementation performs recursive alpha-beta pruning and iterative deepening in the software part, since these algorithms are complex to implement as hardware logic. The current version of our implementation on a Xilinx Artix-7 runs at 142 MHz. The hardware logic part evaluates about 90,000 nodes per second at the beginning of the game.
{"title":"FPGA implementation of Blokus Duo player using hardware/software co-design","authors":"A. Kojima","doi":"10.1109/FPT.2014.7082825","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082825","url":null,"abstract":"Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of Blokus Duo player for ICFPT2014 design contest, which is the revised version of the previous design for ICPFT2013 design contest. Our design consists of hardware logic part and software part using soft IP processor. The hardware logic part calculates evaluation value of the board status which is a heavy task for the software part. Our implementation uses recursive Alpha-Beta pruning and iteration deepening algorithm by the software part, which are complex to implement as the hardware logic circuit. The current version of our implementation on Xilinx Artix7 can run at 142MHz. The hardware logic part evaluates about 90,000 nodes in one second at the beginning of the game.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"378-381"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84034239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082774
Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon
RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It sustains 800 Mbps per-port throughput for 1500B packets with less than 0.3 μs latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.
{"title":"RotoRouter: Router support for endpoint-authorized decentralized traffic filtering to prevent DoS attacks","authors":"Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon","doi":"10.1109/FPT.2014.7082774","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082774","url":null,"abstract":"RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It is able to sustain 800 Mbps per port throughputs for 1500B packets with less than 0.3/its latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"183-190"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84325900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082758
Shaoyi Cheng, J. Wawrzynek
As high-level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute-intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of memory bandwidth and improved tolerance to data-access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high-performance accelerators and general-purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3x to 9.1x is observed over direct C-to-hardware mapping using a state-of-the-art HLS tool.
{"title":"Architectural synthesis of computational pipelines with decoupled memory access","authors":"Shaoyi Cheng, J. Wawrzynek","doi":"10.1109/FPT.2014.7082758","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082758","url":null,"abstract":"As high level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of the memory bandwidth and improved tolerance to data access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high performance accelerators and general purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3 to 9.1x is observed over direct C-to-Hardware mapping using a state-of-the-art HLS tool.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"83-90"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85643289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-01-01 | DOI: 10.1109/fpt.2014.7082827
Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie
Xilinx has developed even more advanced FPGAs, 2nd-generation SoCs, and 3D ICs to stay a generation ahead and deliver an extra node's worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20nm planar through 16nm and beyond FinFET (FF) technologies, and from monolithic through 3D ICs. In this talk, we will present case studies of Xilinx FPGAs in cutting-edge applications, along with the advantages of the UltraScale architecture, 2nd-generation SoCs, and design tools. IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions, Patrick Kane, Cypress. Abstract: The Internet of Things is happening right now. The newest standard is Bluetooth Low Energy, or BLE. This may or may not be the long-term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard.
{"title":"Industrial session","authors":"Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie","doi":"10.1109/fpt.2014.7082827","DOIUrl":"https://doi.org/10.1109/fpt.2014.7082827","url":null,"abstract":"Xilinx has developed even more advanced FPGAs and 2nd generation SoCs and 3D ICs to stay a generation ahead, and deliver an extra node worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20nm planar through 16nm and beyond FinFET (FF) technologies, and from monolithic through 3D ICs. In this talk, we will study the cases about Xilinx FPGA in cutting edge applications, also the advantages of UltraScale architecture 2nd generation SoCs, and design tools. IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions Patrick Kane, Cypress Abstract: The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard. Industrial Session The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard. Industrial Session","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"42 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83589874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-01-01 | DOI: 10.1109/FPT.2014.7082757
Benjamin Carrión Schäfer
{"title":"Time sharing of Runtime Coarse-Grain Reconfigurable Architectures processing elements in multi-process systems","authors":"Benjamin Carrión Schäfer","doi":"10.1109/FPT.2014.7082757","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082757","url":null,"abstract":"","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"76-82"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83458446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718320
P. Chow
Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is largely determined by the latency and bandwidth between the CPU, the CPU memory system, and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models, and applications of "in-socket" accelerator systems.
{"title":"Why Put FPGAs in your CPU socket?","authors":"P. Chow","doi":"10.1109/FPT.2013.6718320","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718320","url":null,"abstract":"Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is highly determined by the latency and bandwidth between the CPU, the CPU memory system and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models and applications of \"insocket\" accelerator systems.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81476395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}