Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082812
Yongfu He, Shaojun Wang, Yu Peng, Y. Pang, Ning Ma, Jingyue Pang
The Relevance Vector Machine (RVM), with its ability to express uncertainty, has found broad application in Prognostics and Health Management (PHM). However, the computationally intensive nature of RVM greatly limits its use. This paper presents a software/hardware co-design approach based on HMPSoC technology that efficiently exploits both the sequential and the parallel nature of RVM. A multi-channel, pipelined hardware architecture is proposed to accelerate kernel formulation and the calculation of intermediate values. The hardware, wrapped with an AXI-Stream interface, is integrated into the HMPSoC as an acceleration engine. We implement the design on an on-board PHM prototype platform with a Xilinx Zynq XC7Z020 AP SoC. Experimental results show speedups of 5.3× and 46.8× in execution time over RVM running on a PC with a Xeon 5620 processor and on an ARM Cortex-A9 processor, respectively. Energy consumption is reduced by 153.0× and 37.3×, respectively.
{"title":"High performance relevance vector machine on HMPSoC","authors":"Yongfu He, Shaojun Wang, Yu Peng, Y. Pang, Ning Ma, Jingyue Pang","doi":"10.1109/FPT.2014.7082812","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082812","url":null,"abstract":"Relevance Vector Machine (RVM) with the uncertainty expressing ability has spawned broad applications in Prognostic and Health Management (PHM). However computationally intensive intrinsic nature of RVM greatly limits its usage. This paper presents a software and hardware co-design approach based on HMPSoC technology, which efficiently exploited sequential and parallel nature of RVM. Multi-channel and pipelined hardware architecture for the acceleration of kernel formulation and intermediate values calculation is proposed. The hardware that wrapped with AXI-Stream interface is integrated into HMPSoC as an acceleration engine. We implement the design on an on-board PHM prototype platform with a Xilinx Zynq XC7Z020 AP SoC. The experiment results show 5.3× and 46.8× speed up in terms of the time cost than the RVM running on PC with a Xeon 5620 processor and ARM Cortex A9 processor. The energy consumption is reduced by 153.0× and 37.3×, respectively.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"35 1","pages":"334-337"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81126948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082813
Bin Tang, Yaping Lin, Jiliang Zhang
A physical unclonable function (PUF) is a promising hardware security primitive that can be applied in various security-related areas. The ring oscillator (RO) PUF is one of the most popular PUFs; it generates a volatile key by comparing the frequencies of ROs. Previous RO PUFs incur unacceptable hardware overhead to improve reliability by eliminating the effect of environmental factors. In this paper, we propose a frequency offset algorithm (FOA) that enhances reliability and lowers hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that the proposed FOA method has better reliability and lower hardware overhead than the temperature-aware cooperative (TAC) approach. In particular, the proposed method achieves 100% utilization of ROs.
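The key idea can be pictured with a small sketch. It is not the authors' FOA: it assumes, purely for illustration, that enrollment stores a per-pair offset (the hypothetical `PairHelper`) just large enough to push the frequency difference past the threshold, and that regeneration re-applies that offset before comparing. The frequencies, units, and threshold value are made up.

```python
from dataclasses import dataclass


@dataclass
class PairHelper:
    # Hypothetical helper data kept for one RO pair after enrollment.
    offset_mhz: float


def enroll(f_a, f_b, threshold):
    """Derive the response bit and an offset that clears the threshold."""
    delta = f_a - f_b
    bit = 1 if delta >= 0 else 0
    sign = 1.0 if bit else -1.0
    # Only pairs whose difference is below the threshold need an offset.
    shortfall = max(0.0, threshold - abs(delta))
    return bit, PairHelper(offset_mhz=sign * shortfall)


def regenerate(f_a, f_b, helper):
    """Re-derive the bit from new measurements plus the stored offset."""
    return 1 if (f_a - f_b) + helper.offset_mhz >= 0 else 0
```

For example, a pair enrolled at 200.10 MHz and 200.05 MHz with a 0.5 MHz threshold gets an offset of about 0.45 MHz. If the pair later measures 200.00 MHz and 200.20 MHz, a plain comparison flips the bit, while `regenerate` still returns the enrolled value.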
{"title":"Improving the reliability of RO PUF using frequency offset","authors":"Bin Tang, Yaping Lin, Jiliang Zhang","doi":"10.1109/FPT.2014.7082813","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082813","url":null,"abstract":"Physical unclonable function (PUF) is a promising hardware security primitive that can be applied to various security related areas. The ring oscillator (RO) PUF is one of the most popular PUFs that can generate the volatile key by comparing the frequency between ROs. Previous RO PUFs incur unacceptable hardware overheads to improve the reliability in order to eliminate the effect of environment factors. In this paper, we propose a frequency offset algorithm (FOA) to enhance the reliability and low the hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that our proposed FOA method has the better reliability and lower hardware overhead than the temperature-aware cooperative (TAC). Especially, our proposed method can achieve the 100% utilization of ROs.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"60 1","pages":"338-341"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81084593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082800
Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang
Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format in which the additional data structures for indexing are abandoned, and the location of each non-zero element of the matrix is instead indicated by the name of a variable multiplied by the corresponding element of the vector. To ensure the fastest possible access and parallel access without data hazards, on-chip registers are used exclusively, in place of BRAM or off-chip DRAM/SRAM, to hold all the SpMxV data. On-chip DSP resources are fully utilized so that the maximum number of multipliers work concurrently.
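One way to picture index-free storage is to bake the position of each non-zero into generated code, so that no row or column arrays are consulted at run time. The sketch below is only a software analogy, not the CES format itself: `unrolled_spmv_source` and `compile_spmv` are hypothetical helpers that emit one explicit product per non-zero element.

```python
def unrolled_spmv_source(matrix):
    """Emit Python source for y = A @ x with one product per non-zero."""
    lines = ["def spmv(x):"]
    for i, row in enumerate(matrix):
        # Zero entries produce no term at all, so nothing is padded.
        terms = [f"{v!r} * x[{j}]" for j, v in enumerate(row) if v != 0]
        lines.append(f"    y{i} = " + (" + ".join(terms) if terms else "0"))
    outputs = ", ".join(f"y{i}" for i in range(len(matrix)))
    lines.append(f"    return [{outputs}]")
    return "\n".join(lines)


def compile_spmv(matrix):
    """Turn the generated source into a callable specialised to `matrix`."""
    namespace = {}
    exec(unrolled_spmv_source(matrix), namespace)
    return namespace["spmv"]
```

Calling `compile_spmv([[2, 0, 0], [0, 0, 3], [0, 0, 0]])` yields a function whose body contains exactly two multiplications; the matrix structure lives in the code rather than in index arrays.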
{"title":"No zero padded sparse matrix-vector multiplication on FPGAs","authors":"Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang","doi":"10.1109/FPT.2014.7082800","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082800","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format, in which the additional data structures for indexing are abandoned, and each location associated with the non-zero element of the matrix is now indicated by the name of a variable multiplied by the corresponding element of the vector. To ensure fastest access and parallel access without data hazards, on-chip registers are used exclusively to replace the BRAM or off-chip DRAM/SRAM to hold all the SpMxV data. On-chip DSP resources are fully utilized so as to ensure a maximum number of multipliers concurrently working.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"28 4 1","pages":"290-291"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78570045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082778
K. Tatsumura, Masato Oda, S. Yasuda
Multi-context configuration memory stores multiple sets of configuration data and can change the entire configuration of an FPGA quickly, which improves hardware utilization in dynamically reconfigurable architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, small-footprint multi-context configuration memory. The multi-context memory includes multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device with a lookup table and minimal interconnect, backed by a 16-bit, 8-context configuration memory, was fabricated in a 0.18 µm CMOS process, and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.
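Context switching can be illustrated with a small software model. This is a sketch under stated assumptions, not the fabricated device: it assumes that each 16-bit configuration word is the truth table of a 4-input lookup table, it ignores the interconnect, and the class name `MultiContextLUT` is hypothetical.

```python
class MultiContextLUT:
    """Software model: 8 stored 16-bit truth tables, one active at a time."""

    CONTEXTS = 8
    BITS = 16

    def __init__(self, tables):
        if len(tables) != self.CONTEXTS:
            raise ValueError("expected 8 configuration words")
        if any(not 0 <= t < (1 << self.BITS) for t in tables):
            raise ValueError("each configuration word must fit in 16 bits")
        self.tables = list(tables)
        self.active = 0

    def switch(self, context):
        """Select another stored configuration without reloading any data."""
        if not 0 <= context < self.CONTEXTS:
            raise ValueError("context out of range")
        self.active = context

    def evaluate(self, a, b, c, d):
        """Look up the output bit for four 0/1 inputs in the active context."""
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.tables[self.active] >> index) & 1
```

With context 0 holding `0x8000` (4-input AND) and context 1 holding `0x6996` (4-input XOR), a call to `switch(1)` changes the function computed by the same lookup table because all configurations are already resident.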
{"title":"A pure-CMOS nonvolatile multi-context configuration memory for dynamically reconfigurable FPGAs","authors":"K. Tatsumura, Masato Oda, S. Yasuda","doi":"10.1109/FPT.2014.7082778","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082778","url":null,"abstract":"Multi-context configuration memory stores multiple sets of configuration data and changes the entire configuration of FPGA quickly, enabling enhancement of hardware utilization with dynamic reconfiguration architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, and small-footprint multi-context configuration memory. The multi-context memory includes multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device having a lookup table and minimum interconnect backed by 16-bit 8-context configuration memory was fabricated by a 0.18 um CMOS process and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling achievement of greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"41 1","pages":"215-222"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88066363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082825
A. Kojima
Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of a Blokus Duo player for the ICFPT2014 design contest, a revised version of the previous design for the ICFPT2013 design contest. Our design consists of a hardware logic part and a software part running on a soft IP processor. The hardware logic part calculates the evaluation value of the board state, which is a heavy task for software. The software part implements recursive alpha-beta pruning and iterative deepening, which are complex to implement as hardware logic. The current version of our implementation runs at 142 MHz on a Xilinx Artix-7. The hardware logic part evaluates about 90,000 nodes per second at the beginning of the game.
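The software-side search can be sketched as follows. This is a generic illustration of recursive alpha-beta pruning with iterative deepening, not the contest entry: a toy stone-taking game (take 1 to 3 stones, whoever takes the last stone wins) stands in for Blokus Duo, and the constant returned at the depth limit stands in for the board evaluation that the paper computes in hardware.

```python
import math


def moves(stones):
    """Legal moves in the toy game: take 1, 2 or 3 stones."""
    return [take for take in (1, 2, 3) if take <= stones]


def alphabeta(stones, depth, alpha, beta):
    """Negamax score from the point of view of the side to move."""
    if stones == 0:
        return -1.0  # the opponent took the last stone and won
    if depth == 0:
        return 0.0   # placeholder for a real board evaluation
    best = -math.inf
    for take in moves(stones):
        score = -alphabeta(stones - take, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break    # remaining moves cannot change the result
    return best


def iterative_deepening(stones, max_depth):
    """Search depth 1, 2, ... and keep the move from the deepest pass."""
    best_move = None
    for depth in range(1, max_depth + 1):
        best_score = -math.inf
        for take in moves(stones):
            score = -alphabeta(stones - take, depth - 1, -math.inf, math.inf)
            if score > best_score:
                best_score, best_move = score, take
    return best_move
```

In a time-limited contest setting, the value of iterative deepening is that a complete answer from the previous depth is always available when the clock runs out; this sketch simply runs to `max_depth`.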
{"title":"FPGA implementation of Blokus Duo player using hardware/software co-design","authors":"A. Kojima","doi":"10.1109/FPT.2014.7082825","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082825","url":null,"abstract":"Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of Blokus Duo player for ICFPT2014 design contest, which is the revised version of the previous design for ICPFT2013 design contest. Our design consists of hardware logic part and software part using soft IP processor. The hardware logic part calculates evaluation value of the board status which is a heavy task for the software part. Our implementation uses recursive Alpha-Beta pruning and iteration deepening algorithm by the software part, which are complex to implement as the hardware logic circuit. The current version of our implementation on Xilinx Artix7 can run at 142MHz. The hardware logic part evaluates about 90,000 nodes in one second at the beginning of the game.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"378-381"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84034239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082774
Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon
RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It sustains a throughput of 800 Mbps per port for 1500B packets with less than 0.3 µs latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.
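The filtering step can be pictured with a short sketch. It is not the RotoRouter protocol: the class `FilteringRouter`, the `sign` helper, and the flow identifiers are hypothetical, and an HMAC-SHA256 tag from Python's standard library stands in for the per-packet digital signature, whose actual scheme the abstract does not specify. The default table size mirrors the 1024 active connections of the prototype.

```python
import hashlib
import hmac


def sign(key, payload):
    """Stand-in per-packet tag (HMAC-SHA256, an assumption of this sketch)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


class FilteringRouter:
    def __init__(self, max_flows=1024):
        self.max_flows = max_flows
        self.flow_keys = {}  # flow id -> key agreed when the flow was approved

    def authorize(self, flow_id, key):
        """Record a flow after the destination endpoint has approved it."""
        if flow_id not in self.flow_keys and len(self.flow_keys) >= self.max_flows:
            return False  # connection table is full
        self.flow_keys[flow_id] = key
        return True

    def forward(self, flow_id, payload, tag):
        """Forward only packets of approved flows that carry a valid tag."""
        key = self.flow_keys.get(flow_id)
        if key is None:
            return False  # flow was never authorized by its destination
        return hmac.compare_digest(sign(key, payload), tag)
```

A packet from an unapproved flow, or one carrying a tag computed with the wrong key, is dropped at the router and never reaches the endpoint.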
{"title":"RotoRouter: Router support for endpoint-authorized decentralized traffic filtering to prevent DoS attacks","authors":"Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon","doi":"10.1109/FPT.2014.7082774","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082774","url":null,"abstract":"RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It is able to sustain 800 Mbps per port throughputs for 1500B packets with less than 0.3/its latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"183-190"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84325900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01
DOI: 10.1109/FPT.2014.7082758
Shaoyi Cheng, J. Wawrzynek
As high-level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute-intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of memory bandwidth and improved tolerance to data access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high-performance accelerators and general-purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3× to 9.1× is observed over direct C-to-hardware mapping using a state-of-the-art HLS tool.
{"title":"Architectural synthesis of computational pipelines with decoupled memory access","authors":"Shaoyi Cheng, J. Wawrzynek","doi":"10.1109/FPT.2014.7082758","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082758","url":null,"abstract":"As high level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of the memory bandwidth and improved tolerance to data access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high performance accelerators and general purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3 to 9.1x is observed over direct C-to-Hardware mapping using a state-of-the-art HLS tool.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"83-90"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85643289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-01-01
DOI: 10.1109/fpt.2014.7082827
Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie
Xilinx has developed even more advanced FPGAs, 2nd-generation SoCs, and 3D ICs to stay a generation ahead and deliver an extra node's worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20 nm planar through 16 nm FinFET (FF) technologies and beyond, and from monolithic through 3D ICs. This talk presents case studies of Xilinx FPGAs in cutting-edge applications, as well as the advantages of the UltraScale architecture, 2nd-generation SoCs, and design tools.

IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions
Patrick Kane, Cypress
Abstract: The Internet of Things is happening right now. The newest standard is Bluetooth Low Energy, or BLE. This may or may not be the long-term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard.
{"title":"Industrial session","authors":"Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie","doi":"10.1109/fpt.2014.7082827","DOIUrl":"https://doi.org/10.1109/fpt.2014.7082827","url":null,"abstract":"Xilinx has developed even more advanced FPGAs and 2nd generation SoCs and 3D ICs to stay a generation ahead, and deliver an extra node worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20nm planar through 16nm and beyond FinFET (FF) technologies, and from monolithic through 3D ICs. In this talk, we will study the cases about Xilinx FPGA in cutting edge applications, also the advantages of UltraScale architecture 2nd generation SoCs, and design tools. IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions Patrick Kane, Cypress Abstract: The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard. Industrial Session The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard. Industrial Session","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"42 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83589874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-01-01
DOI: 10.1109/FPT.2014.7082757
Benjamin Carrión Schäfer
{"title":"Time sharing of Runtime Coarse-Grain Reconfigurable Architectures processing elements in multi-process systems","authors":"Benjamin Carrión Schäfer","doi":"10.1109/FPT.2014.7082757","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082757","url":null,"abstract":"","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"76-82"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83458446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2013-12-01
DOI: 10.1109/FPT.2013.6718320
P. Chow
Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is highly determined by the latency and bandwidth between the CPU, the CPU memory system and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models and applications of "insocket" accelerator systems.
{"title":"Why Put FPGAs in your CPU socket?","authors":"P. Chow","doi":"10.1109/FPT.2013.6718320","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718320","url":null,"abstract":"Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is highly determined by the latency and bandwidth between the CPU, the CPU memory system and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models and applications of \"insocket\" accelerator systems.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81476395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}