A Digital Wireless Channel Emulator (DWCE) is a system that is capable of emulating the RF environment for a group of wireless devices. A major issue with current designs is that they do not scale to a large enough number of nodes to emulate meaningful network. A reason for this lack of scalability is the large amount of computations and network capacity required for such a system. Previously documented DWCE systems implement a hub-and-spoke configuration that inhibits them from simply adding additional hardware to scale. This paper investigates the use of a FPGA cluster configured as a distributed system to provide the computational and network structure to scale a DWCE to support 1250 wireless devices. This scale is approximately two orders of magnitude larger than any other previously documented system. This paper presents multiple FPGA cluster configurations that use currently available hardware and describes the algorithms used to route the signals through the network and place the computational hardware on each FPGA. The low level VHDL Signal Path Component (SPC) is synthesized and mapped under different parameters to interpolate is resource utilization. One example FPGA build with enough SPCs to fill 80% of the FPGA resources is successfully run through the Xilinx tool-chain to determine the maximum FPGA system clock speed. Finally, the scaling results are presented that detail the maximum sample frequency of various sized DWCE systems which could be used to examine a variety of wireless devices.
{"title":"A Range and Scaling Study of an FPGA-Based Digital Wireless Channel Emulator","authors":"Scott Buscemi, William V. Kritikos, R. Sass","doi":"10.1109/FCCM.2013.42","DOIUrl":"https://doi.org/10.1109/FCCM.2013.42","url":null,"abstract":"A Digital Wireless Channel Emulator (DWCE) is a system that is capable of emulating the RF environment for a group of wireless devices. A major issue with current designs is that they do not scale to a large enough number of nodes to emulate meaningful network. A reason for this lack of scalability is the large amount of computations and network capacity required for such a system. Previously documented DWCE systems implement a hub-and-spoke configuration that inhibits them from simply adding additional hardware to scale. This paper investigates the use of a FPGA cluster configured as a distributed system to provide the computational and network structure to scale a DWCE to support 1250 wireless devices. This scale is approximately two orders of magnitude larger than any other previously documented system. This paper presents multiple FPGA cluster configurations that use currently available hardware and describes the algorithms used to route the signals through the network and place the computational hardware on each FPGA. The low level VHDL Signal Path Component (SPC) is synthesized and mapped under different parameters to interpolate is resource utilization. One example FPGA build with enough SPCs to fill 80% of the FPGA resources is successfully run through the Xilinx tool-chain to determine the maximum FPGA system clock speed. Finally, the scaling results are presented that detail the maximum sample frequency of various sized DWCE systems which could be used to examine a variety of wireless devices.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125909523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a new, open-source method for FPGA CAD researchers to realize their techniques on real Xilinx devices. Specifically, we extend the Verilog-To-Routing (VTR) suite, which includes the VPR place-and-route CAD tool on which many FPGA innovations have been based, to generate working Xilinx bitstreams via the Xilinx Design Language (XDL). Currently, we can faithfully translate VPR's heterogeneous packing and placement results into an exact Xilinx `map' netlist, which is then routed by its `par' tool. We showcase the utility of this new method with two compelling applications targeting a 40nm Virtex-6 device: a fair comparison of the area, delay, and CAD runtime of academia's state-of-the-art VTR How with a commercial, closed-source equivalent, along with a CAD experiment evaluated using physical measurements of on-chip power consumption and die temperature, over time. This extended How - VTR-to-Bitstream - is released to the community with the hope that it can enhance existing research projects as well as unlock new ones.
{"title":"Escaping the Academic Sandbox: Realizing VPR Circuits on Xilinx Devices","authors":"Eddie Hung, F. Eslami, S. Wilton","doi":"10.1109/FCCM.2013.40","DOIUrl":"https://doi.org/10.1109/FCCM.2013.40","url":null,"abstract":"This paper presents a new, open-source method for FPGA CAD researchers to realize their techniques on real Xilinx devices. Specifically, we extend the Verilog-To-Routing (VTR) suite, which includes the VPR place-and-route CAD tool on which many FPGA innovations have been based, to generate working Xilinx bitstreams via the Xilinx Design Language (XDL). Currently, we can faithfully translate VPR's heterogeneous packing and placement results into an exact Xilinx `map' netlist, which is then routed by its `par' tool. We showcase the utility of this new method with two compelling applications targeting a 40nm Virtex-6 device: a fair comparison of the area, delay, and CAD runtime of academia's state-of-the-art VTR How with a commercial, closed-source equivalent, along with a CAD experiment evaluated using physical measurements of on-chip power consumption and die temperature, over time. This extended How - VTR-to-Bitstream - is released to the community with the hope that it can enhance existing research projects as well as unlock new ones.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"58 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114101829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Partial reconfiguration (PR) of field-programmable gate arrays (FPGAs) enables hardware tasks to time multiplex PR regions (PRRs) by isolating reconfiguration to only the reconfigured PRR, which avoids halting the entire FPGA's execution. Time multiplexing PRRs requires support for unloading/loading tasks and for resuming a task's execution state. In order to resume a task's execution state, the execution state (context) must be saved when the task is unloaded so that the execution state can be restored when the task resumes- context save (CS) and context restore (CR), respectively. In this paper, we present a software-based, on-chip context save and restore (CSR) for PR-capable FPGAs. As compared to prior work, our CSR is autonomous (i.e., does not require any external host support), does not require custom on-chip hardware, is portable across any system design, and does not require tool flow modifications or special tools. Experimental results extensively evaluate the CSR execution time based on PRR size, enabling designers to trade off PRR granularity for CSR execution time based on application requirements.
{"title":"On-chip Context Save and Restore of Hardware Tasks on Partially Reconfigurable FPGAs","authors":"Aurelio Morales-Villanueva, A. Gordon-Ross","doi":"10.1109/FCCM.2013.13","DOIUrl":"https://doi.org/10.1109/FCCM.2013.13","url":null,"abstract":"Partial reconfiguration (PR) of field-programmable gate arrays (FPGAs) enables hardware tasks to time multiplex PR regions (PRRs) by isolating reconfiguration to only the reconfigured PRR, which avoids halting the entire FPGA's execution. Time multiplexing PRRs requires support for unloading/loading tasks and for resuming a task's execution state. In order to resume a task's execution state, the execution state (context) must be saved when the task is unloaded so that the execution state can be restored when the task resumes- context save (CS) and context restore (CR), respectively. In this paper, we present a software-based, on-chip context save and restore (CSR) for PR-capable FPGAs. As compared to prior work, our CSR is autonomous (i.e., does not require any external host support), does not require custom on-chip hardware, is portable across any system design, and does not require tool flow modifications or special tools. Experimental results extensively evaluate the CSR execution time based on PRR size, enabling designers to trade off PRR granularity for CSR execution time based on application requirements.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128106302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-04-28DOI: 10.1109/FPGA.2013.6882273
Kenneth L. Pocek, R. Tessier, A. DeHon
For 20 years, the International Symposium on Field-Programmable Custom Computing Machines (FCCM) has explored how FPGAs and FPGA-like architectures can bring unique capabilities to computational tasks. We survey the evolution of the field of reconfigurable computing as reflected in FCCM, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.
{"title":"Birth and adolescence of reconfigurable computing: a survey of the first 20 years of field-programmable custom computing machines","authors":"Kenneth L. Pocek, R. Tessier, A. DeHon","doi":"10.1109/FPGA.2013.6882273","DOIUrl":"https://doi.org/10.1109/FPGA.2013.6882273","url":null,"abstract":"For 20 years, the International Symposium on Field-Programmable Custom Computing Machines (FCCM) has explored how FPGAs and FPGA-like architectures can bring unique capabilities to computational tasks. We survey the evolution of the field of reconfigurable computing as reflected in FCCM, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134211983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Guangwen Yang
Summary form only given. As the only method to study long-term climate trend and to predict potential climate risk, climate modeling is becoming a key research topic among governments and research organizations. One of the most essential and challenging components in climate modeling is the atmospheric model. To cover high resolution in climate simulation scenarios, developers have to face the challenges from billions of mesh points and extremely complex algorithms. Shallow Water Equations (SWEs) are a set of conservation laws that perform most of the essential characteristics of the atmosphere. The study of SWEs can serve as the starting point for understanding the dynamic behavior of the global atmosphere. We choose cubed-sphere mesh as the computational mesh for its better load balance in pole regions over other meshes such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube to the surface of the sphere. The computational domain is then the six patches, each of which is covered with N × N mesh points to be calculated. When written in local coordinates, SWEs have an identical expression on the six patches, that is ∂Q/∂t + 1/Λ ∂(ΛF1)/∂x1 + 1/Λ ∂(ΛF1)/∂z2 + S=0, (1) where (x1, x2) ∈ [-π/4, π/4] are the local coordinates, Q = (h, hu1, hu2)T is the prognostic variable, Fi = uiQ (i = 1, 2) are the convective fluxes, S is the source term. Spatially discretized with a cell-centered finite volume method and integrated with a second-order accurate TVD Runge-Kutta method, SWE solvers are transferred to the computation of a 13-point upwind stencil that exhibits a diamond shape. To get the prognostic components (h, hu1 and hu2) of the central point, its neighboring 12 points need to be accessed. The stencil kernel includes at least 434 ADD/SUB operations, 570 multiplications, 99 divisions. The high arithmetic density of the SWEs algorithm makes it difficult to implement one kernel into the resource-limited FPGA card. In this study, we first proposes a hybrid algorithm that utilizes both CPUs and FPGAs to simulate the global shallow water equations (SWEs). In each of the computational patch, most of the complicated communications happen in the two layers of the outer boundary, whose value need to be exchanged with other patches. Therefore, we decompose each of the six patches into an outer part that includes two layers of the outer boundary meshes, and an inner part that is the remaining part. We assign CPU to handle the communications and the stencil calculation of the outer part, while assign FPGA to process the inner-part stencil. In this way, FPGA and CPU will work simultaneously and the CPU time for stencil and communication can be hidden in the FPGA time for stencil. For the Virtex-6 SX475T that we use in our study, the original program in double-precision will require 299% of the on-board LUTs, 283% of the FFs and 189% of the DSPs, and cannot fit into one FPGA. In order to fit the SWE kernel into one FPGA chip, we appl
{"title":"Global Atmospheric Simulation on a Reconfigurable Platform","authors":"L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Guangwen Yang","doi":"10.1109/FCCM.2013.26","DOIUrl":"https://doi.org/10.1109/FCCM.2013.26","url":null,"abstract":"Summary form only given. As the only method to study long-term climate trend and to predict potential climate risk, climate modeling is becoming a key research topic among governments and research organizations. One of the most essential and challenging components in climate modeling is the atmospheric model. To cover high resolution in climate simulation scenarios, developers have to face the challenges from billions of mesh points and extremely complex algorithms. Shallow Water Equations (SWEs) are a set of conservation laws that perform most of the essential characteristics of the atmosphere. The study of SWEs can serve as the starting point for understanding the dynamic behavior of the global atmosphere. We choose cubed-sphere mesh as the computational mesh for its better load balance in pole regions over other meshes such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube to the surface of the sphere. The computational domain is then the six patches, each of which is covered with N × N mesh points to be calculated. When written in local coordinates, SWEs have an identical expression on the six patches, that is ∂Q/∂t + 1/Λ ∂(ΛF1)/∂x1 + 1/Λ ∂(ΛF1)/∂z2 + S=0, (1) where (x1, x2) ∈ [-π/4, π/4] are the local coordinates, Q = (h, hu1, hu2)T is the prognostic variable, Fi = uiQ (i = 1, 2) are the convective fluxes, S is the source term. Spatially discretized with a cell-centered finite volume method and integrated with a second-order accurate TVD Runge-Kutta method, SWE solvers are transferred to the computation of a 13-point upwind stencil that exhibits a diamond shape. To get the prognostic components (h, hu1 and hu2) of the central point, its neighboring 12 points need to be accessed. The stencil kernel includes at least 434 ADD/SUB operations, 570 multiplications, 99 divisions. The high arithmetic density of the SWEs algorithm makes it difficult to implement one kernel into the resource-limited FPGA card. In this study, we first proposes a hybrid algorithm that utilizes both CPUs and FPGAs to simulate the global shallow water equations (SWEs). In each of the computational patch, most of the complicated communications happen in the two layers of the outer boundary, whose value need to be exchanged with other patches. Therefore, we decompose each of the six patches into an outer part that includes two layers of the outer boundary meshes, and an inner part that is the remaining part. We assign CPU to handle the communications and the stencil calculation of the outer part, while assign FPGA to process the inner-part stencil. In this way, FPGA and CPU will work simultaneously and the CPU time for stencil and communication can be hidden in the FPGA time for stencil. For the Virtex-6 SX475T that we use in our study, the original program in double-precision will require 299% of the on-board LUTs, 283% of the FFs and 189% of the DSPs, and cannot fit into one FPGA. In order to fit the SWE kernel into one FPGA chip, we appl","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123949155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integer compression techniques can generally be classified as bit-wise and byte-wise approaches. Though at the cost of a larger processing time, bit-wise techniques typically result in a better compression ratio. The Golomb-Rice (GR) method is a bit-wise lossless technique applied to the compression of images, audio files and lists of inverted indices. However, since GR is a serial algorithm, decompression is regarded as a very slow process; to the best of our knowledge, all existing software and hardware native (non-modified) GR decoding engines operate bit-serially on the encoded stream. In this paper, we present (1) the first no-stall hardware architecture, capable of decompressing streams of integers compressed using the GR method, at a rate of several bytes (multiple integers) per hardware cycle; (2) a novel GR decoder based on the latter architecture is further detailed, operating at a peak rate of one integer per cycle. A thorough design space exploration study on the resulting resource utilization and throughput of the aforementioned approaches is presented. Furthermore, a performance study is provided, comparing software approaches to implementations of the novel hardware decoders. While occupying 10% of a Xilinx V6LX240T FPGA, the no-stall architecture core achieves a sustained throughput of over 7 Gbps.
{"title":"A High Throughput No-Stall Golomb-Rice Hardware Decoder","authors":"R. Moussalli, W. Najjar, Xi Luo, Amna Khan","doi":"10.1109/FCCM.2013.9","DOIUrl":"https://doi.org/10.1109/FCCM.2013.9","url":null,"abstract":"Integer compression techniques can generally be classified as bit-wise and byte-wise approaches. Though at the cost of a larger processing time, bit-wise techniques typically result in a better compression ratio. The Golomb-Rice (GR) method is a bit-wise lossless technique applied to the compression of images, audio files and lists of inverted indices. However, since GR is a serial algorithm, decompression is regarded as a very slow process; to the best of our knowledge, all existing software and hardware native (non-modified) GR decoding engines operate bit-serially on the encoded stream. In this paper, we present (1) the first no-stall hardware architecture, capable of decompressing streams of integers compressed using the GR method, at a rate of several bytes (multiple integers) per hardware cycle; (2) a novel GR decoder based on the latter architecture is further detailed, operating at a peak rate of one integer per cycle. A thorough design space exploration study on the resulting resource utilization and throughput of the aforementioned approaches is presented. Furthermore, a performance study is provided, comparing software approaches to implementations of the novel hardware decoders. While occupying 10% of a Xilinx V6LX240T FPGA, the no-stall architecture core achieves a sustained throughput of over 7 Gbps.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132024808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas Schweizer, Dustin Peterson, Johannes Maximilian Kühn, T. Kuhn, W. Rosenstiel
This paper introduces an FPGA-based fault injection system. To realize this system a library was developed, which implements a static mapping between a circuit described at RTL or gate-level and its corresponding placed and routed FPGA design. The aim of this mapping is to preserve module and port structure of the placed and routed FPGA design to the RT/gate-level circuit description. To demonstrate the accuracy of this mapping the ISCAS'89 benchmark circuits and the VHDL netlist of the LEON3 system are used. The results show that about 99% of the ports in the RT/gate-level circuit description can be located in the placed and routed FPGA design. Based on this library a fault injection tool was developed to accelerate the fault injection experiment time by bypassing some stages (synthesis, placement and routing) of a re-compilation process. In these experiments a 12 × speedup was achieved when compared to fault injections based on serial fault emulation.
{"title":"A Fast and Accurate FPGA-Based Fault Injection System","authors":"Thomas Schweizer, Dustin Peterson, Johannes Maximilian Kühn, T. Kuhn, W. Rosenstiel","doi":"10.1109/FCCM.2013.47","DOIUrl":"https://doi.org/10.1109/FCCM.2013.47","url":null,"abstract":"This paper introduces an FPGA-based fault injection system. To realize this system a library was developed, which implements a static mapping between a circuit described at RTL or gate-level and its corresponding placed and routed FPGA design. The aim of this mapping is to preserve module and port structure of the placed and routed FPGA design to the RT/gate-level circuit description. To demonstrate the accuracy of this mapping the ISCAS'89 benchmark circuits and the VHDL netlist of the LEON3 system are used. The results show that about 99% of the ports in the RT/gate-level circuit description can be located in the placed and routed FPGA design. Based on this library a fault injection tool was developed to accelerate the fault injection experiment time by bypassing some stages (synthesis, placement and routing) of a re-compilation process. In these experiments a 12 × speedup was achieved when compared to fault injections based on serial fault emulation.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115256534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Amouri, H. Amrouch, T. Ebi, J. Henkel, M. Tahoori
Accurate thermal profile estimation for FPGA, at design time, is necessary to avoid unexpected thermal hot-spots in the circuit before deploying the FPGA to the in-field operation. Both accurate dynamic and leakage power values are needed for the thermal profile estimation and they can be estimated using the FPGA vendor's tools. However these report leakage power as a single value for the whole chip, and no details are given in literature or the FPGA toolset about its distribution across the FPGA chip for the thermal simulation. To cope with this problem, we present a method for properly distributing the leakage power across the FPGA chip. The method uses a temperature-leakage loop estimation model for distributing and adapting the leakage power for more accurate thermal simulation. Furthermore, to accurately calibrate the presented method and its model and also to validate the resulting thermal profiles, we utilize an infrared thermal camera, which measures the emissions from the backside of a Virtex-5 FPGA chip. The results of testing several designs, with different sizes and frequencies, show that our approach can achieve accurate thermal-profile estimation when compared to the camera measurements, with average absolute estimation error of around 1°C across the chip.
{"title":"Accurate Thermal-Profile Estimation and Validation for FPGA-Mapped Circuits","authors":"A. Amouri, H. Amrouch, T. Ebi, J. Henkel, M. Tahoori","doi":"10.1109/FCCM.2013.48","DOIUrl":"https://doi.org/10.1109/FCCM.2013.48","url":null,"abstract":"Accurate thermal profile estimation for FPGA, at design time, is necessary to avoid unexpected thermal hot-spots in the circuit before deploying the FPGA to the in-field operation. Both accurate dynamic and leakage power values are needed for the thermal profile estimation and they can be estimated using the FPGA vendor's tools. However these report leakage power as a single value for the whole chip, and no details are given in literature or the FPGA toolset about its distribution across the FPGA chip for the thermal simulation. To cope with this problem, we present a method for properly distributing the leakage power across the FPGA chip. The method uses a temperature-leakage loop estimation model for distributing and adapting the leakage power for more accurate thermal simulation. Furthermore, to accurately calibrate the presented method and its model and also to validate the resulting thermal profiles, we utilize an infrared thermal camera, which measures the emissions from the backside of a Virtex-5 FPGA chip. The results of testing several designs, with different sizes and frequencies, show that our approach can achieve accurate thermal-profile estimation when compared to the camera measurements, with average absolute estimation error of around 1°C across the chip.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114787891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Markov Chain Monte Carlo (MCMC) is an ubiquitous stochastic method, used to draw random samples from arbitrary probability distributions, such as the ones encountered in Bayesian inference. MCMC often requires forbiddingly long runtimes to give a representative sample in problems with high dimensions and large-scale data. Field-Programmable Gate Arrays (FPGAs) have proven to be a suitable platform for MCMC acceleration due to their ability to support massive parallelism. This paper introduces an automated method, which minimizes the floating point precision of the most computationally intensive part of an FPGA-mapped MCMC sampler, while keeping the precision-related bias in the output within a user-specified tolerance. The method is based on an efficient bias estimator, proposed here, which is able to estimate the bias in the output with only few random samples. The optimization process involves FPGA pre-runs, which estimate the bias and choose the optimized precision. This precision is then used to reconfigure the FPGA for the final, long MCMC run, allowing for higher sampling throughputs. The process requires no user intervention. The method is tested on two Bayesian inference case studies: Mixture models and neural network regression. The achieved speedups over double-precision FPGA designs were 3.5x-5x (including the optimization overhead). Comparisons with a sequential CPU and a GPGPU showed speedups of 223x-446x and 16x-18x respectively.
{"title":"On Optimizing the Arithmetic Precision of MCMC Algorithms","authors":"Grigorios Mingas, Farhan Rahman, C. Bouganis","doi":"10.1109/FCCM.2013.31","DOIUrl":"https://doi.org/10.1109/FCCM.2013.31","url":null,"abstract":"Markov Chain Monte Carlo (MCMC) is an ubiquitous stochastic method, used to draw random samples from arbitrary probability distributions, such as the ones encountered in Bayesian inference. MCMC often requires forbiddingly long runtimes to give a representative sample in problems with high dimensions and large-scale data. Field-Programmable Gate Arrays (FPGAs) have proven to be a suitable platform for MCMC acceleration due to their ability to support massive parallelism. This paper introduces an automated method, which minimizes the floating point precision of the most computationally intensive part of an FPGA-mapped MCMC sampler, while keeping the precision-related bias in the output within a user-specified tolerance. The method is based on an efficient bias estimator, proposed here, which is able to estimate the bias in the output with only few random samples. The optimization process involves FPGA pre-runs, which estimate the bias and choose the optimized precision. This precision is then used to reconfigure the FPGA for the final, long MCMC run, allowing for higher sampling throughputs. The process requires no user intervention. The method is tested on two Bayesian inference case studies: Mixture models and neural network regression. The achieved speedups over double-precision FPGA designs were 3.5x-5x (including the optimization overhead). Comparisons with a sequential CPU and a GPGPU showed speedups of 223x-446x and 16x-18x respectively.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115292366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This summary paper1 proposes an FPGA-based array processor which performs Laplacian filtering on a 40 by 40 pixel grayscale video. The architecture comprises of bit-serial pixel processors interconnected to give a two-dimensional mesh array. This architecture features the novel use of partial reconfiguration which transfers data to and fro the array. Each processor occupies a configurable logic block and achieves a target frame rate of 10000 frames per second, at an operating frequency of 0.31 MHz on the Virtex-6 ML605 Evaluation Kit. The detailed correspondence between the contents of slice lookup tables and the Virtex-6 bitstream format is also documented.
{"title":"High Speed Video Processing Using Fine-Grained Processing on FPGA Platform","authors":"Z. Ang, Akash Kumar, Yajun Ha","doi":"10.1109/FCCM.2013.32","DOIUrl":"https://doi.org/10.1109/FCCM.2013.32","url":null,"abstract":"This summary paper1 proposes an FPGA-based array processor which performs Laplacian filtering on a 40 by 40 pixel grayscale video. The architecture comprises of bit-serial pixel processors interconnected to give a two-dimensional mesh array. This architecture features the novel use of partial reconfiguration which transfers data to and fro the array. Each processor occupies a configurable logic block and achieves a target frame rate of 10000 frames per second, at an operating frequency of 0.31 MHz on the Virtex-6 ML605 Evaluation Kit. The detailed correspondence between the contents of slice lookup tables and the Virtex-6 bitstream format is also documented.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115460457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}