Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082744
Feng-Hsiung Hsu
Microsoft has gone through massive changes in the last few years. First, it was the dominant software company. Then it became a “Devices and Services” company, and now it is “Mobile First, Cloud First”. Of course, deep down in its bones, it is still a software company. In this talk, I will give a personal account of how FPGA acceleration gradually gained traction inside Microsoft, the difficulties and lessons learned in gaining acceptance, the apparently imminent deployment of FPGAs inside Microsoft data centers, and finally what may be needed from FPGA programming tools for wider acceptance inside a company like Microsoft.
Doing FPGA in a former software company. 2014 International Conference on Field-Programmable Technology (FPT), p. 3.
Side channel attacks (SCAs) aim to extract secret information from cryptographic chips by analyzing the leakage of physical parameters. Power-analysis-based SCA is a popular approach that obtains secret keys by monitoring the power consumption of cryptographic chips. However, most SCA evaluations are performed on FPGA platforms, and many parasitic physical effects cannot be revealed before the chips are taped out. Simply ignoring these effects significantly increases the difficulty of an attack because of the corresponding measurement noise. Power supply noise has been observed to be critical for power-analysis-based SCA. This paper demonstrates a power-supply-noise-aware evaluation framework for practical side channel attacks that spans cryptographic system design through physical design. An on-chip power delivery network is implemented during the physical design stage, so the supply noise of the power network can be explored from the post-layout implementation. Additionally, countermeasures for cryptographic chips could be enhanced by on-chip decoupling capacitor placement, because of its influence on the characteristics of the power delivery network.
Power supply noise aware evaluation framework for side channel attacks and countermeasures. Jianlei Yang, Chenguang Wang, Yici Cai, Qiang Zhou. 2014 International Conference on Field-Programmable Technology (FPT), pp. 161-166. Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082770
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082799
E. Fukuda, Hiroaki Inoue, Takashi Takenaka, Dahoo Kim, Tsunaki Sadahisa, T. Asai, M. Motomura
As the volume of data that web services handle grows, many web service providers are using memcached, an in-memory key-value store, to improve their web servers' performance. While memcached usually runs on a server with a high-performance processor, various hardware platforms have been evaluated for running memcached in order to achieve higher performance. Recently, several works that use FPGAs have achieved higher performance than Xeon. These works, however, struggle to use a large memory with FPGAs. In this paper, we propose a system that overcomes this problem and enhances memcached by caching a part of software memcached's commands and data on a network interface card (NIC) equipped with an FPGA and DRAM. Our evaluation showed that the NIC cache has a hit rate of less than 30% for workloads with the Latest key selection distribution, and 30% to 60% for Zipf distribution workloads.
Achieving higher performance of memcached by caching at network interface. 2014 International Conference on Field-Programmable Technology (FPT), pp. 288-289.
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082765
A. Ehliar
In this paper we describe an open source floating-point adder and multiplier implemented using a 36-bit custom number format based on radix-16 and optimized for the 7-series FPGAs from Xilinx. Although this number format is not identical to the single-precision IEEE-754 format, the floating-point operators are designed in such a way that the numerical results for a given operation will be identical to the result from an IEEE-754 compliant operator with support for round-to-nearest-even, NaNs and Infs, and subnormal numbers. The drawback of this number format is that the rounding step is more involved than in a regular, radix-2 based operator. On the other hand, the use of a high radix means that the area cost associated with normalization and denormalization can be reduced, leading to a net area advantage for the custom number format, under the assumption that support for subnormal numbers is required. The area of the floating-point adder in a Kintex-7 FPGA is 261 slice LUTs, and the area of the floating-point multiplier is 235 slice LUTs and 2 DSP48E blocks. The adder can operate at 319 MHz and the multiplier at 305 MHz.
Area efficient floating-point adder and multiplier with IEEE-754 compatible semantics. 2014 International Conference on Field-Programmable Technology (FPT), pp. 131-138.
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082775
Shuanglong Liu, Grigorios Mingas, C. Bouganis
Particle filters (PFs) are a set of algorithms that implement recursive Bayesian filtering, representing the posterior distribution by a set of weighted samples. Resampling is a fundamental operation in PF algorithms. It consists of taking a population of samples and reconstructing it based on the weights attached to each sample, favouring the samples with large weights. However, resampling is computationally intensive when the number of samples is large and, most importantly, it is not inherently parallelizable like the other steps of the particle filter. Parallel computing devices such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have been proposed to accelerate resampling. In this paper, we propose novel parallel architectures that map four state-of-the-art resampling algorithms (systematic, residual systematic, Metropolis and rejection resampling) to an FPGA. FPGA-specific optimisations are introduced to further improve the performance of these systems. The proposed architectures are implemented in a Virtex-6 LX240T FPGA device using half of its logic resources. Compared to the respective state-of-the-art implementations on an NVIDIA K20 GPU, the achieved speedups are in the range of 1.7x-49x.
Parallel resampling for particle filters on FPGAs. 2014 International Conference on Field-Programmable Technology (FPT), pp. 191-198.
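To make the resampling step described in this abstract concrete, here is a minimal sequential sketch of systematic resampling, one of the four algorithms the abstract names. It is the textbook software formulation, not the authors' FPGA architecture; the function name `systematic_resample` and the optional `u0` argument (a fixed random offset, useful for reproducible runs) are assumptions made for illustration.

```python
import random


def systematic_resample(weights, u0=None):
    """Return the indices of the particles kept by systematic resampling.

    weights: non-negative particle weights (need not be normalized).
    u0: offset in [0, 1); drawn at random when omitted (illustrative parameter).
    """
    n = len(weights)
    total = sum(weights)
    if u0 is None:
        u0 = random.random()

    indices = []
    i = 0
    cumulative = weights[0] / total  # running cumulative normalized weight
    for k in range(n):
        # n evenly spaced pointers that all share the single offset u0
        pointer = (u0 + k) / n
        # advance along the cumulative sum until it covers the pointer;
        # this running sum is the serial dependency in the algorithm
        while pointer > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices


if __name__ == "__main__":
    print(systematic_resample([0.1, 0.2, 0.3, 0.4], u0=0.5))
```

Each pointer depends on a cumulative sum over all earlier weights, which is one way to see why resampling does not parallelize as directly as the per-particle steps of the filter.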
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082772
Duncan J. M. Moss, Zhe Zhang, Nicholas J. Fraser, P. Leong
Anomaly detection based on spectral features is applicable to a diverse range of problems including prognostics and health management, vibration analysis, astronomy, biomedical engineering and computational finance. The input data could be regularly sampled, as in the case of a standard analogue-to-digital converter sampling a bandlimited signal above the Nyquist rate, or irregularly sampled, as in the case of stock quotes or astronomical data. In this paper, we present new online algorithms for computing power spectra of regularly or irregularly sampled data and for performing anomaly detection on time series data. Both algorithms allow hardware implementations with O(1) time complexity, this being the minimum for any system that considers all the samples. We combine the two algorithms to form a power Spectrum-based Anomaly Detector (SAD). We also describe an implementation of SAD which has minimal hardware requirements and achieves one to two orders of magnitude improvement in speed, latency, power and energy over a traditional processor-based design.
An FPGA-based spectral anomaly detection system. 2014 International Conference on Field-Programmable Technology (FPT), pp. 175-182.
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082783
Hui Yan Cheah, Suhaib A. Fahmy, Nachiket Kapre
FPGA soft processors have been shown to achieve high frequency when designed around the specific capabilities of heterogeneous resources on modern FPGAs. However, such performance comes at the cost of deep pipelines, which can result in a larger number of idle cycles when executing programs with long dependency chains in the instruction sequence. We perform a full design-space exploration of a DSP-block-based soft processor to examine the effect of pipeline depth on frequency, area, and program runtime, noting the significant number of NOPs required to resolve dependencies. We then explore the potential of a restricted data forwarding approach to improve runtime by significantly reducing NOP padding. The result is a processor that runs close to the fabric limit of 500 MHz, with a case for simple data forwarding.
Analysis and optimization of a deeply pipelined FPGA soft processor. 2014 International Conference on Field-Programmable Technology (FPT), pp. 235-238.
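The NOP-padding trade-off in this abstract can be illustrated with a small counting model. The sketch below is a simplified assumption, not the paper's processor: it supposes an in-order, single-issue pipeline in which a result becomes usable `dep_latency` issue slots after its producer issues, and it treats forwarding simply as a smaller `dep_latency`. The function name `count_nops` and the `(dest, sources)` instruction encoding are hypothetical.

```python
def count_nops(program, dep_latency):
    """Count the NOPs needed to separate dependent instructions.

    program: list of (dest_register, [source_registers]) tuples in issue order.
    dep_latency: issue slots between a producer and its earliest consumer
                 (1 means the very next instruction may consume the result).
    """
    ready_at = {}  # register -> first slot at which a consumer may issue
    slot = 0       # next free issue slot
    nops = 0
    for dest, sources in program:
        earliest = max((ready_at.get(reg, 0) for reg in sources), default=0)
        if earliest > slot:
            nops += earliest - slot  # pad with NOPs until operands are ready
            slot = earliest
        ready_at[dest] = slot + dep_latency
        slot += 1
    return nops


if __name__ == "__main__":
    # A three-instruction dependency chain: r1 -> r2 -> r3
    chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
    for latency in (1, 2, 5):
        print(latency, count_nops(chain, latency))
```

Under this model a long dependency chain pays the full latency at every link, so shortening the effective producer-to-consumer distance removes padding from exactly the programs that suffer most.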
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082803
Fredrik Brosser, Emil Milh, Vilhelm Geijer, P. Larsson-Edefors
SRAM-based FPGAs are becoming increasingly attractive for use in space applications due to their reconfigurability and signal processing capabilities, as well as their increasing speed and capacity. Traditional SRAM-based FPGAs, however, are highly sensitive to the ionizing radiation environment in space, making them prone to radiation-induced memory upsets. In this paper, we evaluate and compare scrubbing techniques for Xilinx SRAM-based FPGAs with respect to radiation-induced single event upsets. A test framework using an exchangeable payload is developed for this purpose and run on a Xilinx Virtex-5 FPGA. We show that recent SRAM-based FPGAs can constitute a cost-efficient alternative to radiation-hardened or antifuse FPGAs for non-critical space applications such as satellite instruments.
Assessing scrubbing techniques for Xilinx SRAM-based FPGAs in space applications. 2014 International Conference on Field-Programmable Technology (FPT), pp. 296-299.
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082752
Aryan Tavakkoli, David B. Thomas
This paper presents a novel reconfigurable hardware accelerator for the pricing of American options using the binomial-tree model. The proposed architecture exploits both pipeline and coarse-grain parallelism in a highly efficient and scalable systolic solution, designed to exploit the large numbers of DSP blocks in modern architectures. The architecture can be tuned at compile time to match user requirements, from dedicating the entire FPGA to low-latency calculation of a single option, to high-throughput concurrent evaluation of multiple options. On a Xilinx Virtex-7 xc7vx980t FPGA this allows a single option with 768 time steps to be priced with a latency of less than 22 microseconds and a pricing rate of more than 100 K options/sec. Compared to the fastest previous reconfigurable implementation of concurrent option evaluation, we achieve an improvement of 65x in latency and 9x in throughput with a value of 10.7 G nodes/sec, on a Virtex-4 xc4vsx55 FPGA.
Low-latency option pricing using systolic binomial trees. 2014 International Conference on Field-Programmable Technology (FPT), pp. 44-51.
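For readers unfamiliar with the binomial-tree model this abstract refers to, a minimal software sketch of American put pricing follows. It assumes the common Cox-Ross-Rubinstein parameterization, which the abstract does not specify, and it is a plain sequential loop, not the systolic architecture. The function name `american_put_binomial` and all parameter names are illustrative.

```python
import math


def american_put_binomial(s0, strike, rate, sigma, maturity, steps):
    """Price an American put on a Cox-Ross-Rubinstein binomial tree (assumed model)."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability
    disc = math.exp(-rate * dt)                     # one-step discount factor

    # Option values at expiry; j counts the number of up moves
    values = [max(strike - s0 * up**j * down**(steps - j), 0.0)
              for j in range(steps + 1)]

    # Backward induction: every node depends only on its two children
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            continuation = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = strike - s0 * up**j * down**(i - j)
            values[j] = max(continuation, exercise)  # early-exercise check
    return values[0]


if __name__ == "__main__":
    print(american_put_binomial(100.0, 100.0, 0.05, 0.2, 1.0, 768))
```

In this formulation a tree with 768 time steps performs 768 * 769 / 2 = 295,296 node updates, and all nodes within one time level are independent of each other, which is the regularity a pipelined or systolic design can exploit.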
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082787
Siddhartha, Nachiket Kapre
Performance of FPGA-based token dataflow architectures is often limited by the long-tail distribution of parallelism in the compute paths of the dataflow graphs. This is known to limit the speedup of dataflow processing of sparse LU factorization to only 3-10x over CPUs. One reason behind this limitation is the serialization penalty of processing high-fanout nodes in the dataflow graph on traditional dataflow processing architectures. In this paper, we show how to apply one-time static fanout decomposition and selective node replication transformations to input dataflow graphs. These transformations are one-time static compute costs that are typically amortized over millions of iterations. For dataflow graphs extracted for sparse LU factorization, we demonstrate up to 2.3x speedup (1.2x geometric mean) with this technique across a range of benchmark problems.
Fanout decomposition dataflow optimizations for FPGA-based Sparse LU factorization. 2014 International Conference on Field-Programmable Technology (FPT), pp. 252-255.
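The idea of static fanout decomposition can be sketched on a plain adjacency-list graph. The code below is a generic illustration under stated assumptions, not the authors' transformation: it replaces any node whose fanout exceeds a limit with a tree of relay nodes, so that no single node has to send to more than `max_fanout` successors. The function name `decompose_fanout` and the `("relay", node, id)` naming of inserted nodes are hypothetical.

```python
def decompose_fanout(graph, max_fanout):
    """Bound the fanout of every node by inserting trees of relay nodes.

    graph: dict mapping each node to the list of its successors.
    max_fanout: largest number of successors any node may keep (at least 2).
    Returns a new dict; relay nodes are keyed as ("relay", original_node, id).
    """
    if max_fanout < 2:
        raise ValueError("max_fanout must be at least 2")

    result = {}
    next_id = 0
    for node, successors in graph.items():
        children = list(successors)
        # Group children under relays until the node itself is within the limit
        while len(children) > max_fanout:
            relays = []
            for start in range(0, len(children), max_fanout):
                relay = ("relay", node, next_id)
                next_id += 1
                result[relay] = children[start:start + max_fanout]
                relays.append(relay)
            children = relays
        result[node] = children
    return result


if __name__ == "__main__":
    g = {"a": ["b%d" % i for i in range(9)]}
    for name in g["a"]:
        g[name] = []
    for key, succ in decompose_fanout(g, 3).items():
        print(key, succ)
```

The transformation trades a serial fan-out at one node for extra relay hops, which only pays off when the graph is reused many times, consistent with the amortization argument in the abstract.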