Eigenvalue decomposition (EVD) is a widely used factorization tool for principal component analysis, and has been employed for dimensionality reduction and pattern recognition in many scientific and engineering applications, such as image processing, text mining, and wireless communications. EVD is considered computationally expensive, and as software implementations have not been able to meet the performance requirements of many real-time applications, reconfigurable computing technology has shown promise in accelerating this type of computation. In this paper, we present an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices. Our experimental results on an FPGA-based hybrid acceleration system demonstrate the efficiency of our novel array architecture, with dimension-dependent speedups over an optimized software implementation ranging from 1.5× to 15.45× in computation time.
Xinying Wang and Joseph Zambreno, "An Efficient Architecture for Floating-Point Eigenvalue Decomposition," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.27
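As context for the computation being accelerated, the following is a minimal software sketch of principal component analysis via EVD in NumPy. It is not the paper's FPGA architecture, and all function and variable names are our own.

```python
import numpy as np

def pca_evd(X, k):
    """Project the rows of X onto the top-k principal components via EVD."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = (Xc.T @ Xc) / (len(X) - 1)           # sample covariance matrix
    w, V = np.linalg.eigh(C)                 # EVD of the symmetric matrix
    order = np.argsort(w)[::-1]              # sort eigenpairs, largest first
    return Xc @ V[:, order[:k]]              # scores in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_evd(X, 2)
print(Y.shape)  # (100, 2)
```

The covariance matrix is symmetric, so `eigh` applies; the FPGA architecture in the paper accelerates this eigendecomposition step for large matrices.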
Computer vision applications make extensive use of floating-point number representation, in both single and double precision. The major advantage of floating-point representation is the very large range of values that can be represented with a limited number of bits. Most CPU and all GPU designs have been extensively optimized for low-latency, high-throughput processing of floating-point operations. On an FPGA, the bit-width of operands is a major determinant of resource utilization, achievable clock frequency, and hence throughput. By using a fixed-point representation with fewer bits, an application developer can implement more processing units, reach a higher clock frequency, and obtain dramatically higher throughput. However, smaller bit-widths may lead to inaccurate or incorrect results. Object and human detection are fundamental problems in computer vision and a very active research area. In these applications, high throughput and economy of resources are highly desirable features, allowing the applications to be embedded in mobile or field-deployable equipment. The Histogram of Oriented Gradients (HOG) algorithm [1], developed for human detection and later extended to object detection, is one of the most successful and popular algorithms in its class. In this algorithm, object descriptors are extracted from a detection window using a grid of overlapping blocks. Each block is divided into cells in which histograms of intensity gradients are collected as HOG features. Vectors of histograms are normalized and passed to a Support Vector Machine (SVM) classifier to recognize a person or an object.
Xiaoyin Ma, W. Najjar, and A. Roy-Chowdhury, "High-Throughput Fixed-Point Object Detection on FPGAs," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.40
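The per-cell step described above can be sketched as follows: compute intensity gradients, then accumulate a magnitude-weighted orientation histogram. This is a floating-point reference in NumPy (the paper's contribution is a fixed-point FPGA version of this arithmetic); the function name and bin count default are our own.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one cell."""
    gy, gx = np.gradient(cell.astype(float))      # intensity gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hist = np.zeros(n_bins)
    idx = (ang / (180.0 / n_bins)).astype(int) % n_bins
    np.add.at(hist, idx, mag)                     # accumulate magnitude per bin
    return hist

cell = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 cell, constant gradient
h = cell_histogram(cell)
print(h.shape)  # (9,)
```

In the full algorithm, cell histograms are grouped per block, normalized, and concatenated into the descriptor that feeds the SVM.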
S. Sanae, Yuko Hara-Azumi, S. Yamashita, Y. Nakashima
In this work, we first study LUT optimization in PPCs to increase their area-efficiency for yield improvement. We focus on the fact that although 2^(2^n) configurations are available for an n-input LUT, such full programmability is not needed: one configuration is enough to bypass one specific fault. We then optimize away the unnecessarily rich programmability of LUTs by exploiting application features, in order to reduce the area cost without degrading the fault bypassability of the original PPC.
S. Sanae, Yuko Hara-Azumi, S. Yamashita, and Y. Nakashima, "Better-Than-DMR Techniques for Yield Improvement," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.21
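The configuration count cited above follows from the LUT being a truth table: an n-input LUT has 2^n rows, each independently programmable, giving 2^(2^n) configurations. A quick arithmetic check:

```python
def lut_configs(n):
    """Number of distinct configurations of an n-input LUT: 2**(2**n)."""
    return 2 ** (2 ** n)

for n in (2, 4, 6):
    print(n, lut_configs(n))
# A 2-input LUT has 16 configurations; a 6-input LUT already admits 2**64,
# the "full programmability" the paper argues is unnecessary for fault bypass.
```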
J. Fowers, Kalin Ovtcharov, K. Strauss, Eric S. Chung, G. Stitt
Sparse matrix-vector multiplication (SMVM) is a crucial primitive used in a variety of scientific and commercial applications. Despite having significant parallelism, SMVM is a challenging kernel to optimize due to its irregular memory access characteristics. Numerous studies have proposed the use of FPGAs to accelerate SMVM implementations. However, most prior approaches focus on parallelizing multiply-accumulate operations within a single row of the matrix (which limits parallelism if rows are small) and/or make inefficient use of the memory system when fetching matrix and vector elements. In this paper, we introduce an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping hardware complexity and on-chip memory usage low. This system compares favorably with prior FPGA SMVM implementations. For the over 700 University of Florida sparse matrices we evaluated, it also performs within about two thirds of CPU SMVM performance on average, despite having 2.4x lower DRAM memory bandwidth, and within about one third of GPU SMVM performance on average, even at 9x lower memory bandwidth. Additionally, it consumes only 25W, for power efficiencies 2.6x and 2.3x higher than CPU and GPU, respectively, based on maximum device power.
J. Fowers, Kalin Ovtcharov, K. Strauss, Eric S. Chung, and G. Stitt, "A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.23
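For reference, the kernel being accelerated is the standard row-wise SMVM over a compressed sparse row (CSR) encoding, sketched below. This is the conventional formulation whose limited intra-row parallelism motivates the paper's cross-row encoding; it is not the paper's own format.

```python
import numpy as np

def csr_smvm(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in CSR form."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):                        # one dot product per row
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[k] * x[col_idx[k]]      # irregular access into x
    return y

# 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_smvm(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))
# [3. 3. 9.]
```

Note that short rows (like row 1 here, with one nonzero) leave little work to parallelize within a row, which is exactly the limitation the across-row encoding addresses.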
J. Luu, Conor McCullough, Sen Wang, Safeen Huda, Bo Yan, Charles Chiasson, K. Kent, J. Anderson, Jonathan Rose, Vaughn Betz
Hardened adder and carry logic is widely used in commercial FPGAs to improve the efficiency of arithmetic functions. There are many design choices and complexities associated with such hardening, including circuit design, FPGA architectural choices, and the CAD flow. There has been very little study of these choices, however, and hence we explore a number of possibilities for hard adder design. We also highlight optimizations during front-end elaboration that help ameliorate the restrictions placed on logic synthesis by hardened arithmetic. We show that hard adders and carry chains, when used for simple adders, increase performance by a factor of four or more, but on larger benchmark designs that contain arithmetic, they improve overall performance by roughly 15%. We measure an average area increase of 5% for architectures with carry chains, but believe that better logic synthesis should reduce this penalty. Interestingly, we show that adding dedicated inter-logic-block carry links or fast carry-look-ahead hardened adders results in only minor delay improvements for complete designs.
J. Luu, Conor McCullough, Sen Wang, Safeen Huda, Bo Yan, Charles Chiasson, K. Kent, J. Anderson, Jonathan Rose, and Vaughn Betz, "On Hard Adders and Carry Chains in FPGAs," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.25
Yanbiao Li, Dafang Zhang, Xian Yu, Jing Long, W. Liang
Summary form only given. Named Data Networking (NDN) is an emerging future Internet architecture with an alternative communication paradigm. In NDN, name lookup plays the same role in forwarding that IP address lookup plays in TCP/IP. However, performing Longest Prefix Matching (LPM) on NDN names is more challenging. Recently, Graphics Processing Units (GPUs) have been shown to be of value in supporting wire-speed name lookup, but the latency resulting from batching and transferring names is less encouraging. On the other hand, in the area of IP address lookup, FPGAs are widely used to implement Static Random Access Memory (SRAM)-based pipelines for fast lookup with controllable latency. Thus, in this paper, we study how to accelerate NDN name lookup using an FPGA-based pipeline.
Yanbiao Li, Dafang Zhang, Xian Yu, Jing Long, and W. Liang, "From GPU to FPGA: A Pipelined Hierarchical Approach to Fast and Memory-Efficient NDN Name Lookup," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.39
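The lookup being accelerated can be illustrated with a component-wise trie: NDN names are hierarchical (e.g. /com/example/video), and LPM descends one name component per level, remembering the deepest FIB entry seen. Each trie level maps naturally to one stage of an SRAM pipeline. The FIB contents and next-hop labels below are made up for illustration.

```python
def build_trie(fib):
    """Build a component-wise trie from a dict of name prefix -> next hop."""
    root = {}
    for name, nexthop in fib.items():
        node = root
        for comp in name.strip("/").split("/"):
            node = node.setdefault(comp, {})
        node["_nh"] = nexthop                 # mark the end of a FIB prefix
    return root

def lpm(trie, name):
    """Longest Prefix Match: return the next hop of the deepest matching prefix."""
    node, best = trie, None
    for comp in name.strip("/").split("/"):   # one trie level per name component
        if comp not in node:
            break
        node = node[comp]
        best = node.get("_nh", best)          # remember deepest match so far
    return best

fib = {"/com": 1, "/com/example": 2, "/com/example/video": 3}
trie = build_trie(fib)
print(lpm(trie, "/com/example/video/seg0"))  # 3
print(lpm(trie, "/com/other"))               # 1
```

Unlike fixed-width IP prefixes, name components are variable-length strings of unbounded count, which is what makes NDN LPM harder to pipeline than IP lookup.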
With continued scaling, all transistors are no longer created equal. The delay of a length-4 horizontal routing segment at coordinates (23,17) will differ from one at (12,14) in the same FPGA, and from the same segment in another FPGA. Vendor tools give conservative values for these delays, but knowing exactly what the delays are can be invaluable. In this paper, we show how to obtain this information inexpensively, using only components that already exist on the FPGA (configurable PLLs, registers, logic, and interconnect). The techniques we present are general and can be used to measure the delay of any resource on any FPGA with these components. We provide general algorithms for identifying the set of useful delay components, the set of measurements necessary to compute them, and the calculations that perform the computation. We demonstrate our techniques on the interconnect of an Altera Cyclone III (65nm). As a result, we are able to quantify a spread of over 100 ps in the delays of nominally identical routing segments on a single FPGA.
Benjamin Gojman and A. DeHon, "GROK-INT: Generating Real On-Chip Knowledge for Interconnect Delays Using Timing Extraction," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.31
C. Kersey, S. Yalamanchili, Hyojong Kim, Nimit Nigania, Hyesoon Kim
General-purpose GPUs (GPGPUs) have taken their place in the market, being present in 38 of the Top 500 supercomputers [5]. Just as the emergence of FPGAs in the 1980s led to a demand for soft cores with instruction sets similar to the CPUs of the day, we anticipate a similar demand in the 2010s for soft cores with GPGPU instruction sets. These architectures are distinguished by their single-instruction, multiple-thread (SIMT) execution model, achieving throughput by running multiple threads of execution simultaneously across multiple functional units while keeping separate register values for each lane of execution.
C. Kersey, S. Yalamanchili, Hyojong Kim, Nimit Nigania, and Hyesoon Kim, "Harmonica: An FPGA-Based Data Parallel Soft Core," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.53
This work takes an architectural approach to systematically characterizing the components and mechanisms that are the main sources of low operating clock frequency when implementing a typical pipelined general-purpose processor on an FPGA. Several previous works have addressed specific implementation inefficiencies, but mostly on a case-by-case basis. Accordingly, there is a need to systematically characterize the sources of inefficiency in soft processor designs. Such a characterization deepens our understanding of FPGA implementation trade-offs and can serve as the starting point for developing FPGA-friendly designs that achieve higher performance and/or lower area. We start with a typical 5-stage pipelined architecture that is optimized for custom-logic implementation and that focuses on correctness, modularity, and speed of development.
Kaveh Aasaraai and Andreas Moshovos, "An Architectural Approach to Characterizing and Eliminating Sources of Inefficiency in a Soft Processor Design," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.51
Summary form only given. The Discrete Fourier Transform (DFT) can be viewed as the Fourier Transform of a periodic and regularly sampled signal, as commonly defined in equation 1. The Non-Uniform Discrete Fourier Transform (NuDFT) is a generalization of the DFT for data that may not be regularly sampled in the spatial or temporal dimensions. This flexibility is beneficial in situations where sensor placement cannot be guaranteed to be regular, or where prior knowledge of the informational content allows for better sampling patterns than a regular one. The NuDFT is used in applications such as Synthetic Aperture Radar (SAR), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). The NuDFT definition is shown in equation 2. Here the sample locations are points s_i in the set S. Each point s_i is associated with a complex sample value and has location (or frequency) components s_ix and s_iy. The location or frequency components are, of course, not restricted to a discrete sampling grid.
Umer I. Cheema, G. Nash, R. Ansari, and A. Khokhar, "Memory Optimized Re-gridding for Non-uniform Fast Fourier Transform on FPGAs," 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014. doi:10.1109/FCCM.2014.35
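The standard definitions the abstract refers to can be written out in one dimension: the uniform DFT is X[k] = sum_n x[n] e^{-2*pi*i*k*n/N} over integer frequencies k, and a non-uniform DFT evaluates the same sum at arbitrary (non-integer) frequency locations. The sketch below is the direct O(N^2) evaluation, not the fast re-gridding method the paper accelerates; names are our own.

```python
import numpy as np

def dft(x):
    """Uniform DFT: evaluate the transform at integer frequencies 0..N-1."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

def nudft_1d(x, s):
    """Non-uniform DFT: evaluate the same sum at arbitrary frequencies s."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * f * n / N))
                     for f in s])

x = np.array([1.0, 2.0, 3.0, 4.0])
# At integer frequencies the NuDFT reduces to the uniform DFT
assert np.allclose(nudft_1d(x, [0, 1, 2, 3]), dft(x))
print(np.round(dft(x), 6))  # matches np.fft.fft(x)
```

Re-gridding methods interpolate the non-uniform samples onto a regular grid so the FFT can be applied, which is where the memory-access optimization of the paper comes in.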