
Latest publications from the 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

An Efficient Architecture for Floating-Point Eigenvalue Decomposition
Xinying Wang, Joseph Zambreno
Eigenvalue decomposition (EVD) is a widely-used factorization tool to perform principal component analysis, and has been employed for dimensionality reduction and pattern recognition in many scientific and engineering applications, such as image processing, text mining and wireless communications. EVD is considered computationally expensive, and as software implementations have not been able to meet the performance requirements of many real-time applications, the use of reconfigurable computing technology has shown promise in accelerating this type of computation. In this paper, we present an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices. Our experimental results using an FPGA-based hybrid acceleration system indicate the efficiency of our novel array architecture, with dimension-dependent speedups over an optimized software implementation that range from 1.5× to 15.45× in terms of computation time.
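As a point of reference for the software baseline the abstract compares against, the sketch below computes a symmetric EVD with the cyclic Jacobi method, a rotation-based algorithm commonly mapped onto systolic FPGA arrays. This is an illustrative assumption about the algorithm family, not a description of the paper's actual architecture; `jacobi_evd` and its parameters are invented here.

```python
# Hedged sketch: a pure-Python cyclic Jacobi EVD for a symmetric matrix.
# Jacobi-style rotations are a classic fit for FPGA array architectures,
# but this is an assumed stand-in, not the paper's design.
import math

def jacobi_evd(a, sweeps=10):
    """Cyclic Jacobi EVD of a symmetric matrix (list of lists).
    Returns the sorted eigenvalues (the diagonal after convergence)."""
    n = len(a)
    a = [row[:] for row in a]                # work on a copy
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-12:
                    continue
                # Rotation angle chosen to zero out a[p][q].
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):           # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):           # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
    return sorted(a[i][i] for i in range(n))
```

Each rotation touches only two rows and two columns, which is what makes the method amenable to parallel array hardware.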
DOI: 10.1109/FCCM.2014.27 · Published: 2014-05-11
Citations: 3
High-Throughput Fixed-Point Object Detection on FPGAs
Xiaoyin Ma, W. Najjar, A. Roy-Chowdhury
Computer vision applications make extensive use of floating-point number representation, both single and double precision. The major advantage of floating-point representation is the very large range of values that can be represented with a limited number of bits. Most CPU designs, and all GPU designs, have been extensively optimized for short-latency, high-throughput processing of floating-point operations. On an FPGA, the bit-width of operands is a major determinant of its resource utilization, the achievable clock frequency, and hence its throughput. By using a fixed-point representation with fewer bits, an application developer can implement more processing units, reach a higher clock frequency, and achieve a dramatically larger throughput. However, smaller bit-widths may lead to inaccurate or incorrect results. Object and human detection are fundamental problems in computer vision and a very active research area. In these applications, high throughput and economy of resources are highly desirable features, allowing the applications to be embedded in mobile or field-deployable equipment. The Histogram of Oriented Gradients (HOG) algorithm [1], developed for human detection and later extended to object detection, is one of the most successful and popular algorithms in its class. In this algorithm, object descriptors are extracted from the detection window using a grid of overlapping blocks. Each block is divided into cells in which histograms of intensity gradients are collected as HOG features. Vectors of histograms are normalized and passed to a Support Vector Machine (SVM) classifier to recognize a person or an object.
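The cell/histogram/normalization pipeline described above can be sketched as follows. The cell size, bin count, and unsigned-gradient convention are illustrative assumptions, not the paper's fixed-point parameters.

```python
# Hedged sketch of the HOG stages named in the abstract: per-cell
# gradient-orientation histograms followed by L2 block normalization.
import math

def hog_cell_histograms(img, cell=4, bins=9):
    """Per-cell orientation histograms for a 2-D list `img` of intensities."""
    h, w = len(img), len(img[0])
    cells = {}
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]          # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned gradient
            b = min(int(ang / (180.0 / bins)), bins - 1)
            key = (y // cell, x // cell)
            hist = cells.setdefault(key, [0.0] * bins)
            hist[b] += mag                               # magnitude-weighted vote
    return cells

def l2_normalize(hist):
    """Block-normalization step: L2 norm with a small epsilon."""
    norm = math.sqrt(sum(v * v for v in hist)) + 1e-6
    return [v / norm for v in hist]
```

A vertical edge produces purely horizontal gradients, so all of its votes land in the 0-degree bin; the normalized histograms are what would be concatenated and fed to the SVM.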
DOI: 10.1109/FCCM.2014.40 · Published: 2014-05-11
Citations: 9
Better-Than-DMR Techniques for Yield Improvement
S. Sanae, Yuko Hara-Azumi, S. Yamashita, Y. Nakashima
In this work, we first study LUT optimization in PPCs to increase their area-efficiency and thereby improve yield. We focus on the fact that although 2^(2^n) configurations are available for an n-input LUT, such full programmability is not needed: one configuration is enough for bypassing one specific fault. We then optimize away the overly rich programmability of LUTs by exploiting application features, reducing the area cost without degrading the fault bypassability of the original PPC.
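The configuration count quoted above (garbled as "22n" in the extracted text) follows from the LUT storing 2^n truth-table bits, giving 2^(2^n) realizable functions. A minimal illustration, with helper names invented here:

```python
# Hedged illustration of the LUT configuration count: an n-input LUT
# holds 2**n truth-table bits, so it realizes 2**(2**n) distinct
# Boolean functions. Helper names are invented for this sketch.
def lut_configurations(n):
    """Number of distinct Boolean functions an n-input LUT can realize."""
    return 2 ** (2 ** n)

def truth_table(n, config):
    """Decode one configuration integer into its 2**n truth-table bits."""
    return [(config >> i) & 1 for i in range(2 ** n)]
```

Enumerating `truth_table(2, c)` for all 16 configurations recovers every 2-input Boolean function exactly once, which is the "full programmability" the paper argues is more than fault bypassing actually requires.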
DOI: 10.1109/FCCM.2014.21 · Published: 2014-05-11
Citations: 0
A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication
J. Fowers, Kalin Ovtcharov, K. Strauss, Eric S. Chung, G. Stitt
Sparse matrix-vector multiplication (SMVM) is a crucial primitive used in a variety of scientific and commercial applications. Despite having significant parallelism, SMVM is a challenging kernel to optimize due to its irregular memory access characteristics. Numerous studies have proposed the use of FPGAs to accelerate SMVM implementations. However, most prior approaches focus on parallelizing multiply-accumulate operations within a single row of the matrix (which limits parallelism if rows are small) and/or make inefficient use of the memory system when fetching matrix and vector elements. In this paper, we introduce an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping the hardware complexity and on-chip memory usage low. This system compares favorably with prior FPGA SMVM implementations. For the over 700 University of Florida sparse matrices we evaluated, it also performs within about two thirds of CPU SMVM performance on average, even though it has 2.4x lower DRAM memory bandwidth, and within almost one third of GPU SMVM performance on average, even at 9x lower memory bandwidth. Additionally, it consumes only 25W, for power efficiencies 2.6x and 2.3x higher than CPU and GPU, respectively, based on maximum device power.
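For readers unfamiliar with the kernel, the sketch below shows the baseline operation in plain CSR form. The paper's novel encoding is not reproduced here, so standard CSR stands in for it; note how the per-row loop bound (`row_ptr[r]` to `row_ptr[r+1]`) is exactly the within-row parallelism limit the abstract mentions.

```python
# Hedged sketch: y = A @ x for a sparse matrix in CSR (Compressed
# Sparse Row) format -- a stand-in for the paper's custom encoding.
def smvm_csr(values, col_idx, row_ptr, x):
    """values/col_idx hold the nonzeros; row_ptr[r]:row_ptr[r+1] spans row r."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):  # nonzeros of row r
            acc += values[k] * x[col_idx[k]]         # multiply-accumulate
        y.append(acc)
    return y
```

For the matrix [[1, 0, 2], [0, 3, 0]], the CSR arrays are `values=[1, 2, 3]`, `col_idx=[0, 2, 1]`, `row_ptr=[0, 2, 3]`.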
DOI: 10.1109/FCCM.2014.23 · Published: 2014-05-11
Citations: 119
On Hard Adders and Carry Chains in FPGAs
J. Luu, Conor McCullough, Sen Wang, Safeen Huda, Bo Yan, Charles Chiasson, K. Kent, J. Anderson, Jonathan Rose, Vaughn Betz
Hardened adder and carry logic is widely used in commercial FPGAs to improve the efficiency of arithmetic functions. There are many design choices and complexities associated with such hardening, including circuit design, FPGA architectural choices, and the CAD flow. There has been very little study of these choices, however, and hence we explore a number of possibilities for hard adder design. We also highlight optimizations during front-end elaboration that help ameliorate the restrictions placed on logic synthesis by hardened arithmetic. We show that hard adders and carry chains, when used for simple adders, increase performance by a factor of four or more, but on larger benchmark designs that contain arithmetic, improve overall performance by roughly 15%. We measure an average area increase of 5% for architectures with carry chains but believe that better logic synthesis should reduce this penalty. Interestingly, we show that adding dedicated inter-logic-block carry links or fast carry look-ahead hardened adders results in only minor delay improvements for complete designs.
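The carry chain being hardened implements bit-serial ripple-carry addition; a minimal bit-level model, with the generate/propagate terms that carry-lookahead variants build on, can be sketched as:

```python
# Hedged bit-level sketch of ripple-carry addition, the operation a hard
# carry chain implements. Bit lists are little-endian (bit 0 first).
def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length little-endian bit lists; returns (sum_bits, carry_out)."""
    s, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        g = a & b          # generate: this position produces a carry
        p = a ^ b          # propagate: this position forwards an incoming carry
        s.append(p ^ carry)
        carry = g | (p & carry)   # the chain link a hard carry wire shortcuts
    return s, carry
```

The serial dependence of `carry` on the previous bit is the critical path that dedicated carry wiring accelerates; lookahead adders instead combine the `g`/`p` terms across groups of bits.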
DOI: 10.1109/FCCM.2014.25 · Published: 2014-05-11
Citations: 26
From GPU to FPGA: A Pipelined Hierarchical Approach to Fast and Memory-Efficient NDN Name Lookup
Yanbiao Li, Dafang Zhang, Xian Yu, Jing Long, W. Liang
Summary form only given. Named Data Networking (NDN) is an emerging future Internet architecture with an alternative communication paradigm. For NDN, name lookup, just like IP address lookup for TCP/IP, plays an important role in forwarding. However, performing Longest Prefix Matching (LPM) on NDN names is more challenging. Recently, Graphics Processing Units (GPUs) have been shown to be of value in supporting wire-speed name lookup, but the latency resulting from batching and transferring names is less encouraging. On the other hand, in the area of IP address lookup, FPGAs are widely used to implement Static Random Access Memory (SRAM)-based pipelines for fast lookup and controllable latency. Thus, in this paper, we study how to accelerate NDN name lookup using an FPGA-based pipeline.
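Longest Prefix Matching over hierarchical NDN names operates on name components rather than address bits; a component-trie sketch follows. The FIB entries and names below are made-up examples, and the `_hop` marker key is an assumption of this sketch.

```python
# Hedged sketch of component-granularity Longest Prefix Matching over
# NDN-style names (e.g. "/com/example/video"). Entries are illustrative.
def build_fib(entries):
    """entries: {name_prefix: next_hop}; builds a trie keyed by name component."""
    root = {}
    for prefix, hop in entries.items():
        node = root
        for comp in prefix.strip("/").split("/"):
            node = node.setdefault(comp, {})
        node["_hop"] = hop            # marker for "a FIB entry ends here"
    return root

def longest_prefix_match(root, name):
    """Walk the trie, remembering the deepest next-hop seen so far."""
    node, best = root, None
    for comp in name.strip("/").split("/"):
        if comp not in node:
            break
        node = node[comp]
        best = node.get("_hop", best)
    return best
```

Each trie level corresponds naturally to one pipeline stage, which is the mapping an SRAM-based FPGA pipeline exploits for controllable per-lookup latency.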
DOI: 10.1109/FCCM.2014.39 · Published: 2014-05-11
Citations: 1
GROK-INT: Generating Real On-Chip Knowledge for Interconnect Delays Using Timing Extraction
Benjamin Gojman, A. DeHon
With continued scaling, all transistors are no longer created equal. The delay of a length 4 horizontal routing segment at coordinates (23,17) will differ from one at (12,14) in the same FPGA and from the same segment in another FPGA. The vendor tools give conservative values for these delays, but knowing exactly what these delays are can be invaluable. In this paper, we show how to obtain this information, inexpensively, using only components that already exist on the FPGA (configurable PLLs, registers, logic, and interconnect). The techniques we present are general and can be used to measure the delays of any resource on any FPGA with these components. We provide general algorithms for identifying the set of useful delay components, the set of measurements necessary to compute these delay components, and the calculations necessary to perform the computation. We demonstrate our techniques on the interconnect for an Altera Cyclone III (65nm). As a result, we are able to quantify over a 100 ps spread in delays for nominally identical routing segments on a single FPGA.
DOI: 10.1109/FCCM.2014.31 · Published: 2014-05-11
Citations: 12
Harmonica: An FPGA-Based Data Parallel Soft Core
C. Kersey, S. Yalamanchili, Hyojong Kim, Nimit Nigania, Hyesoon Kim
General-purpose GPUs or GPGPUs have taken their place in the market, being present in 38 of the Top 500 supercomputers [5]. In the same way that the emergence of FPGAs in the 1980s led to a demand for soft cores with instruction sets similar to the CPUs of the day, we anticipate a similar demand in the 2010s for soft cores with GPGPU instruction sets. These architectures are distinguished by their single-instruction-multiple-thread (SIMT) execution model, achieving throughput by running multiple threads of execution simultaneously across multiple functional units, keeping separate register values for each lane of execution.
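The SIMT property described above — one instruction stream, per-lane register state — can be modeled in a few lines. The instruction mnemonics and two-register file below are invented for illustration and bear no relation to Harmonica's actual ISA.

```python
# Hedged toy model of SIMT execution: every lane executes the same
# instruction sequence, but each lane keeps its own register values.
# Mnemonics ("add", "muli") and registers are invented for this sketch.
def simt_run(program, n_lanes):
    # Per-lane register files; r0 is pre-loaded with the lane ID.
    regs = [{"r0": lane, "r1": 0} for lane in range(n_lanes)]
    for op, dst, a, b in program:
        for r in regs:                    # single instruction, multiple lanes
            if op == "add":               # dst = reg a + reg b
                r[dst] = r[a] + r[b]
            elif op == "muli":            # dst = reg a * immediate b
                r[dst] = r[a] * b
    return [r["r1"] for r in regs]
```

Running `[("muli", "r1", "r0", 10), ("add", "r1", "r1", "r0")]` computes `11 * lane_id` in every lane from the same two instructions — the lockstep divergence-free case a SIMT soft core handles most efficiently.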
DOI: 10.1109/FCCM.2014.53 · Published: 2014-05-11
Citations: 4
An Architectural Approach to Characterizing and Eliminating Sources of Inefficiency in a Soft Processor Design
Kaveh Aasaraai, Andreas Moshovos
This work takes an architectural approach to systematically characterize the components and mechanisms that are the main sources of low operating clock frequency when implementing a typical pipelined general-purpose processor on an FPGA. Several previous works have addressed specific implementation inefficiencies, though mostly on a case-by-case basis. Accordingly, there is a need to systematically characterize the sources of inefficiency in soft processor designs. Such a characterization serves to deepen our understanding of FPGA implementation trade-offs and can serve as the starting point for developing FPGA-friendly designs that achieve higher performance and/or lower area. We start with a typical 5-stage pipelined architecture that is optimized for custom logic implementation and that focuses on correctness, modularity, and speed of development.
DOI: 10.1109/FCCM.2014.51 · Published: 2014-05-11
Citations: 1
Memory Optimized Re-gridding for Non-uniform Fast Fourier Transform on FPGAs
Umer I. Cheema, G. Nash, R. Ansari, A. Khokhar
Summary form only given. The Discrete Fourier Transform (DFT) can be viewed as the Fourier Transform of a periodic and regularly sampled signal, as commonly defined in equation 1. The Non-Uniform Discrete Fourier Transform (NuDFT) is a generalization of the DFT to data that may not be regularly sampled in spatial or temporal dimensions. This flexibility is beneficial in situations where sensor placement cannot be guaranteed to be regular, or where prior knowledge of the informational content allows for better sampling patterns than a regular one. NuDFT is used in applications such as Synthetic Aperture Radar (SAR), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). The NuDFT definition is shown in equation 2. Here the sample locations are points s_i in the set S. Each point s_i has a complex value, with location or frequency components s_ix and s_iy. The location or frequency components are, of course, not restricted to a discrete sampling grid.
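Equations 1 and 2 referenced above did not survive extraction. A hedged reconstruction in common NuDFT notation (not necessarily the paper's exact symbols) is:

```latex
% Reconstruction of the two referenced definitions; notation follows
% common conventions and is an assumption, not the paper's own.
\begin{align}
X_k &= \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}
      && \text{(1: uniform DFT)} \\
X(k_x, k_y) &= \sum_{s_i \in S} v_i \, e^{-j 2\pi \left(k_x s_{ix} + k_y s_{iy}\right)}
      && \text{(2: 2-D NuDFT, sample } s_i \text{ with value } v_i\text{)}
\end{align}
```

Equation 2 reduces to equation 1 when the sample locations fall on a regular grid, which is exactly the structure the re-gridding step tries to recover so that a fast (FFT-based) evaluation becomes possible.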
DOI: 10.1109/FCCM.2014.35 · Published: 2014-05-11
Citations: 0