
2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Publications

Accurate Thermal-Profile Estimation and Validation for FPGA-Mapped Circuits
A. Amouri, H. Amrouch, T. Ebi, J. Henkel, M. Tahoori
Accurate thermal-profile estimation for FPGAs at design time is necessary to avoid unexpected thermal hot-spots in the circuit before the FPGA is deployed for in-field operation. Both accurate dynamic and leakage power values are needed for thermal-profile estimation, and they can be obtained from the FPGA vendor's tools. However, these tools report leakage power as a single value for the whole chip, and neither the literature nor the FPGA toolset gives any detail about how it is distributed across the chip for thermal simulation. To cope with this problem, we present a method for properly distributing the leakage power across the FPGA chip. The method uses a temperature-leakage loop estimation model to distribute and adapt the leakage power for more accurate thermal simulation. Furthermore, to accurately calibrate the presented method and its model, and to validate the resulting thermal profiles, we use an infrared thermal camera that measures the emissions from the backside of a Virtex-5 FPGA chip. Results from testing several designs of different sizes and frequencies show that our approach achieves accurate thermal-profile estimation compared to the camera measurements, with an average absolute estimation error of around 1°C across the chip.
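The temperature-leakage loop can be pictured as a fixed-point iteration: leakage grows roughly exponentially with local temperature, and the thermal map in turn depends on total (dynamic plus leakage) power. The sketch below is a simplified illustration of that loop, not the authors' calibrated model; the grid size, thermal resistance, and leakage temperature coefficient are assumed values.

```python
# Minimal sketch of a temperature-leakage estimation loop (assumed constants,
# not the paper's calibrated model).
import numpy as np

GRID = (8, 8)        # coarse tile grid over the chip (assumption)
T_AMBIENT = 45.0     # ambient/package temperature in deg C (assumption)
R_THERMAL = 2.0      # simplistic per-tile thermal resistance, K/W (assumption)
BETA = 0.04          # leakage temperature sensitivity, 1/K (assumption)

def leakage(p_leak_ref, temp):
    """Leakage grows roughly exponentially with temperature."""
    return p_leak_ref * np.exp(BETA * (temp - T_AMBIENT))

def thermal_map(p_total):
    """Toy thermal model: each tile heats in proportion to its own power."""
    return T_AMBIENT + R_THERMAL * p_total

def estimate_profile(p_dynamic, p_leak_total, iters=20):
    # Start by spreading the single chip-wide leakage value uniformly ...
    p_leak = np.full(GRID, p_leak_total / (GRID[0] * GRID[1]))
    temp = thermal_map(p_dynamic + p_leak)
    for _ in range(iters):
        # ... then let the local temperature redistribute it.
        p_leak = leakage(p_leak, temp)
        p_leak *= p_leak_total / p_leak.sum()   # keep the vendor-reported total
        temp = thermal_map(p_dynamic + p_leak)
    return temp

p_dyn = np.random.default_rng(0).uniform(0.01, 0.05, GRID)  # per-tile dynamic power (W)
print(estimate_profile(p_dyn, p_leak_total=0.8).round(1))
```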
Citations: 14
High Speed Video Processing Using Fine-Grained Processing on FPGA Platform
Z. Ang, Akash Kumar, Yajun Ha
This summary paper proposes an FPGA-based array processor that performs Laplacian filtering on a 40 by 40 pixel grayscale video. The architecture comprises bit-serial pixel processors interconnected to form a two-dimensional mesh array. It features a novel use of partial reconfiguration to transfer data to and from the array. Each processor occupies a single configurable logic block and achieves a target frame rate of 10000 frames per second at an operating frequency of 0.31 MHz on the Virtex-6 ML605 Evaluation Kit. The detailed correspondence between the contents of slice lookup tables and the Virtex-6 bitstream format is also documented.
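As a point of reference for the per-pixel work each mesh element performs, a software version of a 3x3 Laplacian filter over a 40x40 frame might look like the sketch below; a 4-connected kernel is assumed here, and the hardware's bit-serial arithmetic and border handling are not modelled.

```python
# Software reference for the per-pixel Laplacian each mesh element computes
# (a 4-connected kernel is assumed; the paper's exact kernel may differ).
import numpy as np

def laplacian(frame):
    out = np.zeros_like(frame, dtype=np.int16)
    # Interior pixels only; the hardware's border handling is not specified here.
    out[1:-1, 1:-1] = (
        frame[:-2, 1:-1].astype(np.int16) + frame[2:, 1:-1]
        + frame[1:-1, :-2] + frame[1:-1, 2:]
        - 4 * frame[1:-1, 1:-1].astype(np.int16)
    )
    return out

frame = np.random.default_rng(1).integers(0, 256, (40, 40), dtype=np.uint8)
print(laplacian(frame).shape)   # (40, 40): one output per pixel processor
```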
Citations: 7
High-Level Description and Synthesis of Floating-Point Accumulators on FPGA
Marc-André Daigneault, J. David
Decades of research in the field of high-level hardware description have resulted in tools that can automatically transform C/C++ constructs into highly optimized parallel and pipelined architectures. Such approaches work well when the control flow is known a priori, since the computation reduces to a large dataflow graph that can be mapped onto the available operators. Nevertheless, some applications have a control flow that is highly dependent on the data. This paper focuses on the hardware implementation of such applications and presents a high-level synthesis methodology applied to a Hardware Description Language (HDL) in which assignments correspond to self-synchronized connections between predefined data streaming sources and sinks. A data transfer occurs over an established connection when both source and sink are ready, according to their synchronization interfaces. Founded on a high-level communicating-FSM programming model, the language allows the user to describe and dynamically modify streaming architectures exploiting spatial and temporal parallelism. Our compiler attempts to maximize the number of transfers at each clock cycle and automatically resolves the potential combinatorial loops induced by the dynamic connection of dependent sources and sinks. The methodology is applied to the synthesis of a pipelined floating-point accumulator using the Delayed-Buffering (DB) reduction method. The results we obtain are similar to state-of-the-art dedicated architectures but require much less design time and expertise.
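Delayed buffering sidesteps the multi-cycle latency of a pipelined floating-point adder by keeping one partial sum per pipeline slot, so a new addition can be issued every cycle, and folding the partial sums together at the end. A minimal software model of that reduction (the adder latency is an assumed value) is:

```python
# Minimal software model of delayed-buffering accumulation: one partial sum
# per adder pipeline stage, combined at the end. The latency value is assumed.
ADDER_LATENCY = 8

def db_accumulate(values):
    partials = [0.0] * ADDER_LATENCY
    for i, v in enumerate(values):
        # Each value is added to the partial sum that is "free" this cycle,
        # so an addition can be issued every cycle despite the pipeline depth.
        partials[i % ADDER_LATENCY] += v
    total = 0.0
    for p in partials:          # final reduction of the buffered partial sums
        total += p
    return total

print(db_accumulate([0.1] * 1000))
```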
Citations: 3
On Optimizing the Arithmetic Precision of MCMC Algorithms
Grigorios Mingas, Farhan Rahman, C. Bouganis
Markov Chain Monte Carlo (MCMC) is a ubiquitous stochastic method used to draw random samples from arbitrary probability distributions, such as those encountered in Bayesian inference. MCMC often requires prohibitively long runtimes to produce a representative sample in problems with high dimensionality and large-scale data. Field-Programmable Gate Arrays (FPGAs) have proven to be a suitable platform for MCMC acceleration due to their ability to support massive parallelism. This paper introduces an automated method which minimizes the floating-point precision of the most computationally intensive part of an FPGA-mapped MCMC sampler, while keeping the precision-related bias in the output within a user-specified tolerance. The method is based on an efficient bias estimator, proposed here, which can estimate the bias in the output from only a few random samples. The optimization process involves FPGA pre-runs, which estimate the bias and choose the optimized precision. This precision is then used to reconfigure the FPGA for the final, long MCMC run, allowing higher sampling throughput. The process requires no user intervention. The method is tested on two Bayesian inference case studies: mixture models and neural network regression. The achieved speedups over double-precision FPGA designs were 3.5x-5x (including the optimization overhead). Comparisons with a sequential CPU and a GPGPU showed speedups of 223x-446x and 16x-18x respectively.
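The paper's bias estimator and sampler architecture are not reproduced here; the sketch below only illustrates the general idea of comparing a short reduced-precision pre-run against a double-precision one, emulating a reduced mantissa width in software. The target distribution, mantissa widths, and run lengths are assumptions for illustration.

```python
# Illustrative comparison of a reduced-precision MCMC pre-run against a
# double-precision one. This is not the paper's bias estimator; all constants
# and the toy target distribution are assumptions.
import math
import random

def quantize(x, mant_bits):
    """Round x to the given number of mantissa bits (software emulation)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** mant_bits
    return math.ldexp(round(m * scale) / scale, e)

def log_target(x, mant_bits=None):
    lp = -0.5 * x * x                     # standard normal log-density (up to a constant)
    return lp if mant_bits is None else quantize(lp, mant_bits)

def metropolis(n, mant_bits=None, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n):
        prop = x + rng.gauss(0.0, 1.0)
        if math.log(rng.random()) < log_target(prop, mant_bits) - log_target(x, mant_bits):
            x = prop
        samples.append(x)
    return samples

ref = metropolis(20000)                   # double-precision pre-run
low = metropolis(20000, mant_bits=10)     # reduced-precision pre-run
bias = abs(sum(s * s for s in low) / len(low) - sum(s * s for s in ref) / len(ref))
print(f"estimated bias in E[x^2] at 10 mantissa bits: {bias:.4f}")
```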
Citations: 9
Parallel Computation of Skyline Queries
L. Woods, G. Alonso, J. Teubner
Due to stagnant clock speeds and the high power consumption of commodity microprocessors, database vendors have started to explore massively parallel co-processors such as FPGAs to further increase performance. A typical approach is to push simple but compute-intensive operations (e.g., prefiltering, (de)compression) to FPGAs for acceleration. In this paper, we show how a significantly more complex operation, the computation of the skyline, can be holistically implemented on an FPGA. A skyline query computes the Pareto-optimal set of multi-dimensional data points. These queries have been studied extensively in software over the last decade, but this paper is the first to examine skyline computation in hardware. We propose a methodology that interleaves data storage and computation, allowing multiple operations to be executed on the same working set in parallel while accounting for all data dependencies. Our experiments show very promising results compared to CPU-based solutions.
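A skyline (Pareto-optimal) point is one that no other point dominates, i.e. no other point is at least as good in every dimension and strictly better in at least one. A straightforward software baseline for this check, not the paper's FPGA dataflow, looks like the following (smaller-is-better is assumed for every dimension):

```python
# Simple software baseline for skyline computation (smaller-is-better assumed).
def dominates(a, b):
    """a dominates b if a is <= b in every dimension and < in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    result = []
    for p in points:
        if any(dominates(q, p) for q in points if q is not p):
            continue            # p is dominated, so it is not in the skyline
        result.append(p)
    return result

hotels = [(50, 3.2), (80, 0.5), (60, 0.4), (45, 4.0), (90, 2.0)]  # (price, distance)
print(skyline(hotels))          # only the non-dominated (price, distance) pairs survive
```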
Citations: 45
Reconfigurable Acceleration of Short Read Mapping
James Arram, K. H. Tsoi, W. Luk, P. Jiang
Recent improvements in the throughput of next-generation DNA sequencing machines pose a great computational challenge in analysing the massive quantities of data produced. This paper proposes a novel approach, based on reconfigurable computing technology, for accelerating short read mapping, in which the positions of millions of short reads are located relative to a known reference sequence. Our approach consists of two key components: an exact string matcher for the bulk of the alignment process, and an approximate string matcher for the remaining cases. We characterise interesting regions of the design space, including homogeneous, heterogeneous and run-time reconfigurable designs, and provide back-of-envelope estimations of the corresponding performance. We show that a particular implementation of this architecture targeting a single FPGA can be up to 293 times faster than BWA on an Intel X5650 CPU, and 134 times faster than SOAP3 on an NVIDIA GTX 580 GPU.
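The two-stage idea, exact matching for the bulk of reads and an approximate matcher for the rest, can be illustrated in software with an exact-match index and a bounded-mismatch fallback scan; the index layout and mismatch bound below are assumptions for illustration, not the paper's architecture.

```python
# Illustrative two-stage short-read mapper: exact lookup first, then a
# bounded-mismatch scan for reads that fail exact matching. The index layout
# and the mismatch bound are assumptions for demonstration only.
from collections import defaultdict

def build_index(reference, read_len):
    index = defaultdict(list)
    for i in range(len(reference) - read_len + 1):
        index[reference[i:i + read_len]].append(i)
    return index

def map_read(read, reference, index, max_mismatches=2):
    if read in index:                     # stage 1: exact string matching
        return index[read]
    hits = []                             # stage 2: approximate string matching
    for i in range(len(reference) - len(read) + 1):
        mism = sum(a != b for a, b in zip(read, reference[i:i + len(read)]))
        if mism <= max_mismatches:
            hits.append(i)
    return hits

ref = "ACGTACGTTAGCACGTACGA"
idx = build_index(ref, 8)
print(map_read("ACGTACGT", ref, idx))     # exact hit(s)
print(map_read("ACGTACCT", ref, idx))     # falls back to approximate matching
```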
Citations: 40
An FPGA Based PCI-E Root Complex Architecture for Standalone SOPCs
Yingjie Cao, Yongxin Zhu, Xu Wang, Jiang Jiang, Meikang Qiu
In this paper we present an FPGA (field-programmable gate array) based PCI-E (PCI Express) root complex architecture for SOPCs (Systems-on-a-Programmable-Chip). In our work, the system on the FPGA serves as a PCIE master device rather than a PCIE endpoint, the common arrangement when an FPGA acts as a co-processing device driven by a desktop computer or a server. We use this system to control a PCIE endpoint, itself an FPGA-based endpoint implemented on another FPGA board. The architecture requires only IP cores that are available free of charge. We also provide a basic software driver so that specific device drivers can be developed on top of it to control popular PCIE devices in the future, e.g. Ethernet cards or graphics cards. The whole architecture has been implemented on Xilinx Virtex-6 FPGAs, showing that it is a feasible approach for standalone SOPCs and more efficient than designs that rely on additional generic controlling processors.
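On the software side, a root complex typically begins by enumerating the bus: issuing configuration reads for every bus/device/function and treating an all-ones vendor ID as an empty slot. The sketch below is a generic illustration of that enumeration using a hypothetical read_config32 callable; it is not the driver shipped with this work, and the stubbed device IDs are made up.

```python
# Generic sketch of PCIe enumeration as a root-complex driver might perform it.
# read_config32 is a hypothetical callable wrapping the core's configuration
# reads; it is not part of the work described above.
def enumerate_pcie(read_config32, max_bus=4):
    devices = []
    for bus in range(max_bus):
        for dev in range(32):
            for fn in range(8):
                ident = read_config32(bus, dev, fn, offset=0x00)
                if ident & 0xFFFF == 0xFFFF:      # no device responds at this slot
                    continue
                vendor_id = ident & 0xFFFF
                device_id = ident >> 16
                devices.append((bus, dev, fn, vendor_id, device_id))
    return devices

# Example with a stub modelling a single endpoint at bus 1, device 0, function 0
# (0x10EE is the Xilinx vendor ID; the device ID here is arbitrary).
def stub_read(bus, dev, fn, offset):
    return 0x650010EE if (bus, dev, fn, offset) == (1, 0, 0, 0) else 0xFFFFFFFF

print(enumerate_pcie(stub_read))    # one endpoint found at bus 1, device 0, function 0
```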
Citations: 3
Automating Elimination of Idle Functions by Run-Time Reconfiguration
Xinyu Niu, T. Chau, Qiwei Jin, W. Luk, Qiang Liu, O. Pell
A design approach is proposed to automatically identify and exploit run-time reconfiguration opportunities while optimising resource utilisation. We introduce the Reconfiguration Data Flow Graph, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation. Three applications, based on barrier option pricing, particle filtering, and reverse time migration, are used to evaluate the proposed approach. The run-time solutions approximate the theoretical performance by eliminating idle functions, and are 1.31 to 2.19 times faster than optimised static designs. FPGA designs developed with the proposed approach are up to 28.8 times faster than optimised CPU reference designs and 1.55 times faster than optimised GPU designs.
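The underlying observation, that a function which is idle in some execution phase need not occupy the fabric during that phase, can be illustrated by packing functions with non-overlapping active phases into the same reconfigurable region. The greedy grouping below is a simplified illustration only, not the paper's RDFG-based flow; the function names and phases are made up.

```python
# Simplified illustration of idle-function elimination: functions whose active
# phases never overlap can time-share one reconfigurable region. This greedy
# packing is an illustration only, not the paper's RDFG-based flow.
def pack_into_regions(active_phases):
    """active_phases: {function_name: set of phases in which it is active}."""
    regions = []                          # each region holds mutually exclusive functions
    for fn, phases in sorted(active_phases.items()):
        for region in regions:
            if all(phases.isdisjoint(active_phases[other]) for other in region):
                region.append(fn)
                break
        else:
            regions.append([fn])
    return regions

phases = {
    "data_setup":   {0},
    "pricing_core": {1, 2},
    "reduction":    {3},
    "io_writeback": {3},
}
# Functions with disjoint active phases end up sharing a region; "reduction"
# conflicts with "io_writeback" (both active in phase 3) and gets its own.
print(pack_into_regions(phases))
```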
Citations: 19
Image Segmentation Using Hardware Forest Classifiers
Richard Neil Pittman, A. Forin, A. Criminisi, J. Shotton, A. Mahram
Image segmentation is the process of partitioning an image into segments or subsets of pixels for further analysis, such as separating the interesting objects in the foreground from the uninteresting objects in the background. In many image processing applications, the process requires a sequence of computational steps on a per-pixel basis, tying performance to the size and resolution of the image. As applications demand greater resolution and larger images, the computational requirements of this step can quickly exceed the capabilities of available CPUs, especially in the power- and thermally-constrained areas of consumer electronics and mobile devices. In this work, we use a hardware tree-based classifier to solve the image segmentation problem. The application is background removal (BGR) from depth maps obtained from the Microsoft Kinect sensor. After the image is segmented, subsequent steps classify the objects in the scene. The approach is flexible: to address different application domains we only need to change the trees used by the classifiers. We describe two distinct approaches and evaluate their performance using the commercial-grade testing environment used for the Microsoft Xbox gaming console.
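Per-pixel classification with a decision forest walks each pixel down one or more trees, where every split compares a simple feature (for Kinect-style depth maps, often a depth difference at a pixel offset) against a threshold, and the reached leaf yields a class probability; a forest averages several trees. The single tiny tree below is an illustration with made-up offsets and thresholds, not the trees used in this work or the Xbox pipeline.

```python
# Tiny illustration of per-pixel tree classification on a depth map.
# Node layout, offsets and thresholds are made up for demonstration; they are
# not the trees used in the paper or the Xbox pipeline.
import numpy as np

# Internal node: (offset_dy, offset_dx, threshold, left_child, right_child);
# a leaf stores P(foreground).
TREE = {
    0: (0, 4, 30.0, 1, 2),      # compare depth(p) - depth(p + (0, 4)) with 30
    1: "leaf:0.9",
    2: (3, 0, 15.0, 3, 4),
    3: "leaf:0.6",
    4: "leaf:0.1",
}

def classify_pixel(depth, y, x, tree=TREE):
    node = 0
    h, w = depth.shape
    while True:
        entry = tree[node]
        if isinstance(entry, str):
            return float(entry.split(":")[1])
        dy, dx, thresh, left, right = entry
        ny, nx = min(y + dy, h - 1), min(x + dx, w - 1)   # clamp probes at the border
        feature = float(depth[y, x]) - float(depth[ny, nx])
        node = left if feature < thresh else right

depth = np.random.default_rng(2).uniform(500, 4000, (8, 8))
mask = np.array([[classify_pixel(depth, y, x) > 0.5 for x in range(8)] for y in range(8)])
print(mask.astype(int))         # 1 = foreground, 0 = background
```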
Citations: 5
Open-Source Bitstream Generation
Ritesh Soni, Neil Steiner, M. French
This work presents an open-source bitstream generation tool for Torc. Bitstream generation has traditionally been the single part of the FPGA design flow that could not be openly reproduced, but our novel approach enables this without reverse-engineering or violating End-User License Agreement terms. We begin by creating a library of “micro-bitstreams” which constitute a collection of primitives at a granularity of our choosing. These primitives can then be combined to create larger designs, or portions thereof, with simple merging operations. Our effort is motivated by a desire to resume earlier work on embedded bitstream generation and autonomous hardware. This is not feasible with Xilinx bitgen because there is no reasonable way to run an x86 binary with complex library and data dependencies on most embedded systems. Initial support is limited to the Virtex5, but we intend to extend this to other Xilinx architectures. We are able to support nearly all routing resources in the device, as well as the most common logic resources.
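If each micro-bitstream only sets configuration bits in the frames it touches, merging primitives into a larger design reduces to OR-ing their frame contents into a blank frame image. The sketch below models that idea with frames as word arrays keyed by frame address; the addresses and data words are hypothetical, and this is not Torc's actual data structure (only the 41-word Virtex-5 frame size is a real device parameter).

```python
# Sketch of micro-bitstream merging: each primitive contributes configuration
# words for a few frames, and merging OR's them into the full frame image.
# Frame addresses and data words are hypothetical illustrations.
FRAME_WORDS = 41          # words per configuration frame (Virtex-5 value)

def blank_frame():
    return [0] * FRAME_WORDS

def merge(base, micro_bitstreams):
    """base: {frame_address: [word, ...]}; micro_bitstreams: list of the same."""
    for micro in micro_bitstreams:
        for addr, words in micro.items():
            frame = base.setdefault(addr, blank_frame())
            for i, w in enumerate(words):
                frame[i] |= w          # primitives only ever set bits
    return base

lut_primitive = {0x00400100: blank_frame()}
lut_primitive[0x00400100][7] = 0x0000FF00       # hypothetical LUT init bits
route_primitive = {0x00400100: blank_frame()}
route_primitive[0x00400100][7] = 0x000000F0     # hypothetical routing bits, same frame

design = merge({}, [lut_primitive, route_primitive])
print(hex(design[0x00400100][7]))               # 0xfff0: both contributions present
```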
Citations: 21