Evaluating FPGAs for floating-point performance
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745680, Pages: 1-6
D. Strenski, Jim Simkins, R. Walke, Ralph Wittig
Field programmable gate arrays (FPGAs) have been available for more than 25 years. Initially they were used to simplify embedded processing circuits and then expanded into simulating application specific integrated circuit (ASIC) designs. In the past few years they have grown in density and speed to replace ASICs in some applications and to assist microprocessors as attached accelerators. This paper will calculate the floating-point peak performance for three types of FPGAs using 64-bit, 32-bit, and 24-bit word lengths and compare this with a reference quad-core microprocessor. These calculations are further refined to estimate the actual performance of these FPGAs at floating-point calculations and compared with the microprocessor at its optimal design point and also away from this design point. Lastly, the paper explores the nature of floating-point calculations and looks at examples where the same algorithmic accuracy can be achieved with non-floating-point calculations.
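For context, a peak floating-point figure for either device type reduces to a product of parallel functional units and clock rate. Below is a minimal sketch of that arithmetic; the unit counts, clock rates, and FLOPs-per-cycle figures are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: peak FLOPS = (floating-point units that fit) x (clock) x (ops per unit per cycle).
# All resource and clock figures below are illustrative assumptions, not the paper's values.

def fpga_peak_gflops(fpu_count: int, clock_mhz: float, ops_per_cycle: int = 2) -> float:
    """Peak GFLOP/s for an FPGA holding `fpu_count` fused multiply-add style units."""
    return fpu_count * clock_mhz * 1e6 * ops_per_cycle / 1e9

def cpu_peak_gflops(cores: int, clock_ghz: float, flops_per_core_cycle: int) -> float:
    """Peak GFLOP/s for a multicore CPU (e.g., 4 cores x 4 SIMD FLOPs per cycle)."""
    return cores * clock_ghz * flops_per_core_cycle

# Hypothetical 64-bit vs. 32-bit design points: narrower words let more units fit and clock higher.
print(fpga_peak_gflops(fpu_count=40, clock_mhz=250))   # e.g. 64-bit units -> 20.0 GFLOP/s
print(fpga_peak_gflops(fpu_count=160, clock_mhz=300))  # e.g. 32-bit units -> 96.0 GFLOP/s
print(cpu_peak_gflops(cores=4, clock_ghz=2.5, flops_per_core_cycle=4))  # quad-core -> 40.0 GFLOP/s
```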
{"title":"Evaluating FPGAs for floating-point performance","authors":"D. Strenski, Jim Simkins, R. Walke, Ralph Wittig","doi":"10.1109/HPRCTA.2008.4745680","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745680","url":null,"abstract":"Field programmable gate arrays (FPGAs) have been available for more than 25 years. Initially they were used to simplify embedded processing circuits and then expanded into simulating application specific integrated circuit (ASIC) designs. In the past few years they have grown in density and speed to replace ASICs in some applications and to assist microprocessors as attached accelerators. This paper will calculate the floating-point peak performance for three types of FPGAs using 64-bit, 32-bit, and 24-bit word lengths and compare this with a reference quad-core microprocessor. These calculations are further refined to estimate the actual performance of these FPGAs at floating-point calculations and compared with the microprocessor at its optimal design point and also away from this design point. Lastly, the paper explores the nature of floating-point calculations and looks at examples where the same algorithmic accuracy can be achieved with non-floating-point calculations.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"17 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74473846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecture of a vertically stacked reconfigurable computer
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745678, Pages: 1-6
D.S. Stevenson, R. O. Conn
This paper describes a scalable reconfigurable computing system built using a silicon circuit board (SiCB) design approach. The design attaches unpackaged FPGA and memory die to wafer-scale silicon substrates. The resulting reconfigurable computing system offers substantial improvements in density, cost, power consumption, and performance relative to other known integration or construction methods. A description of the SiCB technology and its detailed advantages is provided. In addition, the reconfigurable computer architecture resulting from the use of SiCBs is described.
{"title":"Architecture of a vertically stacked reconfigurable computer","authors":"D.S. Stevenson, R. O. Conn","doi":"10.1109/HPRCTA.2008.4745678","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745678","url":null,"abstract":"This paper describes a scalable reconfigurable computing system using a Silicon circuit board (SiCB) design approach. The design involves attachment of unpackaged FPGA and memory die to wafer scale silicon substrates. The resulting reconfigurable computing system has substantial density, cost, and power consumption and performance improvements relative to other known integration or construction methods. A description of the SiCB technology and detailed advantages are provided. In addition, the reconfigurable computer architecture resulting from the use of SiCBs is described.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"43 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76782827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massively parallelized Quasi-Monte Carlo financial simulation on a FPGA supercomputer
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745684, Pages: 1-8
Xiang Tian, K. Benkrid
Quasi-Monte Carlo simulation is a specialized Monte Carlo method that uses quasi-random, or low-discrepancy, numbers as the stochastic parameters. In many applications this method has proved advantageous compared to traditional Monte Carlo simulation, which uses pseudo-random numbers, as it converges more quickly and to a better level of accuracy. We implemented a massively parallelized Quasi-Monte Carlo simulation engine on Maxwell, an FPGA-based supercomputer developed at the University of Edinburgh. Maxwell consists of 32 IBM Intel Xeon blades, each hosting two Virtex-4 FPGA nodes through a PCI-X interface. A real hardware implementation of our FPGA-based Quasi-Monte Carlo engine on the Maxwell machine outperforms equivalent software implementations running on the Xeon processors by three orders of magnitude, with the speed-up scaling linearly with the number of processing nodes. The paper presents the detailed design and implementation of our Quasi-Monte Carlo engine in the context of financial derivatives pricing.
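To illustrate the convergence advantage the abstract claims, here is a minimal sketch (not from the paper) comparing plain pseudo-random Monte Carlo against a van der Corput low-discrepancy sequence on a toy integral; the integrand, sample count, and seed are illustrative assumptions.

```python
# Hedged sketch: pseudo-random vs. low-discrepancy (quasi-random) integration of a toy
# function on [0, 1]. The integrand and sample size are illustrative, not from the paper.
import math
import random

def van_der_corput(i: int, base: int = 2) -> float:
    """i-th element of the base-b van der Corput low-discrepancy sequence."""
    result, denom = 0.0, 1.0
    while i > 0:
        i, remainder = divmod(i, base)
        denom *= base
        result += remainder / denom
    return result

def integrate(points):
    # Estimate E[f(U)] for f(x) = exp(x); the exact answer is e - 1.
    return sum(math.exp(x) for x in points) / len(points)

n = 4096
random.seed(0)
exact = math.e - 1.0
mc_err = abs(integrate([random.random() for _ in range(n)]) - exact)
qmc_err = abs(integrate([van_der_corput(i + 1) for i in range(n)]) - exact)
print(f"MC error:  {mc_err:.2e}")   # roughly O(n^-1/2)
print(f"QMC error: {qmc_err:.2e}")  # roughly O(n^-1 log n), typically far smaller
```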
{"title":"Massively parallelized Quasi-Monte Carlo financial simulation on a FPGA supercomputer","authors":"Xiang Tian, K. Benkrid","doi":"10.1109/HPRCTA.2008.4745684","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745684","url":null,"abstract":"Quasi-Monte Carlo simulation is a specialized Monte Carlo method which uses quasi-random, or low-discrepancy, numbers as the stochastic parameters. In many applications, this method has proved advantageous compared to the traditional Monte Carlo simulation method, which uses pseudo-random numbers, as it converges relatively quickly, and with a better level of accuracy. We implemented a massively parallelized Quasi-Monte Carlo simulation engine on a FPGA-based supercomputer, called Maxwell, and developed at the University of Edinburgh. Maxwell consists of 32 IBM Intel Xeon blades each hosting two Virtex-4 FPGA nodes through PCI-X interface. Real hardware implementation of our FPGA-based quasi-Monte Carlo engine on the Maxwell machine outperforms equivalent software implementations running on the Xeon processors by 3 orders of magnitude, with the speed-up figure scaling linearly with the number of processing nodes. The paper presents the detailed design and implementation of our Quasi-Monte Carlo engine in the context of financial derivatives pricing.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"45 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82566743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementing phase unwrapping using Field Programmable Gate Arrays or Graphics Processing Units: A comparison
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745687, Pages: 1-10
S. Braganza, M. Leeser
Phase unwrapping is the process of converting discontinuous phase data into a continuous image. This procedure is required by any imaging technology that uses phase data, such as MRI, SAR, or OQM microscopy. Such algorithms often take a significant amount of time on a general-purpose computer, making it difficult to process large quantities of data. This paper compares implementations of a specific phase unwrapping algorithm, minimum Lp-norm unwrapping, on a field programmable gate array (FPGA) and on a graphics processing unit (GPU) for the purpose of acceleration. The computation involves a matrix preconditioner based on the discrete cosine transform (DCT) and a conjugate gradient solver, along with a few other matrix operations. These functions are partitioned to run on the host or the accelerator depending on the capabilities of the accelerator. The tradeoffs between the two platforms are analyzed and compared to a general-purpose processor (GPP) in terms of performance, power, and cost.
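To make the problem concrete, here is a sketch of phase unwrapping in its simplest 1-D form (Itoh's path-following method). This is deliberately not the minimum Lp-norm algorithm the paper accelerates; it only illustrates what removing the 2-pi discontinuities means.

```python
# Hedged sketch: basic 1-D phase unwrapping (Itoh's method). This is the simple
# path-following approach, NOT the minimum Lp-norm algorithm the paper implements.
import math

def unwrap_1d(phase):
    """Add multiples of 2*pi so successive samples never jump by more than pi."""
    out = [phase[0]]
    offset = 0.0
    for prev, cur in zip(phase, phase[1:]):
        delta = cur - prev
        if delta > math.pi:       # wrapped downward crossing: subtract a cycle
            offset -= 2.0 * math.pi
        elif delta < -math.pi:    # wrapped upward crossing: add a cycle
            offset += 2.0 * math.pi
        out.append(cur + offset)
    return out

# A linear ramp wrapped into (-pi, pi] unwraps back to the original ramp.
true_phase = [0.25 * i for i in range(40)]
wrapped = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
recovered = unwrap_1d(wrapped)
print(max(abs(a - b) for a, b in zip(recovered, true_phase)))  # ~0.0
```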
{"title":"Implementing phase unwrapping using Field Programmable Gate Arrays or Graphics Processing Units: A comparison","authors":"S. Braganza, M. Leeser","doi":"10.1109/HPRCTA.2008.4745687","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745687","url":null,"abstract":"Phase unwrapping is the process of converting discontinuous phase data into a continuous image. This procedure is required by any imaging technology that uses phase data such as MRI, SAR or OQM microscopy. Such algorithms often take a significant amount of time to process on a general purpose computer, rendering it difficult to process large quantities of information. This paper compares implementations of a specific phase unwrapping algorithm known as Minimum LP norm unwrapping on a field programmable gate array (FPGA) and on a graphics processing unit (GPU) for the purpose of acceleration. The computation required involves a matrix preconditioner (based on a DCT transform) and a conjugate gradient calculation along with a few other matrix operations. These functions are partitioned to run on the host or the accelerator depending on the capabilities of the accelerator. The tradeoffs between the two platforms are analyzed and compared to a general purpose processor (GPP) in terms of performance, power and cost.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"3 1","pages":"1-10"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80231259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance bounds of partial run-time reconfiguration in high-performance reconfigurable computing
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328561, Pages: 11-20
E. El-Araby, I. González, T. El-Ghazawi
High-Performance Reconfigurable Computing (HPRC) systems have always been characterized by their high performance and flexibility. Flexibility has traditionally been exploited through the Run-Time Reconfiguration (RTR) provided by most of the available platforms. However, RTR comes at the cost of high configuration overhead, which can negatively impact overall performance. Modern FPGAs have more advanced mechanisms for reducing configuration overhead, particularly Partial Run-Time Reconfiguration (PRTR), and PRTR is widely seen as a promising route to improving HPRC performance. In this work, we investigate the potential of PRTR on HPRC by formally analyzing the execution model and experimentally verifying our analytical findings by enabling PRTR, for the first time to the best of our knowledge, on one of the state-of-the-art HPRC systems, the Cray XD1. Our approach is general and can be applied to any of the available HPRC systems. The paper concludes with recommendations and conditions, based on our conceptual and experimental work, for the optimal utilization of PRTR, as well as possible future usage in HPRC.
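A minimal sketch of the kind of execution-time model such an analysis rests on: total time splits into compute plus configuration, and configuration time is assumed proportional to bitstream size. All figures below are illustrative, not the paper's measured bounds on the XD1.

```python
# Hedged sketch of a full- vs. partial-reconfiguration overhead model. Configuration time
# is assumed proportional to bitstream size; all numbers are hypothetical, not XD1 data.

def total_time(n_tasks: int, t_compute: float, bitstream_mb: float,
               config_mb_per_s: float) -> float:
    """Total time for n tasks when each task swap requires loading `bitstream_mb`."""
    t_config = bitstream_mb / config_mb_per_s
    return n_tasks * (t_config + t_compute)

full = total_time(n_tasks=100, t_compute=0.010, bitstream_mb=16.0, config_mb_per_s=100.0)
partial = total_time(n_tasks=100, t_compute=0.010, bitstream_mb=2.0, config_mb_per_s=100.0)
print(f"full RTR: {full:.2f} s")   # 17.00 s -> configuration overhead dominates
print(f"PRTR:     {partial:.2f} s")  # 3.00 s -> overhead shrinks with the partial bitstream
```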
{"title":"Performance bounds of partial run-time reconfiguration in high-performance reconfigurable computing","authors":"E. El-Araby, I. González, T. El-Ghazawi","doi":"10.1145/1328554.1328561","DOIUrl":"https://doi.org/10.1145/1328554.1328561","url":null,"abstract":"High-Performance Reconfigurable Computing (HPRC) systems have always been characterized by their high performance and flexibility. Flexibility has been traditionally exploited through the Run-Time Reconfiguration (RTR) provided by most of the available platforms. However, the RTR feature comes with the cost of high configuration overhead which might negatively impact the overall performance. Currently, modern FPGAs have more advanced mechanisms for reducing the configuration overheads, particularly Partial Run-Time Reconfiguration (PRTR). It has been perceived that PRTR on HPRC systems can be the trend for improving the performance. In this work, we will investigate the potential of PRTR on HPRC by formally analyzing the execution model and experimentally verifying our analytical findings by enabling PRTR for the first time, to the best of our knowledge, on one of the state-of-the-art HPRC systems, Cray XD1. Our approach is general and can be applied to any of the available HPRC systems. The paper will conclude with recommendations and conditions, based on our conceptual and experimental work, for the optimal utilization of PRTR as well as possible future usage in HPRC.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"7 1","pages":"11-20"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75222153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RAT: a methodology for predicting performance in application design migration to FPGAs
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328560, Pages: 1-10
B. Holland, K. Nagarajan, C. Conger, A. Jacobs, A. George
Before any application is migrated to a reconfigurable computer (RC), it is important to consider its amenability to the hardware paradigm. To maximize the probability of success for an application's migration to an FPGA, one must analyze, quickly and with a reasonable degree of accuracy, not only the performance of the system but also the precision and resources required to support a particular design. This extra preparation reduces the risk of failing to meet the application's design requirements (e.g., speed or area) by quantitatively predicting the expected performance and system utilization. This paper presents the RC Amenability Test (RAT), a methodology for rapidly analyzing an application's design compatibility with a specific FPGA platform.
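The flavor of analysis RAT formalizes can be sketched as a back-of-envelope throughput model: predicted accelerator time is communication time plus computation time. The equation form and every parameter value below are illustrative assumptions, not RAT's actual worksheet.

```python
# Hedged sketch of an analytic speedup prediction in the spirit of RAT (not its actual
# model). Predicted FPGA time = communication time + computation time; all figures are
# hypothetical assumptions.

def predicted_fpga_time(bytes_moved: float, interconnect_gb_s: float,
                        ops: float, clock_mhz: float, parallel_ops_per_cycle: int) -> float:
    t_comm = bytes_moved / (interconnect_gb_s * 1e9)          # data in and out of the FPGA
    t_comp = ops / (clock_mhz * 1e6 * parallel_ops_per_cycle)  # pipelined computation
    return t_comm + t_comp

t_sw = 0.80  # measured software baseline (hypothetical), seconds
t_fpga = predicted_fpga_time(bytes_moved=64e6, interconnect_gb_s=1.0,
                             ops=2e9, clock_mhz=100, parallel_ops_per_cycle=64)
print(f"predicted FPGA time: {t_fpga:.3f} s, predicted speedup: {t_sw / t_fpga:.1f}x")
```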
{"title":"RAT: a methodology for predicting performance in application design migration to FPGAs","authors":"B. Holland, K. Nagarajan, C. Conger, A. Jacobs, A. George","doi":"10.1145/1328554.1328560","DOIUrl":"https://doi.org/10.1145/1328554.1328560","url":null,"abstract":"Before any application is migrated to a reconfigurable computer (RC), it is important to consider its amenability to the hardware paradigm. In order to maximize the probability of success for an application's migration to an FPGA, one must quickly and with a reasonable degree of accuracy analyze not only the performance of the system but also the required precision and necessary resources to support a particular design. This extra preparation is meant to reduce the risk of failure to achieve the application's design requirements (e.g. speed or area) by quantitatively predicting the expected performance and system utilization. This paper presents the RC Amenability Test (RAT), a methodology for rapidly analyzing an application's design compatibility to a specific FPGA platform.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"2 1","pages":"1-10"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75602516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Language classification using n-grams accelerated by FPGA-based Bloom filters
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328564, Pages: 31-37
A. Jacob, M. Gokhale
N-gram (n-character sequences in text documents) counting is a well-established technique for classifying the language of the text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm), which uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our end-to-end language classification implementation runs at 85x the speed of comparable software and 1.45x that of the competing hardware design.
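A minimal software sketch of the core data structure, a Bloom filter holding a language's characteristic n-grams. The filter size, hash scheme, `BloomFilter` class, and the tiny training text are all illustrative assumptions, not the paper's design parameters.

```python
# Hedged sketch: a Bloom filter for n-gram membership, the software analogue of the
# on-chip-RAM filters described above. Sizes and hash scheme are illustrative choices.
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for k in range(self.n_hashes):  # carve k independent indices out of one digest
            yield int.from_bytes(digest[4 * k:4 * k + 4], "big") % self.n_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:  # false positives possible, no false negatives
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def ngrams(text: str, n: int = 4):
    return (text[i:i + n] for i in range(len(text) - n + 1))

english = BloomFilter()
for gram in ngrams("the quick brown fox jumps over the lazy dog"):
    english.add(gram)
sample = "the lazy fox"
hits = sum(1 for gram in ngrams(sample) if gram in english)
print(f"{hits} of {len(sample) - 3} 4-grams matched")  # crude per-language score
```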
{"title":"Language classification using n-grams accelerated by FPGA-based Bloom filters","authors":"A. Jacob, M. Gokhale","doi":"10.1145/1328554.1328564","DOIUrl":"https://doi.org/10.1145/1328554.1328564","url":null,"abstract":"N-Gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom Filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x comparable software and 1.45x the competing hardware design.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"26 1","pages":"31-37"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80607248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simulating data processing for an advanced ion mobility mass spectrometer
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328563, Pages: 21-29
D. Chavarría-Miranda, B. Clowers, G. Anderson, M. Belov
We have designed and implemented a Cray XD1-based simulation of data capture and signal processing for an advanced ion mobility mass spectrometer (Hadamard transform ion mobility). Our simulation is a hybrid application that uses both an FPGA component and a CPU-based software component to simulate ion mobility mass spectrometry data processing. The FPGA component includes data capture and accumulation, as well as a more sophisticated deconvolution algorithm based on a PNNL-developed enhancement to standard Hadamard transform ion mobility spectrometry. The software portion is in charge of streaming data to the FPGA and collecting results. We expect the computational and memory-addressing logic of the FPGA component to be portable to an instrument-attached FPGA board that can be interfaced with a Hadamard transform ion mobility mass spectrometer.
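A sketch of the Hadamard-transform multiplexing idea that the deconvolution step inverts: the instrument gates ions with a pseudo-random sequence, and the spectrum is recovered by solving the resulting circulant system. The sequence length, LFSR taps, and toy spectrum below are illustrative assumptions, not PNNL's enhanced algorithm.

```python
# Hedged sketch of Hadamard-transform multiplexing and deconvolution with a toy
# 7-element m-sequence; this illustrates the standard scheme, not the paper's variant.
import numpy as np

def m_sequence(taps=(3, 1), length=7):
    """Maximal-length binary sequence from a small LFSR (x^3 + x + 1 here)."""
    state = [1, 0, 0]
    seq = []
    for _ in range(length):
        seq.append(state[-1])
        feedback = state[taps[0] - 1] ^ state[taps[1] - 1]
        state = [feedback] + state[:-1]
    return np.array(seq)

code = m_sequence()  # gating sequence with weight (n+1)/2 = 4
S = np.array([np.roll(code, -i) for i in range(len(code))])  # circulant S-matrix

true_spectrum = np.array([0.0, 5.0, 0.0, 2.0, 0.0, 0.0, 1.0])
measured = S @ true_spectrum              # multiplexed (overlapped) measurement
recovered = np.linalg.solve(S, measured)  # deconvolution step
print(np.allclose(recovered, true_spectrum))  # True
```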
{"title":"Simulating data processing for an advanced ion mobility mass spectrometer","authors":"D. Chavarría-Miranda, B. Clowers, G. Anderson, M. Belov","doi":"10.1145/1328554.1328563","DOIUrl":"https://doi.org/10.1145/1328554.1328563","url":null,"abstract":"We have designed and implemented a Cray XD 1-based simulation of data capture and signal processing for an advanced Ion Mobility mass spectrometer (Hadamard transform Ion Mobility). Our simulation is a hybrid application that uses both an FPGA component and a CPU-based software component to simulate Ion Mobility mass spectrometry data processing. The FPGA component includes data capture and accumulation, as well as a more sophisticated deconvolution algorithm based on a PNNL-developed enhancement to standard Hadamard transform Ion Mobility spectrometry. The software portion is in charge of streaming data to the FPGA and collecting results. We expect the computational and memory addressing logic of the FPGA component to be portable to an instrument-attached FPGA board that can be interfaced with a Hadamard transform Ion Mobility mass spectrometer.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"21 1","pages":"21-29"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91155961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328565, Pages: 39-48
Peiheng Zhang, Guangming Tan, G. Gao
XD1000 is an innovative reconfigurable supercomputing platform developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high performance of HyperTransport interconnection. In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and protein sequences on this platform. The main features are: (1) a multistage PE (processing element) design that significantly reduces FPGA resource usage and hence allows more parallelism to be exploited; (2) a pipelined control mechanism with uneven stage latencies, a key to minimizing the overall PE pipeline cycle time; and (3) a compressed substitution-matrix storage structure, resulting in a substantial decrease in on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7 MHz, which achieves a peak performance of 25.6 GCUPS. Compared with the 2.2 GHz AMD Opteron host processor, the FPGA coprocessor achieves speedups of 185x and 250x, respectively.
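For reference, the recurrence each systolic PE evaluates once per cycle is shown below as a plain software rendition of Smith-Waterman with a linear gap penalty; the scoring parameters and example sequences are assumptions, since the paper's exact scoring scheme is not given here.

```python
# Hedged sketch: textbook Smith-Waterman local alignment with a linear gap penalty.
# Each FPGA PE computes one cell of this recurrence per cycle, so peak throughput is
# PEs x clock: 384 x 66.7 MHz = 25.6 giga cell-updates/s (GCUPS), matching the abstract.

def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            score = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),  # diagonal
                        prev[j] + gap,                                    # gap in b
                        cur[j - 1] + gap)                                 # gap in a
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # small DNA example
```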
{"title":"Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform","authors":"Peiheng Zhang, Guangming Tan, G. Gao","doi":"10.1145/1328554.1328565","DOIUrl":"https://doi.org/10.1145/1328554.1328565","url":null,"abstract":"An innovative reconfigurable supercomputing platform -- XD1000 is developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high-performance of Hyper-Transport interconnection. In this paper, we present the implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the platform. The main features include: (1) we bring forward a multistage PE (processing element) design which significantly reduces the FPGA resource usage and hence allows more parallelism to be exploited; (2) our design features a pipelined control mechanism with uneven stage latencies -- a key to minimize the overall PE pipeline cycle time; (3) we also put forward a compressed substitution matrix storage structure, resulting in substantial decrease of the on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7MHz, which can achieve 25.6GCUPS peak performance. Compared with the 2.2GHz AMD Opteron host processor, the FPGA coprocessor speedups 185X and 250X respectively.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"20 1","pages":"39-48"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87521399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}