Evaluating FPGAs for floating-point performance
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745680, Pages: 1-6
D. Strenski, Jim Simkins, R. Walke, Ralph Wittig
Field programmable gate arrays (FPGAs) have been available for more than 25 years. Initially they were used to simplify embedded processing circuits and then expanded into simulating application specific integrated circuit (ASIC) designs. In the past few years they have grown in density and speed to replace ASICs in some applications and to assist microprocessors as attached accelerators. This paper will calculate the floating-point peak performance for three types of FPGAs using 64-bit, 32-bit, and 24-bit word lengths and compare this with a reference quad-core microprocessor. These calculations are further refined to estimate the actual performance of these FPGAs at floating-point calculations and compared with the microprocessor at its optimal design point and also away from this design point. Lastly, the paper explores the nature of floating-point calculations and looks at examples where the same algorithmic accuracy can be achieved with non-floating-point calculations.
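For context, a peak floating-point figure for either device type reduces to a product of parallel functional units and clock rate. Below is a minimal sketch of that arithmetic; the unit counts, clock rates, and FLOPs-per-cycle figures are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: peak FLOPS = (floating-point units that fit) x (clock) x (ops per unit per cycle).
# All resource and clock figures below are illustrative assumptions, not the paper's values.

def fpga_peak_gflops(fpu_count: int, clock_mhz: float, ops_per_cycle: int = 2) -> float:
    """Peak GFLOP/s for an FPGA holding `fpu_count` fused multiply-add style units."""
    return fpu_count * clock_mhz * 1e6 * ops_per_cycle / 1e9

def cpu_peak_gflops(cores: int, clock_ghz: float, flops_per_core_cycle: int) -> float:
    """Peak GFLOP/s for a multicore CPU (e.g., 4 cores x 4 SIMD FLOPs per cycle)."""
    return cores * clock_ghz * flops_per_core_cycle

# Hypothetical 64-bit vs. 32-bit design points: narrower words let more units fit and clock higher.
print(fpga_peak_gflops(fpu_count=40, clock_mhz=250))   # e.g. 64-bit units -> 20.0 GFLOP/s
print(fpga_peak_gflops(fpu_count=160, clock_mhz=300))  # e.g. 32-bit units -> 96.0 GFLOP/s
print(cpu_peak_gflops(cores=4, clock_ghz=2.5, flops_per_core_cycle=4))  # quad-core -> 40.0 GFLOP/s
```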
{"title":"Evaluating FPGAs for floating-point performance","authors":"D. Strenski, Jim Simkins, R. Walke, Ralph Wittig","doi":"10.1109/HPRCTA.2008.4745680","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745680","url":null,"abstract":"Field programmable gate arrays (FPGAs) have been available for more than 25 years. Initially they were used to simplify embedded processing circuits and then expanded into simulating application specific integrated circuit (ASIC) designs. In the past few years they have grown in density and speed to replace ASICs in some applications and to assist microprocessors as attached accelerators. This paper will calculate the floating-point peak performance for three types of FPGAs using 64-bit, 32-bit, and 24-bit word lengths and compare this with a reference quad-core microprocessor. These calculations are further refined to estimate the actual performance of these FPGAs at floating-point calculations and compared with the microprocessor at its optimal design point and also away from this design point. Lastly, the paper explores the nature of floating-point calculations and looks at examples where the same algorithmic accuracy can be achieved with non-floating-point calculations.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"17 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74473846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecture of a vertically stacked reconfigurable computer
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745678, Pages: 1-6
D.S. Stevenson, R. O. Conn
This paper describes a scalable reconfigurable computing system built using a silicon circuit board (SiCB) design approach. The design attaches unpackaged FPGA and memory die to wafer-scale silicon substrates. The resulting reconfigurable computing system offers substantial improvements in density, cost, power consumption, and performance relative to other known integration or construction methods. A description of the SiCB technology and its detailed advantages is provided. In addition, the reconfigurable computer architecture resulting from the use of SiCBs is described.
{"title":"Architecture of a vertically stacked reconfigurable computer","authors":"D.S. Stevenson, R. O. Conn","doi":"10.1109/HPRCTA.2008.4745678","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745678","url":null,"abstract":"This paper describes a scalable reconfigurable computing system using a Silicon circuit board (SiCB) design approach. The design involves attachment of unpackaged FPGA and memory die to wafer scale silicon substrates. The resulting reconfigurable computing system has substantial density, cost, and power consumption and performance improvements relative to other known integration or construction methods. A description of the SiCB technology and detailed advantages are provided. In addition, the reconfigurable computer architecture resulting from the use of SiCBs is described.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"43 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76782827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massively parallelized Quasi-Monte Carlo financial simulation on a FPGA supercomputer
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745684, Pages: 1-8
Xiang Tian, K. Benkrid
Quasi-Monte Carlo simulation is a specialized Monte Carlo method that uses quasi-random, or low-discrepancy, numbers as the stochastic parameters. In many applications this method has proved advantageous compared to traditional Monte Carlo simulation, which uses pseudo-random numbers, as it converges more quickly and to a better level of accuracy. We implemented a massively parallelized Quasi-Monte Carlo simulation engine on Maxwell, an FPGA-based supercomputer developed at the University of Edinburgh. Maxwell consists of 32 IBM Intel Xeon blades, each hosting two Virtex-4 FPGA nodes through a PCI-X interface. A real hardware implementation of our FPGA-based Quasi-Monte Carlo engine on the Maxwell machine outperforms equivalent software implementations running on the Xeon processors by three orders of magnitude, with the speed-up scaling linearly with the number of processing nodes. The paper presents the detailed design and implementation of our Quasi-Monte Carlo engine in the context of financial derivatives pricing.
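To illustrate the convergence advantage the abstract claims, here is a minimal sketch (not from the paper) comparing plain pseudo-random Monte Carlo against a van der Corput low-discrepancy sequence on a toy integral; the integrand, sample count, and seed are illustrative assumptions.

```python
# Hedged sketch: pseudo-random vs. low-discrepancy (quasi-random) integration of a toy
# function on [0, 1]. The integrand and sample size are illustrative, not from the paper.
import math
import random

def van_der_corput(i: int, base: int = 2) -> float:
    """i-th element of the base-b van der Corput low-discrepancy sequence."""
    result, denom = 0.0, 1.0
    while i > 0:
        i, remainder = divmod(i, base)
        denom *= base
        result += remainder / denom
    return result

def integrate(points):
    # Estimate E[f(U)] for f(x) = exp(x); the exact answer is e - 1.
    return sum(math.exp(x) for x in points) / len(points)

n = 4096
random.seed(0)
exact = math.e - 1.0
mc_err = abs(integrate([random.random() for _ in range(n)]) - exact)
qmc_err = abs(integrate([van_der_corput(i + 1) for i in range(n)]) - exact)
print(f"MC error:  {mc_err:.2e}")   # roughly O(n^-1/2)
print(f"QMC error: {qmc_err:.2e}")  # roughly O(n^-1 log n), typically far smaller
```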
{"title":"Massively parallelized Quasi-Monte Carlo financial simulation on a FPGA supercomputer","authors":"Xiang Tian, K. Benkrid","doi":"10.1109/HPRCTA.2008.4745684","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745684","url":null,"abstract":"Quasi-Monte Carlo simulation is a specialized Monte Carlo method which uses quasi-random, or low-discrepancy, numbers as the stochastic parameters. In many applications, this method has proved advantageous compared to the traditional Monte Carlo simulation method, which uses pseudo-random numbers, as it converges relatively quickly, and with a better level of accuracy. We implemented a massively parallelized Quasi-Monte Carlo simulation engine on a FPGA-based supercomputer, called Maxwell, and developed at the University of Edinburgh. Maxwell consists of 32 IBM Intel Xeon blades each hosting two Virtex-4 FPGA nodes through PCI-X interface. Real hardware implementation of our FPGA-based quasi-Monte Carlo engine on the Maxwell machine outperforms equivalent software implementations running on the Xeon processors by 3 orders of magnitude, with the speed-up figure scaling linearly with the number of processing nodes. The paper presents the detailed design and implementation of our Quasi-Monte Carlo engine in the context of financial derivatives pricing.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"45 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82566743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementing phase unwrapping using Field Programmable Gate Arrays or Graphics Processing Units: A comparison
Pub Date: 2008-11-01, DOI: 10.1109/HPRCTA.2008.4745687, Pages: 1-10
S. Braganza, M. Leeser
Phase unwrapping is the process of converting discontinuous phase data into a continuous image. This procedure is required by any imaging technology that uses phase data, such as MRI, SAR, or OQM microscopy. Such algorithms often take a significant amount of time on a general-purpose computer, making it difficult to process large quantities of data. This paper compares implementations of a specific phase unwrapping algorithm, minimum Lp-norm unwrapping, on a field programmable gate array (FPGA) and on a graphics processing unit (GPU) for the purpose of acceleration. The computation involves a matrix preconditioner based on the discrete cosine transform (DCT) and a conjugate gradient solver, along with a few other matrix operations. These functions are partitioned to run on the host or the accelerator depending on the capabilities of the accelerator. The tradeoffs between the two platforms are analyzed and compared to a general-purpose processor (GPP) in terms of performance, power, and cost.
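To make the problem concrete, here is a sketch of phase unwrapping in its simplest 1-D form (Itoh's path-following method). This is deliberately not the minimum Lp-norm algorithm the paper accelerates; it only illustrates what removing the 2-pi discontinuities means.

```python
# Hedged sketch: basic 1-D phase unwrapping (Itoh's method). This is the simple
# path-following approach, NOT the minimum Lp-norm algorithm the paper implements.
import math

def unwrap_1d(phase):
    """Add multiples of 2*pi so successive samples never jump by more than pi."""
    out = [phase[0]]
    offset = 0.0
    for prev, cur in zip(phase, phase[1:]):
        delta = cur - prev
        if delta > math.pi:       # wrapped downward crossing: subtract a cycle
            offset -= 2.0 * math.pi
        elif delta < -math.pi:    # wrapped upward crossing: add a cycle
            offset += 2.0 * math.pi
        out.append(cur + offset)
    return out

# A linear ramp wrapped into (-pi, pi] unwraps back to the original ramp.
true_phase = [0.25 * i for i in range(40)]
wrapped = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
recovered = unwrap_1d(wrapped)
print(max(abs(a - b) for a, b in zip(recovered, true_phase)))  # ~0.0
```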
{"title":"Implementing phase unwrapping using Field Programmable Gate Arrays or Graphics Processing Units: A comparison","authors":"S. Braganza, M. Leeser","doi":"10.1109/HPRCTA.2008.4745687","DOIUrl":"https://doi.org/10.1109/HPRCTA.2008.4745687","url":null,"abstract":"Phase unwrapping is the process of converting discontinuous phase data into a continuous image. This procedure is required by any imaging technology that uses phase data such as MRI, SAR or OQM microscopy. Such algorithms often take a significant amount of time to process on a general purpose computer, rendering it difficult to process large quantities of information. This paper compares implementations of a specific phase unwrapping algorithm known as Minimum LP norm unwrapping on a field programmable gate array (FPGA) and on a graphics processing unit (GPU) for the purpose of acceleration. The computation required involves a matrix preconditioner (based on a DCT transform) and a conjugate gradient calculation along with a few other matrix operations. These functions are partitioned to run on the host or the accelerator depending on the capabilities of the accelerator. The tradeoffs between the two platforms are analyzed and compared to a general purpose processor (GPP) in terms of performance, power and cost.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"3 1","pages":"1-10"},"PeriodicalIF":0.0,"publicationDate":"2008-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80231259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance bounds of partial run-time reconfiguration in high-performance reconfigurable computing
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328561, Pages: 11-20
E. El-Araby, I. González, T. El-Ghazawi
High-Performance Reconfigurable Computing (HPRC) systems have always been characterized by their high performance and flexibility. Flexibility has traditionally been exploited through the Run-Time Reconfiguration (RTR) provided by most of the available platforms. However, RTR comes at the cost of high configuration overhead, which can negatively impact overall performance. Modern FPGAs have more advanced mechanisms for reducing configuration overhead, particularly Partial Run-Time Reconfiguration (PRTR), and PRTR is widely seen as a promising route to improving HPRC performance. In this work, we investigate the potential of PRTR on HPRC by formally analyzing the execution model and experimentally verifying our analytical findings by enabling PRTR, for the first time to the best of our knowledge, on one of the state-of-the-art HPRC systems, the Cray XD1. Our approach is general and can be applied to any of the available HPRC systems. The paper concludes with recommendations and conditions, based on our conceptual and experimental work, for the optimal utilization of PRTR, as well as possible future usage in HPRC.
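A minimal sketch of the kind of execution-time model such an analysis rests on: total time splits into compute plus configuration, and configuration time is assumed proportional to bitstream size. All figures below are illustrative, not the paper's measured bounds on the XD1.

```python
# Hedged sketch of a full- vs. partial-reconfiguration overhead model. Configuration time
# is assumed proportional to bitstream size; all numbers are hypothetical, not XD1 data.

def total_time(n_tasks: int, t_compute: float, bitstream_mb: float,
               config_mb_per_s: float) -> float:
    """Total time for n tasks when each task swap requires loading `bitstream_mb`."""
    t_config = bitstream_mb / config_mb_per_s
    return n_tasks * (t_config + t_compute)

full = total_time(n_tasks=100, t_compute=0.010, bitstream_mb=16.0, config_mb_per_s=100.0)
partial = total_time(n_tasks=100, t_compute=0.010, bitstream_mb=2.0, config_mb_per_s=100.0)
print(f"full RTR: {full:.2f} s")   # 17.00 s -> configuration overhead dominates
print(f"PRTR:     {partial:.2f} s")  # 3.00 s -> overhead shrinks with the partial bitstream
```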
{"title":"Performance bounds of partial run-time reconfiguration in high-performance reconfigurable computing","authors":"E. El-Araby, I. González, T. El-Ghazawi","doi":"10.1145/1328554.1328561","DOIUrl":"https://doi.org/10.1145/1328554.1328561","url":null,"abstract":"High-Performance Reconfigurable Computing (HPRC) systems have always been characterized by their high performance and flexibility. Flexibility has been traditionally exploited through the Run-Time Reconfiguration (RTR) provided by most of the available platforms. However, the RTR feature comes with the cost of high configuration overhead which might negatively impact the overall performance. Currently, modern FPGAs have more advanced mechanisms for reducing the configuration overheads, particularly Partial Run-Time Reconfiguration (PRTR). It has been perceived that PRTR on HPRC systems can be the trend for improving the performance. In this work, we will investigate the potential of PRTR on HPRC by formally analyzing the execution model and experimentally verifying our analytical findings by enabling PRTR for the first time, to the best of our knowledge, on one of the state-of-the-art HPRC systems, Cray XD1. Our approach is general and can be applied to any of the available HPRC systems. The paper will conclude with recommendations and conditions, based on our conceptual and experimental work, for the optimal utilization of PRTR as well as possible future usage in HPRC.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"7 1","pages":"11-20"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75222153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RAT: a methodology for predicting performance in application design migration to FPGAs
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328560, Pages: 1-10
B. Holland, K. Nagarajan, C. Conger, A. Jacobs, A. George
Before any application is migrated to a reconfigurable computer (RC), it is important to consider its amenability to the hardware paradigm. To maximize the probability of success for an application's migration to an FPGA, one must analyze, quickly and with a reasonable degree of accuracy, not only the performance of the system but also the precision and resources required to support a particular design. This extra preparation reduces the risk of failing to meet the application's design requirements (e.g., speed or area) by quantitatively predicting the expected performance and system utilization. This paper presents the RC Amenability Test (RAT), a methodology for rapidly analyzing an application's design compatibility with a specific FPGA platform.
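The flavor of analysis RAT formalizes can be sketched as a back-of-envelope throughput model: predicted accelerator time is communication time plus computation time. The equation form and every parameter value below are illustrative assumptions, not RAT's actual worksheet.

```python
# Hedged sketch of an analytic speedup prediction in the spirit of RAT (not its actual
# model). Predicted FPGA time = communication time + computation time; all figures are
# hypothetical assumptions.

def predicted_fpga_time(bytes_moved: float, interconnect_gb_s: float,
                        ops: float, clock_mhz: float, parallel_ops_per_cycle: int) -> float:
    t_comm = bytes_moved / (interconnect_gb_s * 1e9)          # data in and out of the FPGA
    t_comp = ops / (clock_mhz * 1e6 * parallel_ops_per_cycle)  # pipelined computation
    return t_comm + t_comp

t_sw = 0.80  # measured software baseline (hypothetical), seconds
t_fpga = predicted_fpga_time(bytes_moved=64e6, interconnect_gb_s=1.0,
                             ops=2e9, clock_mhz=100, parallel_ops_per_cycle=64)
print(f"predicted FPGA time: {t_fpga:.3f} s, predicted speedup: {t_sw / t_fpga:.1f}x")
```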
{"title":"RAT: a methodology for predicting performance in application design migration to FPGAs","authors":"B. Holland, K. Nagarajan, C. Conger, A. Jacobs, A. George","doi":"10.1145/1328554.1328560","DOIUrl":"https://doi.org/10.1145/1328554.1328560","url":null,"abstract":"Before any application is migrated to a reconfigurable computer (RC), it is important to consider its amenability to the hardware paradigm. In order to maximize the probability of success for an application's migration to an FPGA, one must quickly and with a reasonable degree of accuracy analyze not only the performance of the system but also the required precision and necessary resources to support a particular design. This extra preparation is meant to reduce the risk of failure to achieve the application's design requirements (e.g. speed or area) by quantitatively predicting the expected performance and system utilization. This paper presents the RC Amenability Test (RAT), a methodology for rapidly analyzing an application's design compatibility to a specific FPGA platform.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"2 1","pages":"1-10"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75602516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Language classification using n-grams accelerated by FPGA-based Bloom filters
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328564, Pages: 31-37
A. Jacob, M. Gokhale
N-gram (n-character sequences in text documents) counting is a well-established technique for classifying the language of the text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm), which uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our end-to-end language classification implementation runs at 85x the speed of comparable software and 1.45x that of the competing hardware design.
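A minimal software sketch of the core data structure, a Bloom filter holding a language's characteristic n-grams. The filter size, hash scheme, `BloomFilter` class, and the tiny training text are all illustrative assumptions, not the paper's design parameters.

```python
# Hedged sketch: a Bloom filter for n-gram membership, the software analogue of the
# on-chip-RAM filters described above. Sizes and hash scheme are illustrative choices.
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for k in range(self.n_hashes):  # carve k independent indices out of one digest
            yield int.from_bytes(digest[4 * k:4 * k + 4], "big") % self.n_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:  # false positives possible, no false negatives
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def ngrams(text: str, n: int = 4):
    return (text[i:i + n] for i in range(len(text) - n + 1))

english = BloomFilter()
for gram in ngrams("the quick brown fox jumps over the lazy dog"):
    english.add(gram)
sample = "the lazy fox"
hits = sum(1 for gram in ngrams(sample) if gram in english)
print(f"{hits} of {len(sample) - 3} 4-grams matched")  # crude per-language score
```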
{"title":"Language classification using n-grams accelerated by FPGA-based Bloom filters","authors":"A. Jacob, M. Gokhale","doi":"10.1145/1328554.1328564","DOIUrl":"https://doi.org/10.1145/1328554.1328564","url":null,"abstract":"N-Gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom Filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x comparable software and 1.45x the competing hardware design.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"26 1","pages":"31-37"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80607248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simulating data processing for an advanced ion mobility mass spectrometer
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328563, Pages: 21-29
D. Chavarría-Miranda, B. Clowers, G. Anderson, M. Belov
We have designed and implemented a Cray XD1-based simulation of data capture and signal processing for an advanced ion mobility mass spectrometer (Hadamard transform ion mobility). Our simulation is a hybrid application that uses both an FPGA component and a CPU-based software component to simulate ion mobility mass spectrometry data processing. The FPGA component includes data capture and accumulation, as well as a more sophisticated deconvolution algorithm based on a PNNL-developed enhancement to standard Hadamard transform ion mobility spectrometry. The software portion is in charge of streaming data to the FPGA and collecting results. We expect the computational and memory-addressing logic of the FPGA component to be portable to an instrument-attached FPGA board that can be interfaced with a Hadamard transform ion mobility mass spectrometer.
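A sketch of the Hadamard-transform multiplexing idea that the deconvolution step inverts: the instrument gates ions with a pseudo-random sequence, and the spectrum is recovered by solving the resulting circulant system. The sequence length, LFSR taps, and toy spectrum below are illustrative assumptions, not PNNL's enhanced algorithm.

```python
# Hedged sketch of Hadamard-transform multiplexing and deconvolution with a toy
# 7-element m-sequence; this illustrates the standard scheme, not the paper's variant.
import numpy as np

def m_sequence(taps=(3, 1), length=7):
    """Maximal-length binary sequence from a small LFSR (x^3 + x + 1 here)."""
    state = [1, 0, 0]
    seq = []
    for _ in range(length):
        seq.append(state[-1])
        feedback = state[taps[0] - 1] ^ state[taps[1] - 1]
        state = [feedback] + state[:-1]
    return np.array(seq)

code = m_sequence()  # gating sequence with weight (n+1)/2 = 4
S = np.array([np.roll(code, -i) for i in range(len(code))])  # circulant S-matrix

true_spectrum = np.array([0.0, 5.0, 0.0, 2.0, 0.0, 0.0, 1.0])
measured = S @ true_spectrum              # multiplexed (overlapped) measurement
recovered = np.linalg.solve(S, measured)  # deconvolution step
print(np.allclose(recovered, true_spectrum))  # True
```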
{"title":"Simulating data processing for an advanced ion mobility mass spectrometer","authors":"D. Chavarría-Miranda, B. Clowers, G. Anderson, M. Belov","doi":"10.1145/1328554.1328563","DOIUrl":"https://doi.org/10.1145/1328554.1328563","url":null,"abstract":"We have designed and implemented a Cray XD 1-based simulation of data capture and signal processing for an advanced Ion Mobility mass spectrometer (Hadamard transform Ion Mobility). Our simulation is a hybrid application that uses both an FPGA component and a CPU-based software component to simulate Ion Mobility mass spectrometry data processing. The FPGA component includes data capture and accumulation, as well as a more sophisticated deconvolution algorithm based on a PNNL-developed enhancement to standard Hadamard transform Ion Mobility spectrometry. The software portion is in charge of streaming data to the FPGA and collecting results. We expect the computational and memory addressing logic of the FPGA component to be portable to an instrument-attached FPGA board that can be interfaced with a Hadamard transform Ion Mobility mass spectrometer.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"21 1","pages":"21-29"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91155961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform
Pub Date: 2007-11-11, DOI: 10.1145/1328554.1328565, Pages: 39-48
Peiheng Zhang, Guangming Tan, G. Gao
XD1000 is an innovative reconfigurable supercomputing platform developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high performance of HyperTransport interconnection. In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and protein sequences on this platform. The main features are: (1) a multistage PE (processing element) design that significantly reduces FPGA resource usage and hence allows more parallelism to be exploited; (2) a pipelined control mechanism with uneven stage latencies, a key to minimizing the overall PE pipeline cycle time; and (3) a compressed substitution-matrix storage structure, resulting in a substantial decrease in on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7 MHz, which achieves a peak performance of 25.6 GCUPS. Compared with the 2.2 GHz AMD Opteron host processor, the FPGA coprocessor achieves speedups of 185x and 250x, respectively.
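For reference, the recurrence each systolic PE evaluates once per cycle is shown below as a plain software rendition of Smith-Waterman with a linear gap penalty; the scoring parameters and example sequences are assumptions, since the paper's exact scoring scheme is not given here.

```python
# Hedged sketch: textbook Smith-Waterman local alignment with a linear gap penalty.
# Each FPGA PE computes one cell of this recurrence per cycle, so peak throughput is
# PEs x clock: 384 x 66.7 MHz = 25.6 giga cell-updates/s (GCUPS), matching the abstract.

def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            score = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),  # diagonal
                        prev[j] + gap,                                    # gap in b
                        cur[j - 1] + gap)                                 # gap in a
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # small DNA example
```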
{"title":"Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform","authors":"Peiheng Zhang, Guangming Tan, G. Gao","doi":"10.1145/1328554.1328565","DOIUrl":"https://doi.org/10.1145/1328554.1328565","url":null,"abstract":"An innovative reconfigurable supercomputing platform -- XD1000 is developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high-performance of Hyper-Transport interconnection. In this paper, we present the implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the platform. The main features include: (1) we bring forward a multistage PE (processing element) design which significantly reduces the FPGA resource usage and hence allows more parallelism to be exploited; (2) our design features a pipelined control mechanism with uneven stage latencies -- a key to minimize the overall PE pipeline cycle time; (3) we also put forward a compressed substitution matrix storage structure, resulting in substantial decrease of the on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7MHz, which can achieve 25.6GCUPS peak performance. Compared with the 2.2GHz AMD Opteron host processor, the FPGA coprocessor speedups 185X and 250X respectively.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"20 1","pages":"39-48"},"PeriodicalIF":0.0,"publicationDate":"2007-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87521399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}