Exploiting SPM-aware Scheduling on EPIC architectures for high-performance real-time systems
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408658
Yu Liu, Wei Zhang
In contemporary computer architectures, Explicitly Parallel Instruction Computing (EPIC) lets microprocessors exploit Instruction Level Parallelism (ILP) through the compiler, rather than through the complex on-die circuitry that superscalar architectures use to control parallel instruction execution. Building on EPIC, this paper proposes a time-predictable two-level scratchpad-based memory architecture and a scratchpad-aware scheduling method that improves performance by optimizing the load-to-use distance.
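The load-to-use distance the abstract optimizes is the gap, in scheduled instruction slots, between a load and the first instruction that consumes its result; widening that gap lets independent work hide scratchpad latency. The C++ sketch below is illustrative only (the `Instr` representation and the example schedule are invented for this note, not taken from the paper) and simply measures that distance over a linear schedule:

```cpp
#include <iostream>
#include <string>
#include <vector>

// One instruction in an already-scheduled linear stream.
// 'def' is the register the instruction writes; 'uses' are registers read.
struct Instr {
    std::string opcode;
    std::string def;                 // destination register ("" if none)
    std::vector<std::string> uses;   // source registers
};

// For a load at position loadIdx, count slots until the first consumer of
// its result. A scheduler that hoists loads earlier (or sinks consumers
// later) increases this distance, hiding memory latency.
int loadToUseDistance(const std::vector<Instr>& sched, size_t loadIdx) {
    const std::string& reg = sched[loadIdx].def;
    for (size_t i = loadIdx + 1; i < sched.size(); ++i)
        for (const auto& u : sched[i].uses)
            if (u == reg) return static_cast<int>(i - loadIdx);
    return -1;  // result never used
}

int main() {
    std::vector<Instr> sched = {
        {"ld",  "r1", {"r9"}},        // load from scratchpad
        {"add", "r2", {"r3", "r4"}},  // independent work fills the gap
        {"mul", "r5", {"r1", "r2"}},  // first use of r1 -> distance 2
    };
    std::cout << "load-to-use distance: " << loadToUseDistance(sched, 0) << "\n";
}
```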
{"title":"Exploiting SPM-aware Scheduling on EPIC architectures for high-performance real-time systems","authors":"Yu Liu, Wei Zhang","doi":"10.1109/HPEC.2012.6408658","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408658","url":null,"abstract":"In contemporary computer architectures, the Explicitly Parallel Instruction Computing Architectures (EPIC) permits microprocessors to implement Instruction Level Parallelism (ILP) by using the compiler, rather than complex on-die circuitry to control parallel instruction execution like the superscalar architecture. Based on the EPIC, this paper proposes a time predictable two-level scratchpad based memory architecture, and a Scratchpad-aware Scheduling method to improve the performance by optimizing the Load-To-Use Distance.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129171184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An application of the constraint programming to the design and operation of synthetic aperture radars
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408663
M. Holzrichter
The design and operation of synthetic aperture radars require compatible sets of hundreds of quantities. Compatibility is achieved when these quantities satisfy constraints arising from physics, geometry, and so on. In the aggregate, these quantities and constraints form a logical model of the radar. In practice, the logical model is distributed over multiple people, documents, and software modules, and thereby becomes fragmented. Fragmentation gives rise to inconsistencies and errors. The SAR Inference Engine addresses this fragmentation by implementing the logical model of a Sandia synthetic aperture radar in a form intended to be usable from system design through mission planning to actual operation of the radar. These diverse contexts demand extreme flexibility, which is achieved by employing the constraint programming paradigm.
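As a toy illustration of the paradigm (not Sandia's engine), consider a single textbook radar relation, slant-range resolution dR = c / (2B) for waveform bandwidth B: a constraint keeps the two quantities compatible by deriving whichever side is unknown and vetoing assignments that disagree. The `RadarModel` type and `propagate` routine below are invented for this sketch:

```cpp
#include <cmath>
#include <iostream>
#include <optional>

// Toy constraint: slant-range resolution dR = c / (2 * bandwidth B).
// A real engine holds hundreds of such relations and propagates them to a
// fixed point; this sketch handles one constraint bidirectionally.
constexpr double C = 299792458.0;  // speed of light, m/s

struct RadarModel {
    std::optional<double> bandwidthHz;
    std::optional<double> rangeResM;
};

// Returns false if the known values violate the constraint.
bool propagate(RadarModel& m) {
    if (m.bandwidthHz && m.rangeResM)            // both known: consistency check
        return std::abs(*m.rangeResM - C / (2 * *m.bandwidthHz)) < 1e-9;
    if (m.bandwidthHz) m.rangeResM = C / (2 * *m.bandwidthHz);   // derive dR
    else if (m.rangeResM) m.bandwidthHz = C / (2 * *m.rangeResM); // derive B
    return true;  // underdetermined or newly derived
}

int main() {
    RadarModel m;
    m.bandwidthHz = 500e6;  // 500 MHz waveform
    propagate(m);
    std::cout << "derived resolution: " << *m.rangeResM << " m\n";  // ~0.3 m
}
```

A full engine repeats propagation over all constraints until nothing changes, so fixing any subset of quantities, whether during design or mission planning, fills in or vetoes the rest.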
{"title":"An application of the constraint programming to the design and operation of synthetic aperture radars","authors":"M. Holzrichter","doi":"10.1109/HPEC.2012.6408663","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408663","url":null,"abstract":"The design and operation of synthetic aperture radars require compatible sets of hundreds of quantities. Compatibility is achieved when these quantities satisfy constraints arising from physics, geometry etc. In the aggregate these quantities and constraints form a logical model of the radar. In practice the logical model is distributed over multiple people, documents and software modules thereby becoming fragmented. Fragmentation gives rise to inconsistencies and errors. The SAR Inference Engine addresses the fragmentation problem by implementing the logical model of a Sandia synthetic aperture radar in a form that is intended to be usable from system design to mission planning to actual operation of the radar. These diverse contexts require extreme flexibility that is achieved by employing the constraint programming paradigm.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"54 21","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120870676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable cryptographic authentication for high performance computing
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408671
Andrew Prout, W. Arcand, David Bestor, C. Byun, Bill Bergeron, M. Hubbell, J. Kepner, P. Michaleas, J. Mullen, A. Reuther, Antonio Rosa
High performance computing (HPC) uses supercomputers and computing clusters to solve large computational problems. HPC resources are frequently shared systems, and access to restricted data sets or resources must be authenticated. These authentication needs take multiple forms, both internal and external to the HPC cluster: a computational stack that uses web services among nodes may need to authenticate nodes of the same job to each other, or a job may need to reach out to data sources outside the cluster. Traditional authentication mechanisms such as passwords or digital certificates encounter problems with the distributed and potentially disconnected nature of HPC systems. Distributing and storing plain-text passwords or cryptographic keys among nodes in an HPC system without special protection is poor security practice. Systems that reach back to the user's terminal for access to the authenticator are possible, but only in fully interactive supercomputing, where connectivity to the user's terminal can be guaranteed. Point solutions exist for these use cases, such as software-based roles or self-signed certificates; however, they require significant expertise in digital certificates to configure. A more general solution is called for, one that is both secure and easy to use. This paper presents an overview of a solution implemented on the interactive, on-demand LLGrid computing system [1,2,3] at MIT Lincoln Laboratory and its use to solve one such authentication problem.
{"title":"Scalable cryptographic authentication for high performance computing","authors":"Andrew Prout, W. Arcand, David Bestor, C. Byun, Bill Bergeron, M. Hubbell, J. Kepner, P. Michaleas, J. Mullen, A. Reuther, Antonio Rosa","doi":"10.1109/HPEC.2012.6408671","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408671","url":null,"abstract":"High performance computing (HPC) uses supercomputers and computing clusters to solve large computational problems. Frequently HPC resources are shared systems and access to restricted data sets or resources must be authenticated. These authentication needs can take multiple forms, both internal and external to the HPC cluster. A computational stack that uses web services among nodes in the HPC may need to perform authentication between nodes of the same job or a job may need to reach out to data sources outside the HPC. Traditional authentication mechanisms such as passwords or digital certificates encounter issues with the distributed and potentially disconnected nature of HPC systems. Distributing and storing plain-text passwords or cryptographic keys among nodes in a HPC system without special protection is a poor security practice. Systems that reach back to the user's terminal for access to the authenticator are possible, but only in fully interactive supercomputing where connectivity to the user's terminal can be guaranteed. Point solutions can be enabled for these use cases, such as software-based role or self-signed certificates, however they require significant expertise in digital certificates to configure. A more general solution is called for that is both secure and easy to use. This paper presents an overview of a solution implemented on the interactive, on-demand LLGrid computing system [1,2,3] at MIT Lincoln Laboratory and its use to solve one such authentication problem.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121019638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruggedization of MXM graphics modules
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408666
I. Straznicky
MXM modules, originally designed to package graphics processing devices for benign environments, have been tested for use in the harsh environments typical of deployed defense and aerospace systems. Results show that MXM GP-GPU modules with appropriate mechanical design can survive these environments and successfully bring the enormous processing capability of the latest generation of GPUs to harsh-environment applications.
{"title":"Ruggedization of MXM graphics modules","authors":"I. Straznicky","doi":"10.1109/HPEC.2012.6408666","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408666","url":null,"abstract":"MXM modules, used to package graphics processing devices for use in benign environments, have been tested for use in harsh environments typical of deployed defense and aerospace systems. Results show that specially mechanically designed MXM GP-GPU modules can survive these environments, and successfully provide the enormous processing capability offered by the latest generation of GPUs to harsh environment applications.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124475237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPC-VMs: Virtual machines in high performance computing systems
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408668
A. Reuther, P. Michaleas, Andrew Prout, J. Kepner
The concept of virtual machines dates back to the 1960s, when both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the host operating system executing on a computer system, while isolating the guest operating systems from each other. In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of capabilities, from business management systems to email server farms. Those who have adopted virtual deployment environments have capitalized on a variety of advantages, including server consolidation, service migration, and higher service reliability, but they have also encountered challenges, including sacrificed performance and more complex system management. Some of these advantages and challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high performance computing (HPC) environment. We propose adding virtual machine capability to already robust HPC environments for specific scenarios where the productivity gained outweighs the performance lost. Finally, we discuss an implementation that adds virtual machines to the software stack of an HPC cluster, and we analyze its effect on job launch time.
{"title":"HPC-VMs: Virtual machines in high performance computing systems","authors":"A. Reuther, P. Michaleas, Andrew Prout, J. Kepner","doi":"10.1109/HPEC.2012.6408668","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408668","url":null,"abstract":"The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the host operating system executing on a computer system, while isolating each of the guest operating systems from each other. 1 In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of capabilities from business management systems to email server farms. Those who have adopted virtual deployment environments have capitalized on a variety of advantages including server consolidation, service migration, and higher service reliability. But they have also ended up with some challenges including a sacrifice in performance and more complex system management. Some of these advantages and challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high performance computing (HPC) environment. We propose adding some virtual machine capability to already robust HPC environments for specific scenarios where the productivity gained outweighs the performance lost for using virtual machines. Finally, we discuss an implementation of augmenting virtual machines into the software stack of a HPC cluster, and we analyze the affect on job launch time of this implementation.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125046402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Use of CUDA for the Continuous Space Language Model
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408661
E. Thompson, Timothy R. Anderson
The training phase of the Continuous Space Language Model (CSLM) was implemented in NVIDIA's Compute Unified Device Architecture (CUDA). The implementation uses a combination of CUBLAS library routines and CUDA kernel calls on three CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach is demonstrated.
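The heavy lifting in CSLM training is dense matrix arithmetic in the projection and hidden layers, which is why CUBLAS routines carry most of the work. A minimal sketch of that pattern, a single-layer forward product H = W * X via cublasSgemm, follows; the dimensions, buffer names, and the absence of bias/activation are placeholder simplifications, not the paper's code:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// One dense-layer forward pass, H = W * X, via CUBLAS.
// Error checking is omitted for brevity; sizes are illustrative.
int main() {
    const int m = 256, k = 512, n = 128;  // H: m x n, W: m x k, X: k x n
    std::vector<float> hW(m * k, 0.01f), hX(k * n, 1.0f), hH(m * n);

    float *dW, *dX, *dH;
    cudaMalloc(&dW, sizeof(float) * m * k);
    cudaMalloc(&dX, sizeof(float) * k * n);
    cudaMalloc(&dH, sizeof(float) * m * n);
    cudaMemcpy(dW, hW.data(), sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX.data(), sizeof(float) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS is column-major: with W, X, H stored column-major this
    // computes H = alpha * W * X + beta * H directly.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dW, m, dX, k, &beta, dH, m);

    cudaMemcpy(hH.data(), dH, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dW); cudaFree(dX); cudaFree(dH);
}
```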
{"title":"Use of CUDA for the Continuous Space Language Model","authors":"E. Thompson, Timothy R. Anderson","doi":"10.1109/HPEC.2012.6408661","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408661","url":null,"abstract":"The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (CUDA). Implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three different CUDA enabled devices of varying compute capability and a time savings over the traditional CPU approach demonstrated.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123358292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multithreaded FPGA acceleration of DNA sequence mapping
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408669
Edward Fernandez, W. Najjar, S. Lonardi, J. Villarreal
In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions of short strings (called reads) against a reference genome. At the time of writing, a representative run requires matching tens of millions of reads, each about 100 symbols long, against a genome that can consist of a few billion characters. Short read aligners are expected to report all occurrences of each read and to let users bound the number of allowed mismatches between read and reference. Popular software implementations such as Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper, we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads that access memory simultaneously, and it consists of multiple parallel engines to exploit the parallelism available on an FPGA. We have implemented and tested FHAST on the Convey HC-1 [9], taking advantage of the system's large memory bandwidth and the shared memory image between hardware and software. Comparing FHAST against Bowtie on the Convey HC-1, we observed up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also favorably compare the rate of growth when expanding FHAST to multiple FPGAs against Bowtie on multiple CPUs.
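To make the workload concrete, here is a deliberately naive C++ scan that reports all occurrences of one read with at most k mismatches, the core operation being accelerated. It is illustrative only: Bowtie, BWA, and FHAST use indexed search rather than a linear scan, which is exactly why the linear version is hopeless at the scale described above:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Report every position where 'read' aligns to 'genome' with at most
// maxMismatch substitutions. This scan is O(|genome| * |read|) per read;
// with tens of millions of reads against billions of characters, indexed
// (and hardware-threaded) approaches are the only practical option.
std::vector<size_t> alignRead(const std::string& genome,
                              const std::string& read, int maxMismatch) {
    std::vector<size_t> hits;
    if (read.empty() || read.size() > genome.size()) return hits;
    for (size_t pos = 0; pos + read.size() <= genome.size(); ++pos) {
        int mism = 0;
        for (size_t i = 0; i < read.size() && mism <= maxMismatch; ++i)
            if (genome[pos + i] != read[i]) ++mism;
        if (mism <= maxMismatch) hits.push_back(pos);
    }
    return hits;
}

int main() {
    std::string genome = "ACGTACGTTAGC";
    for (size_t p : alignRead(genome, "ACGA", 1))  // 1 mismatch allowed
        std::cout << "hit at " << p << "\n";       // prints 0 and 4
}
```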
{"title":"Multithreaded FPGA acceleration of DNA sequence mapping","authors":"Edward Fernandez, W. Najjar, S. Lonardi, J. Villarreal","doi":"10.1109/HPEC.2012.6408669","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408669","url":null,"abstract":"In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions of short strings (called reads) against a reference genome. At the time of writing, a representative run requires to match tens of millions of reads of length of about 100 symbols against a genome that can consists of a few billion characters. Existing short read aligners are expected to report all the occurrences of each read as well as allow users to control the number of allowed mismatches between reads and reference genome. Popular software implementations such as Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper, we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads accessing memory simultaneously and consists of multiple parallel engines to exploit the parallelism available to us on an FPGA. We have implemented and tested FHAST on the Convey HC-1 [9], taking advantage of the large amount of memory bandwidth available to the system and the shared memory image between hardware and software. By comparing the performance of FHAST against Bowtie on the Convey HC-1 we observed up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also favorably compare the rate of growth when expanding FHAST to utilize multiple FPGAs against multiple CPUs in Bowtie.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123610296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient and scalable computations with sparse tensors
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408676
M. Baskaran, Benoît Meister, Nicolas Vasilache, R. Lethin
For applications that deal with large amounts of high-dimensional, multi-aspect data, it is natural to represent such data as tensors or multi-way arrays. Multi-linear algebraic computations such as tensor decompositions are performed to summarize and analyze such data, with real-world uses spanning domains such as signal processing, data mining, computer vision, and graph analysis. The major challenges in applying tensor decompositions to real-world applications are (1) dealing with large-scale, high-dimensional data and (2) dealing with sparse data. In this paper, we address both challenges. We describe new sparse tensor storage formats that provide storage benefits and are flexible and efficient for performing tensor computations. Further, we propose an optimization that improves data reuse and reduces redundant or unnecessary computation in tensor decomposition algorithms. Finally, we couple the data reuse optimization with the benefits of our sparse tensor storage formats to provide a memory-efficient, scalable solution for large-scale sparse tensor computations. We demonstrate improved performance and memory scalability using our techniques on both small synthetic data sets and large-scale sparse real data sets.
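For context, the baseline sparse tensor representation is the coordinate (COO) format: one index tuple plus a value per nonzero. The paper's formats improve on this baseline and are not reproduced here; the sketch below uses plain COO only to show the access pattern a storage format must serve, with a mode-3 tensor-times-vector as the example kernel (names and dimensions are invented for illustration):

```cpp
#include <iostream>
#include <vector>

// Coordinate (COO) storage for a 3-way sparse tensor: one (i, j, k, value)
// entry per nonzero, with no ordering or index compression.
struct Nonzero { int i, j, k; double val; };

// Mode-3 tensor-times-vector: Y(i, j) = sum_k X(i, j, k) * v(k).
// The result is a dense I x J matrix, stored row-major.
std::vector<double> ttv3(const std::vector<Nonzero>& X, int I, int J,
                         const std::vector<double>& v) {
    std::vector<double> Y(static_cast<size_t>(I) * J, 0.0);
    for (const auto& nz : X)   // one scattered update per nonzero
        Y[static_cast<size_t>(nz.i) * J + nz.j] += nz.val * v[nz.k];
    return Y;
}

int main() {
    // A 2 x 2 x 3 tensor with three nonzeros
    std::vector<Nonzero> X = {{0, 0, 0, 2.0}, {0, 1, 2, 4.0}, {1, 1, 1, 3.0}};
    std::vector<double> v = {1.0, 0.5, 0.25};
    auto Y = ttv3(X, 2, 2, v);
    std::cout << Y[0] << " " << Y[1] << " " << Y[2] << " " << Y[3] << "\n";
    // prints: 2 1 0 1.5
}
```

Compressed formats essentially reorganize this nonzero loop so that repeated index prefixes and output rows are reused rather than revisited, which is the kind of reuse the paper's optimizations target.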
{"title":"Efficient and scalable computations with sparse tensors","authors":"M. Baskaran, Benoît Meister, Nicolas Vasilache, R. Lethin","doi":"10.1109/HPEC.2012.6408676","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408676","url":null,"abstract":"For applications that deal with large amounts of high dimensional multi-aspect data, it becomes natural to represent such data as tensors or multi-way arrays. Multi-linear algebraic computations such as tensor decompositions are performed for summarization and analysis of such data. Their use in real-world applications can span across domains such as signal processing, data mining, computer vision, and graph analysis. The major challenges with applying tensor decompositions in real-world applications are (1) dealing with large-scale high dimensional data and (2) dealing with sparse data. In this paper, we address these challenges in applying tensor decompositions in real data analytic applications. We describe new sparse tensor storage formats that provide storage benefits and are flexible and efficient for performing tensor computations. Further, we propose an optimization that improves data reuse and reduces redundant or unnecessary computations in tensor decomposition algorithms. Furthermore, we couple our data reuse optimization and the benefits of our sparse tensor storage formats to provide a memory-efficient scalable solution for handling large-scale sparse tensor computations. We demonstrate improved performance and address memory scalability using our techniques on both synthetic small data sets and large-scale sparse real data sets.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132518036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimized parallel distribution load flow solver on commodity multi-core CPU
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408675
Tao Cui, F. Franchetti
Solving a large number of load flow problems quickly is required for Monte Carlo analysis and various power system problems, including long-term steady state simulation and system benchmarking, among others. Because of the computational burden, such applications are considered time-consuming and infeasible for online or real-time use. In this work we developed a high performance framework for high-throughput distribution load flow computation, taking advantage of the performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized data structures to better fit the memory hierarchy. We use the SPIRAL code generator to exploit inherent patterns in the load flow model through code specialization. We use SIMD instructions and multithreading to parallelize the solver. Finally, we designed a Monte Carlo thread scheduling infrastructure to enable real-time operation. The optimized solver achieves more than 50% of peak performance on an Intel Core i7 CPU, which translates to solving millions of load flow problems within a second for the IEEE 37-node test feeder.
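What makes the Monte Carlo setting attractive is that the many load flow cases are mutually independent, so the outer loop parallelizes trivially across cores. A minimal C++ sketch of that outer structure follows; `solveLoadFlow` is a placeholder stand-in, not the SPIRAL-generated, SIMD-vectorized solver the paper describes:

```cpp
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

// Placeholder for one load flow solve. The paper's inner solver is
// specialized and vectorized; this stand-in just iterates a fixed-point
// update so the loop has real work to do.
double solveLoadFlow(double loadScale) {
    double v = 1.0;
    for (int it = 0; it < 100; ++it)
        v = 0.5 * (v + loadScale / v);   // Newton iteration for sqrt(loadScale)
    return v;
}

int main() {
    const int scenarios = 1000000;       // Monte Carlo draws of the loads
    const unsigned nThreads =
        std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> result(scenarios);
    std::vector<std::thread> pool;

    // Scenarios are independent, so a static cyclic partition suffices.
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back([&result, t, nThreads, scenarios] {
            for (int s = static_cast<int>(t); s < scenarios;
                 s += static_cast<int>(nThreads))
                result[s] = solveLoadFlow(0.8 + 0.4 * s / scenarios);
        });
    for (auto& th : pool) th.join();
    std::cout << "sample result: " << result[0] << "\n";
}
```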
{"title":"Optimized parallel distribution load flow solver on commodity multi-core CPU","authors":"Tao Cui, F. Franchetti","doi":"10.1109/HPEC.2012.6408675","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408675","url":null,"abstract":"Solving a large number of load flow problems quickly is required for Monte Carlo analysis and various power system problems, including long term steady state simulation, system benchmarking, among others. Due to the computational burden, such applications are considered to be time-consuming, and infeasible for online or realtime application. In this work we developed a high performance framework for high throughput distribution load flow computation, taking advantage of performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized data structures to better fit the memory hierarchy. We use the SPIRAL code generator to exploit inherent patterns of the load flow model through code specizlization. We use SIMD instructions and multithreading to parallelize our solver. Finally, we designed a Monte Carlo thread scheduling infrastructure to enable real time operation. The optimized solver is able to achieve more than 50% of peak performance on a Intel Core i7 CPU, which translates to solving millions of load flow problems within a second for IEEE 37 test feeder.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128322026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CUDA and OpenCL implementations of 3D CT reconstruction for biomedical imaging
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408674
Saoni Mukherjee, Nicholas Moore, J. Brock, M. Leeser
Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphics Processing Units (GPUs) are particularly useful in this context, as they can produce high-fidelity images rapidly. We implement an algorithm that reconstructs cone-beam computed tomography (CT) from two-dimensional projections on GPUs. The implementation takes slices of the target, weights the projection data, and then filters the weighted data before backprojecting it to create the final three-dimensional reconstruction. This is implemented on two types of hardware: a CPU, and a heterogeneous system combining CPU and GPU. The CPU codes, in C and MATLAB, are compared with heterogeneous versions written in CUDA-C and OpenCL. Relative performance is tested and evaluated on a mathematical phantom as well as on mouse data.
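The weight-filter-backproject pipeline maps naturally onto GPUs because backprojection treats every voxel independently. The C++ skeleton below shows only the backprojection stage, with a parallel-beam-style voxel-to-detector mapping standing in for the real cone-beam geometry and with weighting and filtering assumed already applied; all names and dimensions are illustrative, not the paper's code:

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Accumulate one 2-D projection into a 3-D volume. In the CUDA/OpenCL
// versions, the triple voxel loop becomes one GPU thread per voxel.
struct Volume { int nx, ny, nz; std::vector<float> v; };

void backproject(Volume& vol, const std::vector<float>& proj,
                 int nu, int nv, float angle) {
    const float c = std::cos(angle), s = std::sin(angle);
    for (int z = 0; z < vol.nz; ++z)
        for (int y = 0; y < vol.ny; ++y)
            for (int x = 0; x < vol.nx; ++x) {
                // Map voxel (x, y) to a detector column for this view angle;
                // a real cone-beam code also applies distance weighting.
                int u = static_cast<int>((x - vol.nx / 2) * c +
                                         (y - vol.ny / 2) * s) + nu / 2;
                int w = z * nv / vol.nz;        // detector row for this slice
                if (u >= 0 && u < nu)
                    vol.v[(z * vol.ny + y) * vol.nx + x] += proj[w * nu + u];
            }
}

int main() {
    Volume vol{64, 64, 64, std::vector<float>(64 * 64 * 64, 0.0f)};
    std::vector<float> proj(128 * 128, 1.0f);   // one pre-filtered view
    for (int a = 0; a < 180; a += 10)           // accumulate over view angles
        backproject(vol, proj, 128, 128, a * 3.14159265f / 180.0f);
    std::cout << "center voxel: " << vol.v[(32 * 64 + 32) * 64 + 32] << "\n";
}
```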
{"title":"CUDA and OpenCL implementations of 3D CT reconstruction for biomedical imaging","authors":"Saoni Mukherjee, Nicholas Moore, J. Brock, M. Leeser","doi":"10.1109/HPEC.2012.6408674","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408674","url":null,"abstract":"Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphic Processing Units(GPUs) are particularly useful in this context as they can produce high fidelity images rapidly. An image algorithm to reconstruct conebeam computed tomography(CT) using two dimensional projections is implemented using GPUs. The implementation takes slices of the target, weighs the projection data and then filters the weighted data to backproject the data and create the final three dimensional construction. This is implemented on two types of hardware: CPU and a heterogeneous system combining CPU and GPU. The CPU codes in C and MATLAB are compared with the heterogeneous versions written in CUDA-C and OpenCL. The relative performance is tested and evaluated on a mathematical phantom as well as on mouse data.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124052431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}