Vinícius Dias, R. Moreira, Wagner Meira Jr, D. Guedes
The increasing amount of data being stored and the variety of applications recently proposed to make use of those data have enabled a whole new generation of parallel programming environments and paradigms. Although most of these novel environments provide abstract programming interfaces and embed several run-time strategies that simplify many typical tasks in parallel and distributed systems, achieving good performance is still a challenge. In this paper we identify some common sources of performance degradation in the Spark programming environment and discuss some diagnosis dimensions that can be used to better understand such degradation. We then describe our experience using those dimensions to drive the identification of performance problems, and suggest how their impact may be minimized in real applications.
"Diagnosing Performance Bottlenecks in Massive Data Parallel Programs," Vinícius Dias, R. Moreira, Wagner Meira Jr, D. Guedes. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.81.
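The abstract does not enumerate the paper's diagnosis dimensions. As an illustration only, task-time imbalance within a stage is one commonly inspected symptom of Spark performance degradation; a minimal sketch of such a metric (all names are hypothetical, not the paper's):

```python
def imbalance(task_durations):
    """Ratio of the slowest task to the mean task time within a stage.

    A value near 1.0 means balanced work; large values point to skew,
    a common source of performance degradation in Spark stages.
    """
    mean = sum(task_durations) / len(task_durations)
    return max(task_durations) / mean

# A skewed stage: one straggler task dominates the stage's runtime.
print(imbalance([10, 11, 9, 10, 60]))  # -> 3.0
```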
Most research has proposed scalable and parallel analytic algorithms that work outside a DBMS. On the other hand, R has become a very popular system for machine learning analysis, but it is limited by main memory and single-threaded processing. Recently, novel columnar DBMSs have been shown to provide orders-of-magnitude improvements in SQL query processing speed while preserving the parallel speedup of row-based parallel DBMSs. With that motivation in mind, we present COLUMNAR, a system integrating a parallel columnar DBMS and R that can directly compute models on large data sets stored as relational tables. Our algorithms are based on a combination of SQL queries, user-defined functions (UDFs) and R calls, where SQL queries and UDFs compute data set summaries that are sent to R to compute models in RAM. Since our hybrid algorithms exploit the DBMS for the most demanding computations involving the data set, they show linear scalability and are highly parallel. Our algorithms generally require one pass over the data set, or otherwise a few passes (i.e., fewer passes than traditional methods). Our system can analyze data sets faster than R even when they fit in RAM, and it also eliminates R's memory limitations when data sets exceed RAM size. Moreover, it is an order of magnitude faster than Spark (a prominent Hadoop system) and a traditional row-based DBMS.
"Big Data Analytics Integrating a Parallel Columnar DBMS and the R Language," Yiqun Zhang, C. Ordonez, Wellington Cabrera. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.94.
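The core idea — compute a compact data-set summary inside the DBMS in one pass, then derive the model from the summary alone — can be sketched in Python. NumPy stands in for both the SQL/UDF side and the R side here; the particular summary shown (the cross-product matrix of the augmented data) is an assumption, since the abstract does not name the exact summaries used:

```python
import numpy as np

def summarize(X, y):
    # One pass over the data: the summary is just the cross-product
    # matrix Q and vector b of the intercept-augmented data set.
    Xa = np.column_stack([np.ones(len(X)), X])
    Q = Xa.T @ Xa          # (d+1) x (d+1), tiny compared to the data
    b = Xa.T @ y           # (d+1,) vector
    return Q, b

def linreg_from_summary(Q, b):
    # Model coefficients come from the summary alone -- no second
    # pass over the (potentially huge) data set is needed.
    return np.linalg.solve(Q, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.01, size=1000)
Q, bvec = summarize(X, y)
beta = linreg_from_summary(Q, bvec)
print(np.round(beta, 2))  # ≈ [ 3.  2. -1.]
```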
Data grids are used in large-scale scientific experiments to access and store nontrivial amounts of data by combining the storage resources from multiple data centers in one system. This enables users and automated services to use the storage resources in a common and efficient way. However, as data grids grow it becomes a hard problem for developers and operators to estimate how modifications in policy, hardware, and software affect the performance metrics of the data grid. In this paper we address the modeling of operational data grids. We first analyze the data grid middleware system of the ATLAS experiment at the Large Hadron Collider to identify components relevant to data grid performance. We describe existing modeling approaches for pre-transfer, network, storage, and validation components, and build black-box models for these components. We then present a novel hybrid model, which unifies these separate component models, and we evaluate it using an event simulator. The evaluation is based on historic workloads extracted from the ATLAS data grid. The median evaluation error of the hybrid model is 22%.
"A Hybrid Simulation Model for Data Grids," M. Barisits, E. Kühn, M. Lassnig. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.36.
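The hybrid model's structure — independent black-box component models whose outputs are unified into one transfer-time estimate, validated by median relative error against observed transfers — can be sketched as follows. All component formulas below are hypothetical placeholders, not the paper's fitted models:

```python
# Hypothetical black-box component models for one file transfer (seconds).
def pretransfer(queue_len):
    return 0.5 + 0.01 * queue_len      # submission/queueing delay

def network(size_mb, bw_mbps):
    return size_mb / bw_mbps           # wire time

def storage(size_mb):
    return 0.2 + 0.002 * size_mb       # destination write time

def hybrid_estimate(queue_len, size_mb, bw_mbps):
    # The hybrid model unifies the per-component estimates.
    return pretransfer(queue_len) + network(size_mb, bw_mbps) + storage(size_mb)

def median_relative_error(estimates, observed):
    errs = sorted(abs(e - o) / o for e, o in zip(estimates, observed))
    n = len(errs)
    return errs[n // 2] if n % 2 else 0.5 * (errs[n // 2 - 1] + errs[n // 2])

est = [hybrid_estimate(10, 500, 100), hybrid_estimate(0, 100, 100)]
obs = [7.0, 2.0]                        # made-up observed transfer times
print(round(median_relative_error(est, obs), 3))  # -> 0.039
```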
Omer Subasi, S. Di, L. Bautista-Gomez, Prasanna Balaprakash, O. Unsal, Jesús Labarta, A. Cristal, F. Cappello
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant reliability challenges. Silent data corruptions (SDCs), or silent errors, are one of the major sources of corrupted execution results in HPC applications that go undetected. In this work, we explore a low-memory-overhead SDC detector that leverages epsilon-insensitive support vector machine regression to detect SDCs occurring in HPC applications whose impact can be characterized by an error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into the training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study of the detection ability and performance under different parameters, and we carefully optimize the detection range. (3) Experiments with eight real-world HPC applications show that our detector can achieve detection sensitivity (i.e., recall) of up to 99% while suffering a false positive rate of less than 1% in most cases. Our detector incurs low performance overhead, 5% on average, across all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff between detection ability and overhead.
"Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era," Omer Subasi, S. Di, L. Bautista-Gomez, Prasanna Balaprakash, O. Unsal, Jesús Labarta, A. Cristal, F. Cappello. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.33.
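The detection scheme — predict each data point from its spatial neighbours and flag points whose residual exceeds the impact error bound — can be sketched simply. Note the swap: a plain neighbour average stands in below for the paper's epsilon-insensitive SVR predictor, so this only illustrates the spatial-feature idea:

```python
import numpy as np

def detect_sdc(snapshot, error_bound):
    # Predict each interior point from its two spatial neighbours.
    # (The paper trains an epsilon-insensitive SVR on such neighbour
    # features; a plain average stands in for the learned model here.)
    pred = 0.5 * (snapshot[:-2] + snapshot[2:])
    resid = np.abs(snapshot[1:-1] - pred)
    # Indices of points whose residual exceeds the impact error bound.
    return np.flatnonzero(resid > error_bound) + 1

field = np.sin(np.linspace(0, np.pi, 100))  # a smooth 1-D snapshot
field[40] += 0.5                            # inject a silent corruption
print(detect_sdc(field, 0.3))               # -> [40]
```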
Matheus Santos, Wagner Meira Jr, D. Guedes, Virgílio A. F. Almeida
With the recent accelerated increase in the amount of social data available on the Internet, several big data distributed processing frameworks have been proposed and implemented. Hadoop has been used widely to process all kinds of data, not only from social media. Spark is gaining popularity by offering a more flexible, object-functional programming interface, and also by improving performance in many cases. However, not all data analysis algorithms perform well on Hadoop or Spark. For instance, graph algorithms tend to generate large amounts of messages between processing elements, which may result in poor performance even in Spark. We introduce Faster, a low-latency distributed processing framework designed to exploit data locality to reduce processing costs in such algorithms. It offers an API similar to Spark's, but with a slightly different execution model and new operators. Our results show that it can significantly outperform Spark on large graphs, being up to one order of magnitude faster when running PageRank on a partial Google+ friendship graph with more than one billion edges.
"Faster: A Low Overhead Framework for Massive Data Analysis," Matheus Santos, Wagner Meira Jr, D. Guedes, Virgílio A. F. Almeida. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.90.
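For reference, the PageRank workload used in the evaluation is plain power iteration; a framework-free sketch (assumes every vertex has at least one outgoing edge, i.e., no dangling nodes, for brevity):

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Power-iteration PageRank over an edge list of n vertices.

    Each iteration, a vertex keeps (1-d)/n base rank and receives
    d * rank(u)/outdeg(u) from each in-neighbour u. Assumes no
    dangling vertices (every vertex has outgoing edges).
    """
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for u, v in edges:
            new[v] += d * rank[u] / out_deg[u]
        rank = new
    return rank

# A tiny 3-vertex graph; ranks sum to 1 when no vertex is dangling.
print(pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], 3))
```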
In the current era of Big Data, high volumes of a wide variety of valuable data can be easily collected and generated from a broad range of data sources of different veracities at a high velocity. Due to the well-known 5V's of these Big Data, many traditional data management approaches may not be suitable for handling them. Over the past few years, several applications and systems have been developed that use cluster, cloud or grid computing to manage Big Data so as to support data science, Big Data analytics, as well as knowledge discovery and data mining. In this paper, we focus on distributed Big Data management. Specifically, we present our method for Big Data representation and management of distributed Big Data from social networks. We represent such big graph data in distributed settings so as to support big data mining of frequently occurring patterns from social networks.
"Management of Distributed Big Data for Social Networks," C. Leung, Hao Zhang. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.107.
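In its simplest form, mining frequently occurring patterns from a graph spread over partitions reduces to merging local pattern counts and filtering by a global support threshold. A minimal sketch — the pattern representation here (plain edge tuples) is an assumption, since the abstract does not specify it:

```python
from collections import Counter

def frequent_patterns(partitions, minsup):
    # Each partition reports local pattern counts; merging the
    # counters gives global support, then filter by the threshold.
    total = Counter()
    for part in partitions:
        total.update(part)          # part is an iterable of patterns
    return {p: c for p, c in total.items() if c >= minsup}

# Edges of a social graph split across two partitions.
p1 = [("a", "b"), ("a", "b"), ("b", "c")]
p2 = [("a", "b"), ("b", "c"), ("c", "d")]
print(frequent_patterns([p1, p2], minsup=2))  # -> {('a', 'b'): 3, ('b', 'c'): 2}
```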
Accelerating sequential algorithms in order to achieve high performance is often a nontrivial task. However, there are certain properties that can complicate this process and make it particularly daunting. For example, building an efficient parallel solution for a data-intensive algorithm requires a deep analysis of the memory access patterns and data reuse potential. Attempting to scale out the computations on clusters of machines introduces further complications due to network speed limitations. In this context, the optimization landscape can be extremely complex owing to the large number of trade-off decisions. In this paper, we discuss our experience designing two parallel implementations of an existing data-intensive machine learning algorithm that detects overlapping communities in graphs. The first design uses a single GPU to accelerate the computations on small data sets. We employed a code generation strategy in order to test and identify the best-performing combination of optimizations. The second design uses a cluster of machines to scale out the computations for larger problem sizes. We used a mixture of MPI, RDMA and pipelining in order to circumvent networking overhead. Both these efforts bring us closer to understanding the complex relationships hidden within networks of entities.
"Towards Fast Overlapping Community Detection," I. El-Helw, Rutger F. H. Hofman, H. Bal. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.98.
Kernel-based neural networks are a popular machine learning approach with many successful applications. Regularization networks represent a special subclass of them with a solid theoretical background and a variety of learning possibilities. In this paper we focus on single- and multi-kernel units; in particular, we describe the architecture of a product unit network and an evolutionary learning algorithm for setting its parameters, including different kernels from a dictionary and the optimal split of inputs into individual products. The approach is tested on real-world data from the calibration of air-pollution sensor networks, and its performance is compared to several different regression tools.
"Sensor Data Air Pollution Prediction by Kernel Models," P. Vidnerová, Roman Neruda. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.80.
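A regularization network with a single Gaussian kernel — the base architecture that multi-kernel and product units extend — reduces to solving one regularized linear system. A minimal sketch; the kernel choice and parameters are illustrative, and the paper's evolutionary learning of kernels is omitted:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian kernel matrix between two point sets
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_regularization_network(X, y, gamma=1.0, lam=1e-6):
    # Regularization-network weights: solve (K + lam*I) alpha = y
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Fit a smooth 1-D target; small lam gives near-interpolation.
X = np.linspace(0, 3, 20)[:, None]
y = np.sin(2 * X[:, 0])
alpha = fit_regularization_network(X, y)
print(float(np.max(np.abs(predict(X, alpha, X) - y))))  # near zero on training data
```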
Accurate power estimation at runtime is essential for the efficient functioning of a power management system. While years of research have yielded accurate power models for the online prediction of instantaneous power for CPUs, such power models for graphics processing units (GPUs) are lacking. GPUs rely on low-resolution power meters that only nominally support basic power management. To address this, we propose an instantaneous power model, and in turn, a power estimator, that uses performance counters in a novel way so as to deliver accurate power estimation at runtime. Our power estimator runs on two real NVIDIA GPUs to show that accurate runtime estimation is possible without the need for the high-fidelity details that are assumed in simulation-based power models. To construct our power model, we first use correlation analysis to identify a concise set of performance counters that work well despite GPU device limitations. Next, we explore several statistical regression techniques and identify the best one. Then, to improve the prediction accuracy, we propose a novel application-dependent modeling technique, where the model is constructed online at runtime, based on readings from a low-resolution, built-in GPU power meter. Our quantitative results show that a multi-linear model, which produces a mean absolute error of 6%, works best in practice. An application-specific quadratic model reduces the error to nearly 1%. We show that this model can be constructed with low overhead and high accuracy at runtime. To the best of our knowledge, this is the first work attempting to model the instantaneous power of a real GPU system; earlier related work focused on average power.
"Online Power Estimation of Graphics Processing Units," Vignesh Adhinarayanan, Balaji Subramaniam, Wu-chun Feng. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.93.
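The multi-linear model has the form power ≈ w0 + w · counters, fitted by least squares on counter readings paired with power-meter samples. A sketch on synthetic data (the counter values, their number, and the coefficients are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-sample performance-counter readings (columns = counters).
counters = rng.uniform(0, 1, size=(200, 3))
# Synthetic "measured" power with a linear dependence plus meter noise (W).
power = (40 + 30 * counters[:, 0] + 15 * counters[:, 1] + 5 * counters[:, 2]
         + rng.normal(scale=1.0, size=200))

# Multi-linear model: power ≈ w0 + w · counters, fit by least squares.
A = np.column_stack([np.ones(len(counters)), counters])
w, *_ = np.linalg.lstsq(A, power, rcond=None)

est = A @ w
mape = np.mean(np.abs(est - power) / power) * 100   # mean absolute % error
print(np.round(w, 1), round(float(mape), 1))
```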
Xiaodong Yu, Hao Wang, Wu-chun Feng, H. Gong, Guohua Cao
Algebraic reconstruction technique (ART) is an iterative algorithm for computed tomography (CT) image reconstruction. Due to the high computational cost, researchers turn to modern HPC systems with GPUs to accelerate the ART algorithm. However, the existing proposals suffer from inefficient designs of the compressed data structure and computational kernel on GPUs. In this paper, we identify the computational patterns in ART as the product of a sparse matrix (and its transpose) with multiple vectors (SpMV and SpMV_T). Because implementations with well-tuned libraries, including cuSPARSE, BRC, and CSR5, underperform expectations, we propose cuART, a complete compression and parallelization solution for ART-based CT on GPUs. Based on the physical characteristics, i.e., the symmetries in the system matrix, we propose the symmetry-based CSR format (SCSR), which can further compress data storage by removing symmetric but redundant non-zero elements. Leveraging the sparsity patterns of the X-ray projection, we transform the CSR format to multiple dense sub-matrices in SCSR. We then design a transposition-free kernel to optimize the data access for both SpMV and SpMV_T. The experimental results illustrate that our mechanism can reduce memory usage significantly and make practical datasets fit into a single GPU. Our results also illustrate the superior performance of cuART compared to the existing methods on CPU and GPU.
"cuART: Fine-Grained Algebraic Reconstruction Technique for Computed Tomography Images on GPUs," Xiaodong Yu, Hao Wang, Wu-chun Feng, H. Gong, Guohua Cao. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016. DOI: 10.1109/CCGrid.2016.96.
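For reference, plain CSR SpMV — the baseline operation that cuART's SCSR format and transposition-free kernel improve on — looks like this (a scalar Python sketch, not the GPU kernel):

```python
def spmv_csr(vals, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    vals holds the non-zeros row by row, col_idx their column
    indices, and row_ptr[r]:row_ptr[r+1] delimits row r's entries.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] in CSR form:
vals, col_idx, row_ptr = [1, 2, 3, 4, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5]
print(spmv_csr(vals, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # -> [3.0, 3.0, 9.0]
```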