Indexing Blocks to Reduce Space and Time Requirements for Searching Large Data Files
Tzu-Hsien Wu, Hao Shyng, J. Chou, Bin Dong, Kesheng Wu
Scientific discoveries increasingly rely on the analysis of massive amounts of data generated from scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate selected data records, the time and space required to build and store these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access an entire block or page of data. In this work, we postulate that indexing blocks instead of individual data records can significantly reduce index size and index building time without increasing the I/O time for accessing the selected data records. Our experiments with multiple real datasets on a supercomputer show that the block index reduces query time by a factor of 2 to 50 over other existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and index construction can proceed at the peak I/O speed.
DOI: 10.1109/CCGrid.2016.18
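To make the block-indexing idea concrete, below is a minimal sketch of a min/max block index (zone-map style) in Python. It illustrates the general technique, not the data structure or I/O path used in the paper; the block size and function names are arbitrary.

```python
import numpy as np

def build_block_index(data, block_size=4096):
    """Record the min and max of each fixed-size block (zone-map style).

    Illustrative only; the paper's block index and its I/O path are more
    involved than two floats per block.
    """
    n_blocks = (len(data) + block_size - 1) // block_size
    index = np.empty((n_blocks, 2))
    for b in range(n_blocks):
        block = data[b * block_size:(b + 1) * block_size]
        index[b] = (block.min(), block.max())
    return index

def query(data, index, lo, hi, block_size=4096):
    """Return the records with values in [lo, hi], reading only the blocks
    whose [min, max] range overlaps the query range."""
    hits = []
    for b, (bmin, bmax) in enumerate(index):
        if bmax < lo or bmin > hi:
            continue  # the block cannot contain any matching record
        block = data[b * block_size:(b + 1) * block_size]
        hits.append(block[(block >= lo) & (block <= hi)])
    return np.concatenate(hits) if hits else np.empty(0)

# Example: the index holds two values per block, so it stays tiny
# relative to the data it covers.
data = np.random.default_rng(0).normal(size=1_000_000)
idx = build_block_index(data)
print(query(data, idx, 3.0, 3.5))
```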
Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics
Bogdan Nicolae, Carlos H. A. Costa, Claudia Misale, K. Katrinis, Yoonho Park
Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance, and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand, it is a key part of the computation with a major impact on overall performance and scalability, so its efficiency is paramount; on the other hand, it needs to operate with scarce memory in order to leave as much memory as possible available for data caching. In this context, scheduling data transfers so that both dimensions of the problem are addressed simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield suboptimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we outline as a series of design principles.
DOI: 10.1109/CCGrid.2016.85
Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era
Omer Subasi, S. Di, L. Bautista-Gomez, Prasanna Balaprakash, O. Unsal, Jesús Labarta, A. Cristal, F. Cappello
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems, combined with targeted power and energy budget goals, introduces significant reliability challenges. Silent data corruptions (SDCs), or silent errors, are one of the major sources of corrupted execution results in HPC applications that go undetected. In this work, we explore a low-memory-overhead SDC detector that leverages epsilon-insensitive support vector machine regression to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) as training data, so that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study of the detection ability and performance under different parameters, and we carefully optimize the detection range. (3) Experiments with eight real-world HPC applications show that our detector achieves detection sensitivity (i.e., recall) of up to 99% while incurring a false positive rate below 1% in most cases. Our detector incurs low performance overhead, 5% on average, across all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best trade-off between detection ability and overhead.
DOI: 10.1109/CCGrid.2016.33
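The abstract describes epsilon-insensitive support vector regression over spatial (neighbouring-value) features. The following is a minimal 1-D sketch of that idea, assuming scikit-learn; the feature construction, training schedule, and error bound here are illustrative stand-ins, not the paper's actual detector.

```python
import numpy as np
from sklearn.svm import SVR

def spatial_features(snapshot):
    """Use each point's left and right neighbours as features and the point
    itself as the regression target (1-D snapshot, edges dropped)."""
    left, right = np.roll(snapshot, 1), np.roll(snapshot, -1)
    X = np.column_stack([left, right])
    return X[1:-1], snapshot[1:-1]

def detect_sdc(snapshot, error_bound, epsilon=0.01):
    """Flag points whose value deviates from the epsilon-insensitive SVR
    prediction by more than the given error bound (illustrative threshold)."""
    X, y = spatial_features(snapshot)
    model = SVR(kernel="rbf", epsilon=epsilon).fit(X, y)
    residual = np.abs(model.predict(X) - y)
    return np.where(residual > error_bound)[0] + 1  # indices in the snapshot

# Example: inject one corruption into a smooth field and look for it.
field = np.sin(np.linspace(0.0, 4.0 * np.pi, 500))
field[137] += 0.5
print(detect_sdc(field, error_bound=0.1))
```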
A Hybrid Simulation Model for Data Grids
M. Barisits, E. Kühn, M. Lassnig
Data grids are used in large-scale scientific experiments to access and store nontrivial amounts of data by combining the storage resources of multiple data centers in one system. This enables users and automated services to use the storage resources in a common and efficient way. However, as data grids grow, it becomes hard for developers and operators to estimate how modifications in policy, hardware, and software affect the performance metrics of the data grid. In this paper we address the modeling of operational data grids. We first analyze the data grid middleware of the ATLAS experiment at the Large Hadron Collider to identify the components relevant to data grid performance. We describe existing modeling approaches for the pre-transfer, network, storage, and validation components, and build black-box models for these components. We then present a novel hybrid model that unifies these separate component models, and we evaluate the model using an event simulator. The evaluation is based on historic workloads extracted from the ATLAS data grid. The median evaluation error of the hybrid model is 22%.
DOI: 10.1109/CCGrid.2016.36
Management of Distributed Big Data for Social Networks
C. Leung, Hao Zhang
In the current era of Big Data, high volumes of a wide variety of valuable data can easily be collected and generated from a broad range of data sources of different veracities at high velocity. Because of the well-known 5V's of Big Data, many traditional data management approaches may not be suitable for handling it. Over the past few years, several applications and systems have been developed that use cluster, cloud, or grid computing to manage Big Data in support of data science, Big Data analytics, and knowledge discovery and data mining. In this paper, we focus on distributed Big Data management. Specifically, we present our method for representing and managing distributed Big Data from social networks. We represent such big graph data in distributed settings so as to support the mining of frequently occurring patterns from social networks.
DOI: 10.1109/CCGrid.2016.107
Sensor Data Air Pollution Prediction by Kernel Models
P. Vidnerová, Roman Neruda
Kernel-based neural networks are a popular machine learning approach with many successful applications. Regularization networks represent a special subclass with a solid theoretical background and a variety of learning possibilities. In this paper, we focus on single- and multi-kernel units. In particular, we describe the architecture of a product unit network and an evolutionary learning algorithm that sets its parameters, including the choice of kernels from a dictionary and the optimal split of inputs into individual products. The approach is tested on real-world data from the calibration of air-pollution sensor networks, and its performance is compared to that of several other regression tools.
DOI: 10.1109/CCGrid.2016.80
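Read literally, the abstract suggests units that multiply kernels evaluated on disjoint subsets of the inputs, with a weighted sum of such units as the network output. The sketch below follows that reading; the kernel dictionary, the split representation, and all parameter values are hypothetical, and the evolutionary learning of these parameters is not shown.

```python
import numpy as np

# Hypothetical kernel dictionary; the paper selects kernels per unit by evolution.
KERNELS = {
    "gaussian": lambda d, w: np.exp(-(d / w) ** 2),
    "laplace": lambda d, w: np.exp(-np.abs(d) / w),
}

def product_unit(x, split, centers, widths, kernels):
    """One multi-kernel unit: a product of kernels, each evaluated on its own
    subset of the input dimensions (the 'split')."""
    value = 1.0
    for subset, c, w, k in zip(split, centers, widths, kernels):
        d = np.linalg.norm(x[subset] - c)
        value *= KERNELS[k](d, w)
    return value

def network_output(x, units, weights):
    """Weighted sum of product units, in the spirit of regularization networks."""
    return sum(w * product_unit(x, *u) for w, u in zip(weights, units))

# Example: a 4-dimensional sensor reading fed to two hand-picked units.
x = np.array([0.2, 1.0, -0.3, 0.7])
units = [
    (([0, 1], [2, 3]), [np.zeros(2), np.zeros(2)], [1.0, 0.5], ["gaussian", "laplace"]),
    (([0, 1, 2, 3],), [np.zeros(4)], [2.0], ["gaussian"]),
]
weights = [0.8, -0.3]
print(network_output(x, units, weights))
```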
Faster: A Low Overhead Framework for Massive Data Analysis
Matheus Santos, Wagner Meira Jr, D. Guedes, Virgílio A. F. Almeida
With the recent accelerated increase in the amount of social data available on the Internet, several distributed big data processing frameworks have been proposed and implemented. Hadoop has been widely used to process all kinds of data, not only from social media. Spark is gaining popularity by offering a more flexible, object-functional programming interface, and also by improving performance in many cases. However, not all data analysis algorithms perform well on Hadoop or Spark. For instance, graph algorithms tend to generate large amounts of messages between processing elements, which may result in poor performance even in Spark. We introduce Faster, a low-latency distributed processing framework designed to exploit data locality to reduce processing costs in such algorithms. It offers an API similar to Spark's, but with a slightly different execution model and new operators. Our results show that it can significantly outperform Spark on large graphs, being up to one order of magnitude faster when running PageRank on a partial Google+ friendship graph with more than one billion edges.
DOI: 10.1109/CCGrid.2016.90
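Faster's API is only characterized here as Spark-like, so no attempt is made to reproduce it. As context for why graph workloads stress such frameworks, the plain PageRank iteration below sends one contribution per edge per iteration, which is exactly the communication pattern a distributed engine must shuffle.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=20):
    """Plain PageRank by power iteration over an edge list. Every iteration
    effectively sends one contribution per edge, which is the message volume
    a distributed engine has to shuffle. Dangling nodes are ignored here."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        contrib = [0.0] * num_nodes
        for src, dst in edges:
            contrib[dst] += rank[src] / out_degree[src]
        rank = [(1.0 - damping) / num_nodes + damping * c for c in contrib]
    return rank

# Tiny example graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
print(pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], num_nodes=3))
```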
Towards Fast Overlapping Community Detection
I. El-Helw, Rutger F. H. Hofman, H. Bal
Accelerating sequential algorithms to achieve high performance is often a nontrivial task, and certain properties can complicate the process and make it particularly daunting. For example, building an efficient parallel solution for a data-intensive algorithm requires a deep analysis of the memory access patterns and data reuse potential. Attempting to scale out the computation on clusters of machines introduces further complications due to network speed limitations. In this context, the optimization landscape can be extremely complex owing to the large number of trade-off decisions. In this paper, we discuss our experience designing two parallel implementations of an existing data-intensive machine learning algorithm that detects overlapping communities in graphs. The first design uses a single GPU to accelerate the computation for small data sets; we employed a code generation strategy to test and identify the best-performing combination of optimizations. The second design uses a cluster of machines to scale out the computation for larger problem sizes; we used a mixture of MPI, RDMA, and pipelining to circumvent networking overhead. Both efforts bring us closer to understanding the complex relationships hidden within networks of entities.
DOI: 10.1109/CCGrid.2016.98
Online Power Estimation of Graphics Processing Units
Vignesh Adhinarayanan, Balaji Subramaniam, Wu-chun Feng
Accurate power estimation at runtime is essential for the efficient functioning of a power management system. While years of research have yielded accurate power models for the online prediction of instantaneous power for CPUs, such power models for graphics processing units (GPUs) are lacking. GPUs rely on low-resolution power meters that only nominally support basic power management. To address this, we propose an instantaneous power model, and in turn a power estimator, that uses performance counters in a novel way to deliver accurate power estimation at runtime. Our power estimator runs on two real NVIDIA GPUs, showing that accurate runtime estimation is possible without the high-fidelity details assumed by simulation-based power models. To construct our power model, we first use correlation analysis to identify a concise set of performance counters that work well despite GPU device limitations. Next, we explore several statistical regression techniques and identify the best one. Then, to improve prediction accuracy, we propose a novel application-dependent modeling technique, where the model is constructed online at runtime based on readings from a low-resolution, built-in GPU power meter. Our quantitative results show that a multi-linear model, which produces a mean absolute error of 6%, works best in practice. An application-specific quadratic model reduces the error to nearly 1%. We show that this model can be constructed at runtime with low overhead and high accuracy. To the best of our knowledge, this is the first work to model the instantaneous power of a real GPU system; earlier related work focused on average power.
DOI: 10.1109/CCGrid.2016.93
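As a toy illustration of the multi-linear modeling step (a least-squares fit of power against a handful of performance-counter rates), the sketch below uses synthetic data; the counter names are made up, and the paper's counter selection via correlation analysis and its online application-specific quadratic refinement are not reproduced.

```python
import numpy as np

# Synthetic stand-in for training data: each row holds a few performance-counter
# rates, with the matching power reading from the GPU's built-in power meter.
# The counters implied here (e.g. SM activity, DRAM utilization) are made up.
rng = np.random.default_rng(1)
counters = rng.uniform(0.0, 1.0, size=(200, 3))
power = 50.0 + counters @ np.array([120.0, 45.0, 10.0]) + rng.normal(0.0, 2.0, 200)

# Multi-linear model: power ~ b0 + b1*c1 + b2*c2 + b3*c3, fit by least squares.
X = np.column_stack([np.ones(len(counters)), counters])
coeffs, *_ = np.linalg.lstsq(X, power, rcond=None)

predicted = X @ coeffs
mape = np.mean(np.abs(predicted - power) / power) * 100.0
print(f"coefficients: {coeffs}")
print(f"mean absolute percentage error: {mape:.1f}%")
```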
cuART: Fine-Grained Algebraic Reconstruction Technique for Computed Tomography Images on GPUs
Xiaodong Yu, Hao Wang, Wu-chun Feng, H. Gong, Guohua Cao
The algebraic reconstruction technique (ART) is an iterative algorithm for computed tomography (CT) image reconstruction. Due to its high computational cost, researchers turn to modern HPC systems with GPUs to accelerate the ART algorithm. However, existing proposals suffer from inefficient designs of the compressed data structure and computational kernel on GPUs. In this paper, we identify the computational pattern in ART as the product of a sparse matrix (and its transpose) with multiple vectors (SpMV and SpMV_T). Because implementations based on well-tuned libraries, including cuSPARSE, BRC, and CSR5, underperform expectations, we propose cuART, a complete compression and parallelization solution for ART-based CT on GPUs. Based on the physical characteristics, i.e., the symmetries in the system matrix, we propose the symmetry-based CSR format (SCSR), which further compresses data storage by removing symmetric but redundant non-zero elements. Leveraging the sparsity patterns of the X-ray projections, we transform the CSR format into multiple dense sub-matrices in SCSR. We then design a transposition-free kernel to optimize data access for both SpMV and SpMV_T. The experimental results show that our mechanism significantly reduces memory usage and makes practical datasets fit into a single GPU. Our results also show the superior performance of cuART compared to existing methods on CPUs and GPUs.
DOI: 10.1109/CCGrid.2016.96
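The core computational pattern named in the abstract, SpMV and SpMV_T over a CSR matrix, can be illustrated with a plain Python sketch; cuART's SCSR format and GPU kernels go far beyond this, so the code below only shows why the transpose product has a scattered write pattern that motivates a transposition-free design.

```python
import numpy as np

def spmv(indptr, indices, data, x):
    """y = A @ x for a CSR matrix: each row's non-zeros are read contiguously."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

def spmv_t(indptr, indices, data, x, num_cols):
    """y = A.T @ x using the same CSR arrays, i.e. without materializing the
    transpose: rows are still read in order, but writes scatter across y."""
    y = np.zeros(num_cols)
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            y[indices[k]] += data[k] * x[row]
    return y

# 3x4 example matrix in CSR form.
indptr = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 3])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(spmv(indptr, indices, data, np.ones(4)))
print(spmv_t(indptr, indices, data, np.ones(3), num_cols=4))
```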