Federated Campus Cloud Colombian Initiative
César O. Díaz, Carlos E. Gómez, Harold E. Castro, Carlos J. Barrios, H. Bolívar
The desktop cloud paradigm arises from combining cloud computing with volunteer computing systems in order to harvest the idle computational resources of volunteers' computers. Students usually underuse university computer rooms, so a desktop cloud can be seen as a form of high performance computing (HPC) at low cost. When the capacity of a single desktop cloud is insufficient to execute an HPC project, a new opportunity for collaborative work among universities appears: federating desktop cloud systems to create a significant pool of virtual resources from multiple providers on non-dedicated infrastructure. Even though cloud federation is an active research topic today, neither interoperability among different implementations of cloud computing nor the federation of desktop clouds is a resolved issue. Our initiative therefore gathers the existing, idle computing resources provided by the participating universities to form a cloud federation on non-dedicated infrastructure.
{"title":"Federated Campus Cloud Colombian Initiative","authors":"César O. Díaz, Carlos E. Gómez, Harold E. Castro, Carlos J. Barrios, H. Bolívar","doi":"10.1109/CCGrid.2016.48","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.48","url":null,"abstract":"Desktop cloud paradigm arises from combining cloud computing with volunteer computing systems in order to harvest the idle computational resources of volunteers' computers Students usually underuse university computer rooms. As a result, a desktop cloud can be seen as a form of high performance computing (HPC) at a low cost. When the capacity of a desktop cloud is insufficient to execute a HPC project, a new opportunity for collaborative work among universities appears, resulting in a federation of desktop cloud systems to create a significant amount of virtual resources from multiple providers on non-dedicated infrastructure. Even though cloud federation generates research activity today, neither interoperability among several implementations of cloud computing nor the federation of desktop clouds are resolved issues. Therefore, our initiative is related to gathering the existing and idle computer resources provided by the universities that take part to form a cloud federation on non-dedicated infrastructure.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130654656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Service Level Agreement Assurance between Cloud Services Providers and Cloud Customers
A. A. Ibrahim, D. Kliazovich, P. Bouvry
Cloud service providers deliver services to customers on a pay-per-use model, while the quality of the provided services is defined in service level agreements (SLAs). Unfortunately, no standard mechanism exists to verify and assure, in an automatic way, that delivered services satisfy the signed SLA; there is no guarantee in terms of quality, and cloud applications expose many performance metrics. In this doctoral thesis, we propose a framework for SLA assurance that can be used by both cloud providers and cloud users. Within the proposed framework, we define the performance metrics for the different applications and assess application performance in different testing environments to assure the service quality specified in the SLA. The framework will be evaluated through simulations and testbed experiments. After measuring the performance metrics, we will study the time correlations between them.
{"title":"Service Level Agreement Assurance between Cloud Services Providers and Cloud Customers","authors":"A. A. Ibrahim, D. Kliazovich, P. Bouvry","doi":"10.1109/CCGrid.2016.56","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.56","url":null,"abstract":"Cloud services providers deliver cloud services to cloud customers on pay-per-use model while the quality of the provided services are defined using service level agreements also known as SLAs. Unfortunately, there is no standard mechanism which exists to verify and assure that delivered services satisfy the signed SLA agreement in an automatic way. There is no guarantee in terms of quality. Those applications have many performance metrics. In this doctoral thesis, we propose a framework for SLA assurance, which can be used by both cloud providers and cloud users. Inside the proposed framework, we will define the performance metrics for the different applications. We will assess the applications performance in different testing environment to assure good services quality as mentioned in SLA. The proposed framework will be evaluated through simulations and using testbed experiments. After testing the applications performance by measuring the performance metrics, we will review the time correlations between those metrics.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116299387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fostering Collaboration in Energy Research and Technological Developments Applying New Exascale HPC Techniques
J. Cela, P. Navaux, A. Coutinho, R. Mayo-García
In recent years, High Performance Computing (HPC) resources have undergone a dramatic transformation, with an explosion of available parallelism and the use of special-purpose processors. Several international initiatives focus on redesigning hardware and software to achieve Exaflop capability. With this aim, the HPC4E project is applying new exascale HPC techniques to energy industry simulations, customizing them where necessary, and going beyond the state of the art in the HPC exascale simulations required for the energy sources that are the present and the future of energy: wind energy production and design, efficient combustion systems for biomass-derived fuels (biogas), and exploration geophysics for hydrocarbon reservoirs. HPC4E joins the efforts of several institutions based in Brazil and Europe.
{"title":"Fostering Collaboration in Energy Research and Technological Developments Applying New Exascale HPC Techniques","authors":"J. Cela, P. Navaux, A. Coutinho, R. Mayo-García","doi":"10.1109/CCGrid.2016.51","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.51","url":null,"abstract":"During the last years, High Performance Computing (HPC) resources have undergone a dramatic transformation, with an explosion on the available parallelism and the use of special purpose processors. There are international initiatives focusing on redesigning hardware and software in order to achieve the Exaflop capability. With this aim, the HPC4E project is applying the new exascale HPC techniques to energy industry simulations, customizing them if necessary, and going beyond the state-of-the-art in the required HPC exascale simulations for different energy sources that are the present and the future of energy: wind energy production and design, efficient combustion systems for biomass-derived fuels (biogas), and exploration geophysics for hydrocarbon reservoirs. HPC4E joins efforts of several institutions settled in Brazil and Europe.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129588101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging High Performance Computing for Bioinformatics: A Methodology that Enables a Reliable Decision-Making
Mariza Ferro, M. Nicolás, Guadalupe Del Rosario Q. Saji, A. Mury, B. Schulze
Bioinformatics could greatly benefit from the increased computational resources delivered by High Performance Computing. However, deciding which architecture delivers the best performance for a given set of Bioinformatics applications is a hard task. The traditional approach is to pick the architecture with the highest theoretical peak performance, obtained from benchmark tests. This is not a reliable basis for the decision, because each Bioinformatics application has different computational requirements, which frequently differ substantially from the usual benchmarks. We developed a methodology that assists researchers, even those whose specialty is not high performance computing, in choosing the computational infrastructure best suited to the requirements of their set of scientific applications. For this purpose, the methodology defines representative evaluation tests, including a model for selecting an appropriate benchmark when the tests endorsed by the methodology cannot be fully used. Further, a Gain Function enables reliable decision-making based on the performance of a set of applications across architectures, taking into account the relative importance between applications as well as between cost and performance.
{"title":"Leveraging High Performance Computing for Bioinformatics: A Methodology that Enables a Reliable Decision-Making","authors":"Mariza Ferro, M. Nicolás, Quadalupe Del Rosario Q. Saji, A. Mury, B. Schulze","doi":"10.1109/CCGrid.2016.69","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.69","url":null,"abstract":"Bioinformatics could greatly benefit from increased computational resources delivered by High Performance Computing. However, the decision-making about which is the best architecture to deliver good performance for a set of Bioinformatics applications is a hard task. The traditional way is finding the architecture with a high theoretical peak of performance, obtained with benchmark tests. But, this is not an assured way for this decision, because each application of Bioinformatics has different computational requirements, which frequently are much different from usual benchmarks. We developed a methodology that assists researchers, even when their specialty is not high performance computing, to define the best computational infrastructure focused on their set of scientific application requirements. For this purpose, the methodology enables to define representative evaluation tests, including a model to define the correct benchmark, when the tests endorsed by the methodology could not be fully used. Further, a Gain Function allows a reliable decision-making based on the performances of a set of applications and architectures. It is also possible to consider the relative importance between applications and also between cost and performance.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132453115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Experimental Evaluation of Distributed Heterogeneous Graph-Processing Systems
Yong Guo, A. Varbanescu, D. Epema, A. Iosup
Graph processing is increasingly used in a variety of domains, from engineering to logistics and from scientific computing to online gaming. To process graphs efficiently, GPU-enabled graph-processing systems such as TOTEM and Medusa exploit the GPU or the combined CPU+GPU capabilities of a single machine. Unlike scalable distributed CPU-based systems such as Pregel and GraphX, existing GPU-enabled systems are restricted to the resources of a single machine, including the limited amount of GPU memory, and thus cannot analyze the increasingly large-scale graphs we see in practice. To address this problem, we design and implement three families of distributed heterogeneous graph-processing systems that can use both the CPUs and GPUs of multiple machines. We further focus on graph partitioning, for which we compare existing graph-partitioning policies and a new policy specifically targeted at heterogeneity. We implement all our distributed heterogeneous systems based on the programming model of the single-machine TOTEM, to which we add (1) a new communication layer for CPUs and GPUs across multiple machines to support distributed graphs, and (2) a workload partitioning method that uses offline profiling to distribute the work on the CPUs and the GPUs. We conduct a comprehensive real-world performance evaluation of all three families. To ensure representative results, we select three typical algorithms and five datasets with different characteristics. Our results include algorithm run time, performance breakdown, scalability, graph partitioning time, and comparison with other graph-processing systems. They demonstrate the feasibility of distributed heterogeneous graph processing and show evidence of the high performance that can be achieved by combining CPUs and GPUs in a distributed environment.
{"title":"Design and Experimental Evaluation of Distributed Heterogeneous Graph-Processing Systems","authors":"Yong Guo, A. Varbanescu, D. Epema, A. Iosup","doi":"10.1109/CCGrid.2016.53","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.53","url":null,"abstract":"Graph processing is increasingly used in a variety of domains, from engineering to logistics and from scientific computing to online gaming. To process graphs efficiently, GPU-enabled graph-processing systems such as TOTEM and Medusa exploit the GPU or the combined CPU+GPU capabilities of a single machine. Unlike scalable distributed CPU-based systems such as Pregel and GraphX, existing GPU-enabled systems are restricted to the resources of a single machine, including the limited amount of GPU memory, and thus cannot analyze the increasingly large-scale graphs we see in practice. To address this problem, we design and implement three families of distributed heterogeneous graph-processing systems that can use both the CPUs and GPUs of multiple machines. We further focus on graph partitioning, for which we compare existing graph-partitioning policies and a new policy specifically targeted at heterogeneity. We implement all our distributed heterogeneous systems based on the programming model of the single-machine TOTEM, to which we add (1) a new communication layer for CPUs and GPUs across multiple machines to support distributed graphs, and (2) a workload partitioning method that uses offline profiling to distribute the work on the CPUs and the GPUs. We conduct a comprehensive real-world performance evaluation for all three families. To ensure representative results, we select 3 typical algorithms and 5 datasets with different characteristics. Our results include algorithm run time, performance breakdown, scalability, graph partitioning time, and comparison with other graph-processing systems. They demonstrate the feasibility of distributed heterogeneous graph processing and show evidence of the high performance that can be achieved by combining CPUs and GPUs in a distributed environment.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132237314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I-HASTREAM: Density-Based Hierarchical Clustering of Big Data Streams and Its Application to Big Graph Analytics Tools
Marwan Hassani, Pascal Spaus, A. Cuzzocrea, T. Seidl
Big data streams are now very popular, spurred by a plethora of modern applications such as sensor networks, scientific computing tools, Web intelligence, and social network analysis and mining tools. Here, the main research issue is how to effectively and efficiently extract useful knowledge from streaming big data in order to support innovative big data analytics platforms. To this end, clustering analysis is a well-known tool for extracting knowledge from big data streams, as confirmed by recent trends in the active literature. A special applicative case is represented by so-called graph-shaped (big) data streams, which are produced by graph sources providing both structure- and content-oriented knowledge. On top of such sources, big graph analytics is a leading scientific area to consider. At the convergence of these emerging topics, this paper provides the following contributions: (i) I-HASTREAM, a novel density-based hierarchical clustering algorithm for evolving big data streams that builds on its predecessor, HASTREAM, and (ii) the architecture of a big graph analytics engine that embeds I-HASTREAM in its core layer.
{"title":"I-HASTREAM: Density-Based Hierarchical Clustering of Big Data Streams and Its Application to Big Graph Analytics Tools","authors":"Marwan Hassani, Pascal Spaus, A. Cuzzocrea, T. Seidl","doi":"10.1109/CCGrid.2016.102","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.102","url":null,"abstract":"Big Data Streams are very popular at now, as stirred-up by a plethora of modern applications such as sensor networks, scientific computing tools, Web intelligence, social network analysis and mining tools, and so forth. Here, the main research issue consists in how to effectively and efficiently extract useful knowledge from (streaming) big data, in order to support innovative big data analytics platforms. To this end, clustering analysis is a well-known tool for extracting knowledge from big data streams, as also confirmed by recent trends in active literature. A special applicative case is represented by so-called graph-shaped data (big) streams, which are produced by graph sources providing both structure-and content-oriented knowledge. On top of such sources, big graph analytics is a leading scientific area to be considered. At the convergence of these emerging topics, in this paper we provide the following contributions: (i) I-HASTREAM, a novel density-based hierarchical clustering algorithm for evolving big data streams that founds on it predecessor, namely HASTREAM, (ii) the architecture of a big graph analytics engine that embeds I-HASTREAM in its core layer.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115009380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SHMEMPMI -- Shared Memory Based PMI for Improved Performance and Scalability
S. Chakraborty, H. Subramoni, Jonathan L. Perkins, D. Panda
Dense systems with a large number of cores per node are becoming increasingly popular. Existing designs of the Process Management Interface (PMI) show poor scalability, in both performance and memory consumption, on such systems when a large number of processes concurrently access the PMI interface. Our analysis shows that the local socket-based communication scheme used by PMI is a major bottleneck. While a shared-memory-based channel can avoid this bottleneck and thus reduce memory consumption and improve performance, such a design poses several challenges. We investigate several alternatives and propose a novel design based on a hybrid socket-plus-shared-memory communication protocol that uses multiple shared memory regions. This design reduces memory usage per node by a factor equal to the number of processes per node. Our evaluations show that memory consumption per node can be reduced by an estimated 1 GB with one million MPI processes at 16 processes per node. Additionally, the performance of PMI_Get improves by 1,000 times compared to the existing design. The proposed design is backward compatible, secure, and imposes negligible overhead.
{"title":"SHMEMPMI -- Shared Memory Based PMI for Improved Performance and Scalability","authors":"S. Chakraborty, H. Subramoni, Jonathan L. Perkins, D. Panda","doi":"10.1109/CCGrid.2016.99","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.99","url":null,"abstract":"Dense systems with large number of cores per node are becoming increasingly popular. Existing designs of the Process Management Interface (PMI) show poor scalability in terms of performance and memory consumption on such systems with large number of processes concurrently accessing the PMI interface. Our analysis shows the local socket-based communication scheme used by PMI to be a major bottleneck. While using a shared memory based channel can avoid this bottleneck and thus reduce memory consumption and improve performance, there are several challenges associated with such a design. We investigate several such alternatives and propose a novel design that is based on a hybrid socket+shared memory based communication protocol and uses multiple shared memory regions. This design can reduce the memory usage per node by a factor of Processes per Node. Our evaluations show that memory consumption per node can be reduced by an estimated 1GB with 1 million MPI processes and 16 processes per node. Additionally, performance of PMI Get is improved by 1,000 times compared to the existing design. The proposed design is backward compatible, secure, and imposes negligible overhead.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116450108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Managing Big Data Analytics Workflows with a Database System
C. Ordonez, Javier García-García
A big data analytics workflow is long and complex, with many programs, tools, and scripts interacting together. In modern organizations, a significant amount of big data analytics processing is performed outside a database system, which creates many issues for managing and processing big data analytics workflows; moreover, data preprocessing is generally the most time-consuming task in such a workflow. In this work, we defend the idea of preprocessing, computing models, and scoring data sets inside a database system. In addition, we discuss recommendations and experiences for improving big data analytics workflows by pushing data preprocessing (i.e., data cleaning, aggregation, and column transformation) into the database system. We present a discussion of practical issues and common solutions when transforming and preparing data sets to improve big data analytics workflows. As a case-study validation, based on experience from real-life big data analytics projects, we compare the pros and cons of running big data analytics workflows inside and outside the database system, and we highlight which tasks in a big data analytics workflow are easier to manage and faster when processed by the database system rather than externally.
{"title":"Managing Big Data Analytics Workflows with a Database System","authors":"C. Ordonez, Javier García-García","doi":"10.1109/CCGrid.2016.63","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.63","url":null,"abstract":"A big data analytics workflow is long and complex, with many programs, tools and scripts interacting together. In general, in modern organizations there is a significant amount of big data analytics processing performed outside a database system, which creates many issues to manage and process big data analytics workflows. In general, data preprocessing is the most time-consuming task in a big data analytics workflow. In this work, we defend the idea of preprocessing, computing models and scoring data sets inside a database system. In addition, we discuss recommendations and experiences to improve big data analytics workflows by pushing data preprocessing (i.e. data cleaning, aggregation and column transformation) into a database system. We present a discussion of practical issues and common solutions when transforming and preparing data sets to improve big data analytics workflows. As a case study validation, based on experience from real-life big data analytics projects, we compare pros and cons between running big data analytics workflows inside and outside the database system. We highlight which tasks in a big data analytics workflow are easier to manage and faster when processed by the database system, compared to external processing.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128218727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiobjective Workflow Scheduling in a Federation of Heterogeneous Green-Powered Data Centers
S. Iturriaga, Sergio Nesmachnow, Andrei Tchernykh, B. Dorronsoro
The energy consumption of large data centers has been increasing over the last decades and is currently a major concern for economic and environmental reasons. Accurate scheduling of data center operation and the use of renewable energy sources are promising solutions to this problem. In this paper we study the problem of scheduling workflows of tasks in distributed heterogeneous data centers that are partially powered by renewable energy sources. The problem takes into account quality of service, infrastructure usage, and the power consumption of machines and cooling devices. We propose a mathematical model for computing accurate scheduling solutions.
{"title":"Multiobjective Workflow Scheduling in a Federation of Heterogeneous Green-Powered Data Centers","authors":"S. Iturriaga, Sergio Nesmachnow, Andrei Tchernykh, B. Dorronsoro","doi":"10.1109/CCGrid.2016.34","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.34","url":null,"abstract":"The energy consumption of large data centers has been increasing for the last decades and currently is a major concern for economic and environmental reasons. Accurate scheduling of the data center operation and use of renewable energy sources present themselves as promising solutions for this problem. In this paper we study the problem of scheduling workflows of tasks in distributed heterogeneous data centers which are partially powered by renewable energy sources. This problem takes into account quality of service, infrastructure usage, and power consumption of machines and cooling devices. We propose a mathematical model for accurate scheduling solutions.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128080138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine Learning Approach for Cloud NoSQL Databases Performance Modeling
V. A. E. Farias, F. R. C. Sousa, J. G. R. Maia, J. Gomes, Javam C. Machado
Cloud computing is a successful, emerging paradigm that supports on-demand services with a pay-as-you-go model. With the exponential growth of data, NoSQL databases have been used to manage data in the cloud. In these newly emerging settings, mechanisms to guarantee quality of service rely heavily on performance predictability, i.e., the ability to estimate the impact of concurrent query execution on the performance of individual queries in a continuously evolving workload. This paper presents a performance modeling approach for NoSQL databases that captures the non-linear effects caused by concurrency and distribution. Experimental results confirm that our performance model accurately predicts mean response time measurements under a wide range of workload configurations.
{"title":"Machine Learning Approach for Cloud NoSQL Databases Performance Modeling","authors":"V. A. E. Farias, F. R. C. Sousa, J. G. R. Maia, J. Gomes, Javam C. Machado","doi":"10.1109/CCGrid.2016.83","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.83","url":null,"abstract":"Cloud computing is a successful, emerging paradigm that supports on-demand services with pay-as-you-go model. With the exponential growth of data, NoSQL databases have been used to manage data in the cloud. In these newly emerging settings, mechanisms to guarantee Quality of Service heavily relies on performance predictability, i.e., the ability to estimate the impact of concurrent query execution on the performance of individual queries in a continuously evolving workload. This paper presents a performance modeling approach for NoSQL databases in terms of performance metrics which is capable of capturing the non-linear effects caused by concurrency and distribution aspects. Experimental results confirm that our performance modeling can accurately predict mean response time measurements under a wide range of workload configurations.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133622312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}