
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing: Latest Publications

Coordinated Resource Management for Large Scale Interactive Data Query Systems
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.149
Wei Yan, Yuan Xue
Interactive ad hoc data query over massive datasets has recently gained significant traction. Massively parallel data query and analysis frameworks (e.g., Dremel, Impala) are built and deployed to support SQL-like queries over distributed and partitioned data in a clustering environment. As a result, the execution of each query is converted into a set of coordinated tasks including data retrieval, intermediate result computation and transfer, and result aggregation. To support a high request rate of concurrent interactive queries, coordinated management of multiple resources (e.g., bandwidth, CPU, memory) of the cluster environment is critical. In this paper, we investigate this resource management problem using a utility-based optimization framework. Our goal is to optimize resource utilization and maintain fairness among different types of queries. We present a price-based algorithm which achieves this optimization objective. We implement our algorithm in the open source Impala system and conduct a set of experiments in a clustering environment using the TPC-DS workload. Experimental results show that our coordinated resource management solution can increase the aggregate utility by at least 15.4% compared with a simple fair resource-share mechanism, and by 63.5% compared with a FIFO resource management mechanism.
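The abstract does not spell out the optimization; a standard network-utility-maximization reading of a "utility-based framework" solved by a "price-based algorithm" (an assumed formulation, not the paper's exact model) is

\[
\max_{x \ge 0} \; \sum_{q} U_q(x_q)
\quad \text{s.t.} \quad
\sum_{q} a_{r,q}\, x_q \le C_r \quad \forall r \in \{\text{bandwidth, CPU, memory}\},
\]

where \(x_q\) is the rate granted to query class \(q\), \(a_{r,q}\) its per-unit demand for resource \(r\), and \(C_r\) the cluster capacity. A price-based (dual) iteration then alternates

\[
p_r \leftarrow \Big[\, p_r + \gamma \Big( \sum_{q} a_{r,q}\, x_q - C_r \Big) \Big]^{+},
\qquad
x_q \leftarrow \arg\max_{x \ge 0} \; U_q(x) - x \sum_{r} p_r\, a_{r,q},
\]

with fairness across query types following from the choice of the utilities \(U_q\).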
{"title":"Coordinated Resource Management for Large Scale Interactive Data Query Systems","authors":"Wei Yan, Yuan Xue","doi":"10.1109/CCGrid.2015.149","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.149","url":null,"abstract":"Interactive ad hoc data query over massive datasets has recently gained significant traction. Massively parallel data query and analysis frameworks (e.g., Dremel, Impala) are built and deployed to support SQL-like queries over distributed and partitioned data in a clustering environment. As a result, the execution of each query is converted into a set of coordinated tasks including data retrieval, intermediate result computation and transfer, and result aggregation. To support high request rate of concurrent interactive queries, coordinated management of multiple resources (e.g., bandwidth, CPU, memory) of the cluster environment is critical. In this paper, we investigate this resource management problem using an utility-based optimization framework. Our goal is to optimize the resource utilization, and maintain fairness among different types of queries. We present a price-based algorithm which achieves this optimization objective. We implement our algorithm in the open source Impala system and conduct a set of experiments in a clustering environment using the TPC-DS workload. Experimental results show that our coordinated resource management solution can increase the aggregate utility by at least 15.4% compared with simple fair resource share mechanism, and 63.5% compared with the FIFO resource management mechanism.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"39 7","pages":"677-686"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91438961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Service Clustering for Autonomic Clouds Using Random Forest
Rafael Brundo Uriarte, S. Tsaftaris, F. Tiezzi
Managing and optimising cloud services is one of the main challenges faced by industry and academia. A possible solution is resorting to self-management, as fostered by autonomic computing. However, the abstraction layer provided by cloud computing obfuscates several details of the provided services, which, in turn, hinders the effectiveness of autonomic managers. Data-driven approaches, particularly those relying on service clustering based on machine learning techniques, can assist the autonomic management and support decisions concerning, for example, the scheduling and deployment of services. One aspect that complicates this approach is that the information provided by monitoring contains both continuous (e.g. CPU load) and categorical (e.g. VM instance type) data. Current approaches treat this problem in a heuristic fashion. This paper instead proposes an approach that uses all kinds of data and learns, in a data-driven fashion, the similarities and resource usage patterns among the services. In particular, we use an unsupervised formulation of the Random Forest algorithm to calculate similarities and provide them as input to a clustering algorithm. For the sake of efficiency and meeting the dynamism requirement of autonomic clouds, our methodology consists of two steps: (i) off-line clustering and (ii) on-line prediction. Using datasets from real-world clouds, we demonstrate the superiority of our solution with respect to others and validate the accuracy of the on-line prediction. Moreover, to show the applicability of our approach, we devise a service scheduler that uses the notion of similarity among services and evaluate it in a cloud test-bed.
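The unsupervised Random Forest similarity referred to here is usually the forest proximity measure; the definition below is the standard one and is our assumption of what is meant, not a quotation from the paper:

\[
\mathrm{sim}(i,j) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\big[\ell_t(i) = \ell_t(j)\big],
\qquad
d(i,j) = 1 - \mathrm{sim}(i,j),
\]

where \(\ell_t(x)\) is the leaf to which tree \(t\) of a \(T\)-tree forest routes service \(x\). The forest handles continuous and categorical monitoring features uniformly, and the resulting distance \(d\) is what the off-line clustering step would consume.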
{"title":"Service Clustering for Autonomic Clouds Using Random Forest","authors":"Rafael Brundo Uriarte, S. Tsaftaris, F. Tiezzi","doi":"10.1109/CCGrid.2015.41","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.41","url":null,"abstract":"Managing and optimising cloud services is one of the main challenges faced by industry and academia. A possible solution is resorting to self-management, as fostered by autonomic computing. However, the abstraction layer provided by cloud computing obfuscates several details of the provided services, which, in turn, hinders the effectiveness of autonomic managers. Data-driven approaches, particularly those relying on service clustering based on machine learning techniques, can assist the autonomic management and support decisions concerning, for example, the scheduling and deployment of services. One aspect that complicates this approach is that the information provided by the monitoring contains both continuous (e.g. CPU load) and categorical (e.g. VM instance type) data. Current approaches treat this problem in a heuristic fashion. This paper, instead, proposes an approach, which uses all kinds of data and learns in a data-driven fashion the similarities and resource usage patterns among the services. In particular, we use an unsupervised formulation of the Random Forest algorithm to calculate similarities and provide them as input to a clustering algorithm. For the sake of efficiency and meeting the dynamism requirement of autonomic clouds, our methodology consists of two steps: (i) off-line clustering and (ii) on-line prediction. Using datasets from real-world clouds, we demonstrate the superiority of our solution with respect to others and validate the accuracy of the on-line prediction. Moreover, to show the applicability of our approach, we devise a service scheduler that uses the notion of similarity among services and evaluate it in a cloud test-bed.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"13 1","pages":"515-524"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86945214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS
A. Amer, Huiwei Lu, P. Balaji, S. Matsuoka
With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of application developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectively utilizing the processing units, but it fails to fully utilize the memory hierarchy and relies on fine-grained inter-node communication. Hybrid MPI+threads models, on the other hand, can handle inter-node parallelism more effectively and alleviate some of the overheads associated with inter-node communication by allowing more coarse-grained data movement between address spaces. The hybrid model, however, can suffer from locking and memory consistency overheads associated with data sharing. In this paper, we use a distributed implementation of the breadth-first search algorithm in order to understand the performance characteristics of MPI-only and MPI+threads models at scale. We start with a baseline MPI-only implementation and propose MPI+threads extensions where threads independently communicate with remote processes while cooperating on local computation. We demonstrate how the coarse-grained communication of MPI+threads considerably reduces time and space overheads that grow with the number of processes. At large scale, however, these overheads constitute performance barriers for both models and require fixing the root causes, such as the excessive polling for communication progress and inefficient global synchronizations. To this end, we demonstrate various techniques to reduce such overheads and show performance improvements on up to 512K cores of a Blue Gene/Q system.
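A minimal skeleton of the hybrid setup described above, in which every thread issues its own MPI calls (an illustrative sketch, not the authors' BFS code; the ring neighbour and tag scheme are placeholders, and the same thread count is assumed on every rank):

/* Hybrid MPI+threads skeleton: each OpenMP thread communicates independently. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* MPI_THREAD_MULTIPLE is required for concurrent MPI calls from all threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int peer = (rank + 1) % size;            /* illustrative ring neighbour */
        int sendbuf = rank * 100 + tid, recvbuf = -1;

        /* Each thread exchanges its own share of frontier data with the peer,
         * using its thread id as the tag to keep per-thread flows separate. */
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, tid,
                     &recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}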
{"title":"Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS","authors":"A. Amer, Huiwei Lu, P. Balaji, S. Matsuoka","doi":"10.1109/CCGrid.2015.93","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.93","url":null,"abstract":"With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of applications developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectively utilizing the processing units, but it fails to fully utilize the memory hierarchy and relies on fine-grained internodes communication. Hybrid MPI+threads models, on the other hand, can handle internodes parallelism more effectively and alleviate some of the overheads associated with internodes communication by allowing more coarse-grained data movement between address spaces. The hybrid model, however, can suffer from locking and memory consistency overheads associated with data sharing. In this paper, we use a distributed implementation of the breadth-first search algorithm in order to understand the performance characteristics of MPI-only and MPI+threads models at scale. We start with a baseline MPI-only implementation and propose MPI+threads extensions where threads independently communicate with remote processes while cooperating for local computation. We demonstrate how the coarse-grained communication of MPI+threads considerably reduces time and space overheads that grow with the number of processes. At large scale, however, these overheads constitute performance barriers for both models and require fixing the root causes, such as the excessive polling for communication progress and inefficient global synchronizations. To this end, we demonstrate various techniques to reduce such overheads and show performance improvements on up to 512K cores of a Blue Gene/Q system.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"7 1","pages":"1075-1083"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87120005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
A Framework to Accelerate Protein Structure Comparison Tools
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.136
Ahmad Salah, Kenli Li, Tarek F. Gharib
At the center of computational structural biology, protein structure comparison is a key problem. The steady increase in the number of protein structures encourages the development of massively parallel tools. While the focus of research is to propose data-analytical methods to tackle this problem, there is limited research proposing generic tools to run these methods in parallel environments. Herein, we propose a scalable framework to handle this steady increase. The proposed framework runs the sequential tools on parallel environments. It is GUI-based and requires no scripting or installation procedures. The framework includes optimally distributing the protein structure database over the existing computing resources, tracking the remote processes' course of execution, and merging the results to form the final output. The first stage formulates the biological database distribution as an optimization problem in order to maximize cluster resource utilization and minimize execution time. The experimental results show linear and nearly optimal speedups with no loss in accuracy. The framework is available at http://biocloud.hnu.edu.cn/ppsc/.
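One plausible way to write the database-distribution stage as an optimization problem (a sketch under assumed notation; the abstract does not state the exact objective) is

\[
\min_{n_1,\dots,n_M} \; \max_{m} \frac{n_m}{s_m}
\quad \text{s.t.} \quad
\sum_{m=1}^{M} n_m = N, \qquad n_m \ge 0,
\]

where the \(N\) protein structures are split across \(M\) nodes of measured speeds \(s_m\). The optimum assigns \(n_m \propto s_m\), i.e., each node receives a share of the database proportional to its speed, so all nodes stay busy for roughly the same time.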
{"title":"A Framework to Accelerate Protein Structure Comparison Tools","authors":"Ahmad Salah, Kenli Li, Tarek F. Gharib","doi":"10.1109/CCGrid.2015.136","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.136","url":null,"abstract":"At the center of computational structural biology, protein structure comparison is a key problem. The steady increase in the number of protein structures encourages the development of massively parallel tools. While the focus of research is to propose data-analytical methods to tackle this problem, there are limited research proposing generic tools to run these methods in parallel environments. Herein, we propose a scalable framework to handle this steady increase. The proposed framework runs the sequential tools on parallel environments. It is a GUI-based and requiring no scripting or installation procedures. The framework includes optimally distributing protein structure database over the existing computing resources, tracking the remote processes course of execution, and merging the results to form the final output. The first stage realizes the biological database distribution as an optimization problem in order to maximize the cluster resources utilization and minimize the execution time. The experimental results show linear and nearly optimal speedups with no loss in accuracy. The framework is available at http://biocloud.hnu.edu.cn/ppsc/.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"8 1","pages":"705-708"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87359793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Implementation and Evaluation of MPI Nonblocking Collective I/O
Sangmin Seo, R. Latham, Junchao Zhang, P. Balaji
The well-known gap between relative CPU speeds and storage bandwidth results in the need for new strategies for managing I/O demands. In large-scale MPI applications, collective I/O has long been an effective way to achieve higher I/O rates, but it poses two constraints. First, although overlapping collective I/O and computation represents the next logical step toward a faster time to solution, MPI's existing collective I/O API provides only limited support for doing so. Second, collective routines (both for I/O and communication) impose a synchronization cost in addition to a communication cost. The upcoming MPI 3.1 standard will provide a new set of nonblocking collective I/O operations to satisfy the need of applications. We present here initial work on the implementation of MPI nonblocking collective I/O operations in the MPICH MPI library. Our implementation begins with the extended two-phase algorithm used in ROMIO's collective I/O implementation. We then utilize a state machine and the extended generalized request interface to maintain the progress of nonblocking collective I/O operations. The evaluation results indicate that our implementation performs as well as blocking collective I/O in terms of I/O bandwidth and is capable of overlapping I/O and other operations. We believe that our implementation can help users try nonblocking collective I/O operations in their applications.
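For concreteness, the new interface permits patterns like the following sketch, which starts a collective write and overlaps it with computation (this assumes an MPI library that already provides the MPI 3.1 nonblocking collective I/O routines, e.g. a recent MPICH; the file name and sizes are placeholders):

/* Overlapping a nonblocking collective write with computation (MPI 3.1 sketch). */
#include <mpi.h>
#include <stdlib.h>

#define N 1048576

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Request req;
    int rank;
    double *buf = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) buf[i] = rank + i;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * N * sizeof(double),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* Start the collective write and return immediately. */
    MPI_File_iwrite_all(fh, buf, N, MPI_DOUBLE, &req);

    /* ... computation on other buffers can proceed here, overlapped with I/O ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the collective operation */
    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}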
{"title":"Implementation and Evaluation of MPI Nonblocking Collective I/O","authors":"Sangmin Seo, R. Latham, Junchao Zhang, P. Balaji","doi":"10.1109/CCGrid.2015.81","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.81","url":null,"abstract":"The well-known gap between relative CPU speeds and storage bandwidth results in the need for new strategies for managing I/O demands. In large-scale MPI applications, collective I/O has long been an effective way to achieve higher I/O rates, but it poses two constraints. First, although overlapping collective I/O and computation represents the next logical step toward a faster time to solution, MPI's existing collective I/O API provides only limited support for doing so. Second, collective routines (both for I/O and communication) impose a synchronization cost in addition to a communication cost. The upcoming MPI 3.1 standard will provide a new set of nonblocking collective I/O operations to satisfy the need of applications. We present here initial work on the implementation of MPI nonblocking collective I/O operations in the MPICH MPI library. Our implementation begins with the extended two-phase algorithm used in ROMIO's collective I/O implementation. We then utilize a state machine and the extended generalized request interface to maintain the progress of nonblocking collective I/O operations. The evaluation results indicate that our implementation performs as well as blocking collective I/O in terms of I/O bandwidth and is capable of overlapping I/O and other operations. We believe that our implementation can help users try nonblocking collective I/O operations in their applications.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"33 1","pages":"1084-1091"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90546554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.131
Xiaomin Zhu, Junchao Zhang, Kazutomo Yoshii, Shigang Li, Yunquan Zhang, P. Balaji
The recently released MPI-3.0 standard introduced a process-level shared-memory interface which enables processes within the same node to have direct load/store access to each other's memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabilities and performance implications of using MPI-3.0 shared memory, in the context of a five-point stencil computation. Our analysis reveals that the use of MPI-3.0 shared memory has several unforeseen performance implications, including disrupting certain compiler optimizations and incorrectly using suboptimal page sizes inside the OS. Based on this analysis, we propose several methodologies for working around these issues and improving communication performance by 40-85% compared to the current MPI-1.0 based approach.
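A short sketch of the interface under study: processes on one node are grouped with MPI_Comm_split_type, allocate a node-wide shared window, and then touch each other's segments with plain loads and stores (illustrative only; the segment size and neighbour access are placeholders, not the paper's stencil code):

/* MPI-3.0 process-level shared memory: allocate, query, and load/store directly. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win win;
    MPI_Aint sz;
    int noderank, nodesize, disp;
    double *mybase, *leftbase;

    MPI_Init(&argc, &argv);
    /* Group the processes that can share physical memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &noderank);
    MPI_Comm_size(nodecomm, &nodesize);

    /* Each process contributes a 1024-double segment to one shared window. */
    MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);

    /* Obtain a direct pointer to the left neighbour's segment. */
    int left = (noderank + nodesize - 1) % nodesize;
    MPI_Win_shared_query(win, left, &sz, &disp, &leftbase);

    MPI_Win_fence(0, win);
    mybase[0] = noderank;              /* plain store into my own segment */
    MPI_Win_fence(0, win);             /* after the fence, all stores are visible */
    printf("rank %d reads %.0f from its neighbour\n", noderank, leftbase[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}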
{"title":"Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations","authors":"Xiaomin Zhu, Junchao Zhang, Kazutomo Yoshii, Shigang Li, Yunquan Zhang, P. Balaji","doi":"10.1109/CCGrid.2015.131","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.131","url":null,"abstract":"The recently released MPI-3.0 standard introduced a process-level shared-memory interface which enables processes within the same node to have direct load/store access to each others' memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabilities and performance implications of using MPI-3.0 shared memory, in the context of a five-point stencil computation. Our analysis reveals that the use of MPI-3.0 shared memory has several unforeseen performance implications including disrupting certain compiler optimizations and incorrectly using suboptimal page sizes inside the OS. Based on this analysis, we propose several methodologies for working around these issues and improving communication performance by 40-85% compared to the current MPI-1.0 based approach.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"1099-1106"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90791599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Parallel DC3 Algorithm for Suffix Array Construction on Many-Core Accelerators
Gang Liao, Longfei Ma, Guangming Zang, L. Tang
In bioinformatics applications, suffix arrays are widely used for DNA sequence alignment in the initial exact-match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, like GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array construction on GPU. As a result, suffix array construction on GPU achieves around 10x speedup on standard large data sets, which contain more than 100 million characters. The idea is simple, fast, and scalable, and can easily be extended to multi-core processors and even heterogeneous architectures.
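As background on why DC3 maps well onto accelerators: the algorithm radix-sorts the length-3 prefixes of the suffixes starting at positions \(i\) with \(i \bmod 3 \neq 0\), recurses on the reduced string, and merges in the remaining third, so its work satisfies

\[
T(n) = T\!\left(\tfrac{2n}{3}\right) + O(n) = O(n),
\]

and every pass (radix sort, ranking, merge) is expressible with standard data-parallel primitives. This is generic DC3 background, not a description of the specific GPU kernels used in the paper.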
{"title":"Parallel DC3 Algorithm for Suffix Array Construction on Many-Core Accelerators","authors":"Gang Liao, Longfei Ma, Guangming Zang, L. Tang","doi":"10.1109/CCGrid.2015.56","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.56","url":null,"abstract":"In bioinformatics applications, suffix arrays are widely used to DNA sequence alignments in the initial exact match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, like GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array on GPU. As a result, suffix array construction on GPU achieves around 10x speedup on standard large data sets, which contain more than 100 million characters. The idea is simple, fast and scalable that can be easily scale to multi-core processors and even heterogeneous architectures.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"1 1","pages":"1155-1158"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89703154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Taming Latency in Data Center Networking with Erasure Coded Files
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.142
Yu Xiang, V. Aggarwal, Y. Chen, Tian Lan
This paper proposes an approach to minimize service latency in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to limited bandwidth available at both top-of-rack and aggregation switches, network bandwidth must be apportioned among different intra- and inter-rack data flows in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form outer bound of service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint latency optimization over three entangled "control knobs": the bandwidth allocation at top-of-rack and aggregation switches, the probabilities for scheduling file requests, and the placement of encoded file chunks, which affects data locality. The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm which decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source distributed file system, Tahoe, and evaluated on a cloud testbed with 16 separate physical hosts in an OpenStack cluster. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insight on designing low-latency data center networks with erasure-coded storage.
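Stripped of the queuing details, the joint problem couples three groups of variables; a structural sketch under assumed notation (not the paper's exact model) is

\[
\min_{w,\;\pi,\;S} \;\; \sum_{i} \frac{\lambda_i}{\sum_{j}\lambda_j}\,\bar{T}_i(w,\pi,S)
\quad \text{s.t.} \quad
\sum_{f \in \mathcal{F}(l)} w_f \le B_l \;\; \forall\, l,
\qquad
\sum_{k} \pi_{i,k} = 1 \;\; \forall\, i,
\]

where \(w\) apportions the bandwidth \(B_l\) of each switch link \(l\) among its flows \(\mathcal{F}(l)\), \(\pi_{i,k}\) is the probability that a class-\(i\) request is served from chunk subset \(k\), \(S\) is the placement of the encoded chunks, \(\lambda_i\) are request arrival rates, and \(\bar{T}_i\) is the closed-form latency upper bound from the weighted-queuing analysis. The iterative algorithm then alternates over the three groups of variables.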
{"title":"Taming Latency in Data Center Networking with Erasure Coded Files","authors":"Yu Xiang, V. Aggarwal, Y. Chen, Tian Lan","doi":"10.1109/CCGrid.2015.142","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.142","url":null,"abstract":"This paper proposes an approach to minimize service latency in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to limited bandwidth available at both top-of-the-rack and aggregation switches, network bandwidth must be apportioned among different intra-and inter-rack data flows in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form outer-bound of service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint latency optimization over three entangled \"control knobs\": the bandwidth allocation at top-of-the-rack and aggregation switches, the probabilities for scheduling file requests, and the placement of encoded file chunks, which affects data locality. The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm which decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source, distributed file system, Tahoe, and evaluated on a cloud tested with 16 separate physical hosts in an Open Stack cluster. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insight on designing low-latency data center networks with erasure-coded storage.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"45 1","pages":"241-250"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88296218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Scheduling Workloads of Workflows with Unknown Task Runtimes
A. Ilyushkin, Bogdan Ghit, D. Epema
Workflows are important computational tools in many branches of science, and because of the dependencies among their tasks and their widely different characteristics, scheduling them is a difficult problem. Most research on scheduling workflows has focused on the offline problem of minimizing the makespan of single workflows with known task runtimes. The problem of scheduling multiple workflows has been addressed either in an offline fashion, or still with the assumption of known task runtimes. In this paper, we study the problem of scheduling workloads consisting of an arrival stream of workflows without task runtime estimates. The resource requirements of a workflow can significantly fluctuate during its execution. Thus, we present four scheduling policies for workloads of workflows whose main distinguishing feature is the extent to which they reserve processors for workflows to deal with these fluctuations. We perform simulations with realistic synthetic workloads and we show that any form of processor reservation only decreases the overall system performance and that a greedy backfilling-like policy performs best.
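A minimal sketch of the greedy backfilling-like behaviour referred to at the end (our illustration, not the paper's simulator): at every scheduling event, scan the eligible tasks of all admitted workflows in workflow arrival order and start any task for which a processor is free, with no reservation.

/* Greedy, reservation-free scheduling pass over eligible workflow tasks (illustrative). */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    int workflow_id;   /* owning workflow, listed in workflow arrival order */
    int ready;         /* 1 if all predecessor tasks have finished */
    int running;       /* 1 if already assigned a processor */
} Task;

/* Start as many ready tasks as free processors allow; returns the number started. */
static int schedule_pass(Task *tasks, size_t ntasks, int *free_procs)
{
    int started = 0;
    for (size_t i = 0; i < ntasks && *free_procs > 0; i++) {
        if (tasks[i].ready && !tasks[i].running) {
            tasks[i].running = 1;   /* greedily take one processor, reserve nothing */
            (*free_procs)--;
            started++;
        }
    }
    return started;
}

int main(void)
{
    Task tasks[] = { {0, 1, 0}, {0, 0, 0}, {1, 1, 0}, {2, 1, 0} };
    int free_procs = 2;
    int n = schedule_pass(tasks, 4, &free_procs);
    printf("started %d tasks, %d processors left\n", n, free_procs);
    return 0;
}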
{"title":"Scheduling Workloads of Workflows with Unknown Task Runtimes","authors":"A. Ilyushkin, Bogdan Ghit, D. Epema","doi":"10.1109/CCGrid.2015.27","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.27","url":null,"abstract":"Workflows are important computational tools in many branches of science, and because of the dependencies among their tasks and their widely different characteristics, scheduling them is a difficult problem. Most research on scheduling workflows has focused on the offline problem of minimizing the make span of single workflows with known task runtimes. The problem of scheduling multiple workflows has been addressed either in an offline fashion, or still with the assumption of known task runtimes. In this paper, we study the problem of scheduling workloads consisting of an arrival stream of workflows without task runtime estimates. The resource requirements of a workflow can significantly fluctuate during its execution. Thus, we present four scheduling policies for workloads of workflows with as their main feature the extent to which they reserve processors to workflows to deal with these fluctuations. We perform simulations with realistic synthetic workloads and we show that any form of processor reservation only decreases the overall system performance and that a greedy backfilling-like policy performs best.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"114 1","pages":"606-616"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79884728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA
Min Si, Antonio J. Peña, J. Hammond, P. Balaji, Y. Ishikawa
NWChem is one of the most widely used computational chemistry application suites for chemical and biological systems. Despite its vast success, the computational efficiency of NWChem is still low. This is especially true in higher accuracy methods such as the CCSD(T) coupled cluster method, where it currently achieves a mere 50% computational efficiency when run at large scales. In this paper, we demonstrate the most computationally efficient scaling of NWChem CCSD(T) to date, and use it to solve large water clusters. We use our recently proposed process-based asynchronous progress framework for MPI RMA, called Casper, to scale the computation on water clusters at near-100% computational efficiency on up to 12288 cores.
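An illustrative MPI RMA pattern of the kind whose progress Casper accelerates (not NWChem code; Casper is linked in transparently and dedicates a configurable number of cores per node as "ghost" processes that drive progress for the others):

/* One-sided fetch-and-add on a neighbour's window; completion relies on progress
 * at the target, which Casper's ghost processes provide while the target computes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Win win;
    double *base, one = 1.0, result;
    int rank, size, target;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    base[0] = 0.0;
    MPI_Barrier(MPI_COMM_WORLD);            /* make the initialization visible */
    target = (rank + 1) % size;

    MPI_Win_lock_all(0, win);
    MPI_Get_accumulate(&one, 1, MPI_DOUBLE, &result, 1, MPI_DOUBLE,
                       target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_flush(target, win);             /* remote update is now complete */
    MPI_Win_unlock_all(win);

    printf("rank %d saw %.1f at rank %d before its update\n", rank, result, target);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}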
{"title":"Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA","authors":"Min Si, Antonio J. Peña, J. Hammond, P. Balaji, Y. Ishikawa","doi":"10.1109/CCGrid.2015.48","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.48","url":null,"abstract":"NWChem is one of the most widely used computational chemistry application suites for chemical and biological systems. Despite its vast success, the computational efficiency of NWChem is still low. This is especially true in higher accuracy methods such as the CCSD(T) coupled cluster method, where it currently achieves a mere 50% computational efficiency when run at large scales. In this paper, we demonstrate the most computationally efficient scaling of NWChem CCSD(T) to date, and use it to solve large water clusters. We use our recently proposed process-based asynchronous progress framework for MPI RMA, called Casper, to scale the computation on water clusters at near-100% computational efficiency on up to 12288 cores.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"3 1","pages":"811-816"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90186568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7