Interactive ad hoc data query over massive datasets has recently gained significant traction. Massively parallel data query and analysis frameworks (e.g., Dremel, Impala) are built and deployed to support SQL-like queries over distributed and partitioned data in a cluster environment. As a result, the execution of each query is converted into a set of coordinated tasks, including data retrieval, intermediate result computation and transfer, and result aggregation. To support a high request rate of concurrent interactive queries, coordinated management of multiple cluster resources (e.g., bandwidth, CPU, memory) is critical. In this paper, we investigate this resource management problem using a utility-based optimization framework. Our goal is to optimize resource utilization and maintain fairness among different types of queries. We present a price-based algorithm that achieves this optimization objective. We implement our algorithm in the open-source Impala system and conduct a set of experiments in a cluster environment using the TPC-DS workload. Experimental results show that our coordinated resource management solution can increase the aggregate utility by at least 15.4% compared with a simple fair resource sharing mechanism, and by 63.5% compared with the FIFO resource management mechanism.
{"title":"Coordinated Resource Management for Large Scale Interactive Data Query Systems","authors":"Wei Yan, Yuan Xue","doi":"10.1109/CCGrid.2015.149","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.149","url":null,"abstract":"Interactive ad hoc data query over massive datasets has recently gained significant traction. Massively parallel data query and analysis frameworks (e.g., Dremel, Impala) are built and deployed to support SQL-like queries over distributed and partitioned data in a clustering environment. As a result, the execution of each query is converted into a set of coordinated tasks including data retrieval, intermediate result computation and transfer, and result aggregation. To support high request rate of concurrent interactive queries, coordinated management of multiple resources (e.g., bandwidth, CPU, memory) of the cluster environment is critical. In this paper, we investigate this resource management problem using an utility-based optimization framework. Our goal is to optimize the resource utilization, and maintain fairness among different types of queries. We present a price-based algorithm which achieves this optimization objective. We implement our algorithm in the open source Impala system and conduct a set of experiments in a clustering environment using the TPC-DS workload. 
Experimental results show that our coordinated resource management solution can increase the aggregate utility by at least 15.4% compared with simple fair resource share mechanism, and 63.5% compared with the FIFO resource management mechanism.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"39 7","pages":"677-686"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91438961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
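The price-based idea in the resource management abstract above can be illustrated with a small dual-decomposition sketch. This is not the authors' algorithm: it assumes logarithmic utilities (which yield proportional fairness), a fixed step size, and hypothetical names (`price_based_allocation`, `weights`, `usage`, `capacity`) chosen only for illustration. Each resource keeps a price, each query picks the rate that maximizes its utility minus its cost, and prices rise on overloaded resources and fall on idle ones.

```python
def price_based_allocation(weights, usage, capacity, step=0.01, iters=2000):
    """Maximize sum_i weights[i] * log(rate_i) subject to, for every resource r,
    sum_i usage[r][i] * rate_i <= capacity[r].

    Assumption: every query uses at least one resource (its column in `usage`
    is not all zeros), so its cost is always positive.
    """
    prices = [1.0] * len(capacity)   # one price per resource
    rates = [0.0] * len(weights)     # one rate per query
    for _ in range(iters):
        # Query side: maximizing w*log(x) - cost*x gives x = w / cost.
        for i, w in enumerate(weights):
            cost = sum(p * usage[r][i] for r, p in enumerate(prices))
            rates[i] = w / cost
        # Resource side: move each price along the excess demand.
        for r, cap in enumerate(capacity):
            load = sum(usage[r][i] * rates[i] for i in range(len(weights)))
            prices[r] = max(1e-6, prices[r] + step * (load - cap))
    return rates, prices


if __name__ == "__main__":
    # Three query types sharing one resource of capacity 10.
    rates, prices = price_based_allocation([1, 1, 2], [[1, 1, 1]], [10])
    print(rates, prices)
```

With a single shared resource, the rates should approach shares proportional to the weights (here 2.5, 2.5, and 5), which is the proportionally fair allocation for log utilities.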
Managing and optimising cloud services is one of the main challenges faced by industry and academia. A possible solution is resorting to self-management, as fostered by autonomic computing. However, the abstraction layer provided by cloud computing obfuscates several details of the provided services, which, in turn, hinders the effectiveness of autonomic managers. Data-driven approaches, particularly those relying on service clustering based on machine learning techniques, can assist the autonomic management and support decisions concerning, for example, the scheduling and deployment of services. One aspect that complicates this approach is that the information provided by the monitoring contains both continuous (e.g. CPU load) and categorical (e.g. VM instance type) data. Current approaches treat this problem in a heuristic fashion. This paper, instead, proposes an approach that uses all kinds of data and learns, in a data-driven fashion, the similarities and resource usage patterns among the services. In particular, we use an unsupervised formulation of the Random Forest algorithm to calculate similarities and provide them as input to a clustering algorithm. For the sake of efficiency and to meet the dynamism requirement of autonomic clouds, our methodology consists of two steps: (i) off-line clustering and (ii) on-line prediction. Using datasets from real-world clouds, we demonstrate the superiority of our solution with respect to others and validate the accuracy of the on-line prediction. Moreover, to show the applicability of our approach, we devise a service scheduler that uses the notion of similarity among services and evaluate it in a cloud test-bed.
{"title":"Service Clustering for Autonomic Clouds Using Random Forest","authors":"Rafael Brundo Uriarte, S. Tsaftaris, F. Tiezzi","doi":"10.1109/CCGrid.2015.41","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.41","url":null,"abstract":"Managing and optimising cloud services is one of the main challenges faced by industry and academia. A possible solution is resorting to self-management, as fostered by autonomic computing. However, the abstraction layer provided by cloud computing obfuscates several details of the provided services, which, in turn, hinders the effectiveness of autonomic managers. Data-driven approaches, particularly those relying on service clustering based on machine learning techniques, can assist the autonomic management and support decisions concerning, for example, the scheduling and deployment of services. One aspect that complicates this approach is that the information provided by the monitoring contains both continuous (e.g. CPU load) and categorical (e.g. VM instance type) data. Current approaches treat this problem in a heuristic fashion. This paper, instead, proposes an approach, which uses all kinds of data and learns in a data-driven fashion the similarities and resource usage patterns among the services. In particular, we use an unsupervised formulation of the Random Forest algorithm to calculate similarities and provide them as input to a clustering algorithm. For the sake of efficiency and meeting the dynamism requirement of autonomic clouds, our methodology consists of two steps: (i) off-line clustering and (ii) on-line prediction. Using datasets from real-world clouds, we demonstrate the superiority of our solution with respect to others and validate the accuracy of the on-line prediction. 
Moreover, to show the applicability of our approach, we devise a service scheduler that uses the notion of similarity among services and evaluate it in a cloud test-bed.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"13 1","pages":"515-524"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86945214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of application developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectively utilizing the processing units, but it fails to fully utilize the memory hierarchy and relies on fine-grained internode communication. Hybrid MPI+threads models, on the other hand, can handle internode parallelism more effectively and alleviate some of the overheads associated with internode communication by allowing more coarse-grained data movement between address spaces. The hybrid model, however, can suffer from locking and memory consistency overheads associated with data sharing. In this paper, we use a distributed implementation of the breadth-first search algorithm to understand the performance characteristics of MPI-only and MPI+threads models at scale. We start with a baseline MPI-only implementation and propose MPI+threads extensions where threads independently communicate with remote processes while cooperating for local computation. We demonstrate how the coarse-grained communication of MPI+threads considerably reduces time and space overheads that grow with the number of processes. At large scale, however, these overheads constitute performance barriers for both models and require fixing the root causes, such as excessive polling for communication progress and inefficient global synchronizations. To this end, we demonstrate various techniques to reduce such overheads and show performance improvements on up to 512K cores of a Blue Gene/Q system.
{"title":"Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS","authors":"A. Amer, Huiwei Lu, P. Balaji, S. Matsuoka","doi":"10.1109/CCGrid.2015.93","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.93","url":null,"abstract":"With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of applications developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectively utilizing the processing units, but it fails to fully utilize the memory hierarchy and relies on fine-grained internodes communication. Hybrid MPI+threads models, on the other hand, can handle internodes parallelism more effectively and alleviate some of the overheads associated with internodes communication by allowing more coarse-grained data movement between address spaces. The hybrid model, however, can suffer from locking and memory consistency overheads associated with data sharing. In this paper, we use a distributed implementation of the breadth-first search algorithm in order to understand the performance characteristics of MPI-only and MPI+threads models at scale. We start with a baseline MPI-only implementation and propose MPI+threads extensions where threads independently communicate with remote processes while cooperating for local computation. We demonstrate how the coarse-grained communication of MPI+threads considerably reduces time and space overheads that grow with the number of processes. At large scale, however, these overheads constitute performance barriers for both models and require fixing the root causes, such as the excessive polling for communication progress and inefficient global synchronizations. 
To this end, we demonstrate various techniques to reduce such overheads and show performance improvements on up to 512K cores of a Blue Gene/Q system.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"7 1","pages":"1075-1083"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87120005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
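To make the communication pattern in the BFS abstract above concrete, the following is a single-process simulation of a level-synchronous, vertex-partitioned breadth-first search. It is only a sketch under stated assumptions: the round-robin ownership rule, the per-destination outboxes, and the name `distributed_bfs` are illustrative and are not taken from the paper, and no real MPI calls are made.

```python
def distributed_bfs(adj, source, num_ranks):
    """Simulate a level-synchronous BFS over vertices partitioned across ranks.

    adj maps each integer vertex to a list of neighbours. Vertex v is owned by
    rank v % num_ranks (an assumed 1D partitioning). A real implementation would
    keep `level` distributed; one dict is used here for brevity.
    """
    def owner(v):
        return v % num_ranks

    level = {source: 0}
    frontier = [[] for _ in range(num_ranks)]
    frontier[owner(source)].append(source)
    depth = 0
    while any(frontier):
        # Each rank expands its local frontier and buffers discovered vertices
        # per destination rank (the "messages" of one communication round).
        outbox = [[[] for _ in range(num_ranks)] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for u in frontier[rank]:
                for v in adj[u]:
                    outbox[rank][owner(v)].append(v)
        depth += 1
        # Exchange step: every owner inspects what it received and keeps only
        # vertices that have not been visited yet.
        frontier = [[] for _ in range(num_ranks)]
        for dst in range(num_ranks):
            for src in range(num_ranks):
                for v in outbox[src][dst]:
                    if v not in level:
                        level[v] = depth
                        frontier[dst].append(v)
    return level


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
    print(distributed_bfs(graph, 0, num_ranks=2))
```

The number of outboxes grows with the square of the number of ranks, which hints at why per-process communication state becomes a space overhead at scale; fewer, larger ranks (as in a hybrid model) shrink that structure.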
At the center of computational structural biology, protein structure comparison is a key problem. The steady increase in the number of protein structures encourages the development of massively parallel tools. While the focus of research is to propose data-analytical methods to tackle this problem, there is limited research proposing generic tools to run these methods in parallel environments. Herein, we propose a scalable framework to handle this steady increase. The proposed framework runs the sequential tools on parallel environments. It is GUI-based and requires no scripting or installation procedures. The framework includes optimally distributing the protein structure database over the existing computing resources, tracking the remote processes' course of execution, and merging the results to form the final output. The first stage formulates the biological database distribution as an optimization problem in order to maximize cluster resource utilization and minimize the execution time. The experimental results show linear and nearly optimal speedups with no loss in accuracy. The framework is available at http://biocloud.hnu.edu.cn/ppsc/.
{"title":"A Framework to Accelerate Protein Structure Comparison Tools","authors":"Ahmad Salah, Kenli Li, Tarek F. Gharib","doi":"10.1109/CCGrid.2015.136","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.136","url":null,"abstract":"At the center of computational structural biology, protein structure comparison is a key problem. The steady increase in the number of protein structures encourages the development of massively parallel tools. While the focus of research is to propose data-analytical methods to tackle this problem, there are limited research proposing generic tools to run these methods in parallel environments. Herein, we propose a scalable framework to handle this steady increase. The proposed framework runs the sequential tools on parallel environments. It is a GUI-based and requiring no scripting or installation procedures. The framework includes optimally distributing protein structure database over the existing computing resources, tracking the remote processes course of execution, and merging the results to form the final output. The first stage realizes the biological database distribution as an optimization problem in order to maximize the cluster resources utilization and minimize the execution time. The experimental results show linear and nearly optimal speedups with no loss in accuracy. 
The framework is available at http://biocloud.hnu.edu.cn/ppsc/.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"8 1","pages":"705-708"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87359793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
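The database distribution step in the protein structure comparison abstract above is described only as an optimization problem, so the sketch below substitutes a well-known greedy heuristic, longest processing time first (LPT), to show the general shape of such a step. The per-structure cost estimates, the function name `distribute_structures`, and the assumption of identical workers are all hypothetical and are not taken from the framework.

```python
import heapq


def distribute_structures(costs, num_workers):
    """Assign structures to workers with the LPT greedy heuristic.

    costs maps a structure name to an estimated processing cost (for example,
    a value proportional to its residue count; this is an assumption).
    Returns (assignment, loads), both indexed by worker.
    """
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    assignment = [[] for _ in range(num_workers)]
    # Place the most expensive structures first, always on the least loaded worker.
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))
    loads = [sum(costs[name] for name in names) for names in assignment]
    return assignment, loads


if __name__ == "__main__":
    estimated = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
    print(distribute_structures(estimated, num_workers=2))
```

LPT does not guarantee an optimal split in general, but it keeps the slowest worker close to the average load, which is what bounds the total execution time when all workers run in parallel.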
The well-known gap between relative CPU speeds and storage bandwidth results in the need for new strategies for managing I/O demands. In large-scale MPI applications, collective I/O has long been an effective way to achieve higher I/O rates, but it poses two constraints. First, although overlapping collective I/O and computation represents the next logical step toward a faster time to solution, MPI's existing collective I/O API provides only limited support for doing so. Second, collective routines (both for I/O and communication) impose a synchronization cost in addition to a communication cost. The upcoming MPI 3.1 standard will provide a new set of nonblocking collective I/O operations to satisfy the need of applications. We present here initial work on the implementation of MPI nonblocking collective I/O operations in the MPICH MPI library. Our implementation begins with the extended two-phase algorithm used in ROMIO's collective I/O implementation. We then utilize a state machine and the extended generalized request interface to maintain the progress of nonblocking collective I/O operations. The evaluation results indicate that our implementation performs as well as blocking collective I/O in terms of I/O bandwidth and is capable of overlapping I/O and other operations. We believe that our implementation can help users try nonblocking collective I/O operations in their applications.
{"title":"Implementation and Evaluation of MPI Nonblocking Collective I/O","authors":"Sangmin Seo, R. Latham, Junchao Zhang, P. Balaji","doi":"10.1109/CCGrid.2015.81","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.81","url":null,"abstract":"The well-known gap between relative CPU speeds and storage bandwidth results in the need for new strategies for managing I/O demands. In large-scale MPI applications, collective I/O has long been an effective way to achieve higher I/O rates, but it poses two constraints. First, although overlapping collective I/O and computation represents the next logical step toward a faster time to solution, MPI's existing collective I/O API provides only limited support for doing so. Second, collective routines (both for I/O and communication) impose a synchronization cost in addition to a communication cost. The upcoming MPI 3.1 standard will provide a new set of nonblocking collective I/O operations to satisfy the need of applications. We present here initial work on the implementation of MPI nonblocking collective I/O operations in the MPICH MPI library. Our implementation begins with the extended two-phase algorithm used in ROMIO's collective I/O implementation. We then utilize a state machine and the extended generalized request interface to maintain the progress of nonblocking collective I/O operations. The evaluation results indicate that our implementation performs as well as blocking collective I/O in terms of I/O bandwidth and is capable of overlapping I/O and other operations. 
We believe that our implementation can help users try nonblocking collective I/O operations in their applications.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"33 1","pages":"1084-1091"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90546554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recently released MPI-3.0 standard introduced a process-level shared-memory interface that enables processes within the same node to have direct load/store access to each other's memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabilities and performance implications of using MPI-3.0 shared memory in the context of a five-point stencil computation. Our analysis reveals that the use of MPI-3.0 shared memory has several unforeseen performance implications, including disrupting certain compiler optimizations and incorrectly using suboptimal page sizes inside the OS. Based on this analysis, we propose several methodologies for working around these issues and improving communication performance by 40-85% compared to the current MPI-1.0 based approach.
{"title":"Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations","authors":"Xiaomin Zhu, Junchao Zhang, Kazutomo Yoshii, Shigang Li, Yunquan Zhang, P. Balaji","doi":"10.1109/CCGrid.2015.131","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.131","url":null,"abstract":"The recently released MPI-3.0 standard introduced a process-level shared-memory interface which enables processes within the same node to have direct load/store access to each others' memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabilities and performance implications of using MPI-3.0 shared memory, in the context of a five-point stencil computation. Our analysis reveals that the use of MPI-3.0 shared memory has several unforeseen performance implications including disrupting certain compiler optimizations and incorrectly using suboptimal page sizes inside the OS. Based on this analysis, we propose several methodologies for working around these issues and improving communication performance by 40-85% compared to the current MPI-1.0 based approach.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"1099-1106"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90791599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
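For readers unfamiliar with the computation studied in the stencil abstract above, here is a minimal sketch of one five-point stencil sweep, plus a two-block variant in which each block reads its neighbour's edge row directly. The direct read stands in for the load/store access that process-level shared memory provides; the code is plain single-process Python with hypothetical function names and involves no MPI.

```python
def stencil_step(grid):
    """One Jacobi-style sweep: every interior cell becomes the average of its
    four neighbours; boundary cells are left unchanged."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def stencil_step_split(top, bottom):
    """Same sweep on a grid split into two row blocks.

    Each block reads the adjacent edge row of the other block directly, as a
    halo row, instead of receiving a copy through message passing.
    """
    top_with_halo = top + [bottom[0]]
    bottom_with_halo = [top[-1]] + bottom
    new_top = stencil_step(top_with_halo)[:-1]       # drop the halo row
    new_bottom = stencil_step(bottom_with_halo)[1:]  # drop the halo row
    return new_top, new_bottom


if __name__ == "__main__":
    grid = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
    top, bottom = stencil_step_split(grid[:2], grid[2:])
    print(top + bottom == stencil_step(grid))
```

The split version produces the same grid as the single-block sweep because the only data a block needs from its neighbour is one edge row per sweep.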
In bioinformatics applications, suffix arrays are widely used for DNA sequence alignment in the initial exact-match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, such as GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array construction on GPUs. Suffix array construction on a GPU achieves around a 10x speedup on standard large data sets, which contain more than 100 million characters. The approach is simple, fast, and scalable, and can easily scale to multi-core processors and even heterogeneous architectures.
{"title":"Parallel DC3 Algorithm for Suffix Array Construction on Many-Core Accelerators","authors":"Gang Liao, Longfei Ma, Guangming Zang, L. Tang","doi":"10.1109/CCGrid.2015.56","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.56","url":null,"abstract":"In bioinformatics applications, suffix arrays are widely used to DNA sequence alignments in the initial exact match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, like GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array on GPU. As a result, suffix array construction on GPU achieves around 10x speedup on standard large data sets, which contain more than 100 million characters. The idea is simple, fast and scalable that can be easily scale to multi-core processors and even heterogeneous architectures.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"1 1","pages":"1155-1158"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89703154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
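As background for the suffix array abstract above, the sketch below shows what a suffix array is and how it supports the exact-match phase mentioned there. It deliberately uses a naive construction (sorting all suffixes) and not the DC3 algorithm or any GPU code; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction for illustration only; it is far slower than DC3.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact_matches(text, suffix_array, pattern):
    """Return all positions where pattern occurs, using two binary searches
    over the suffix array (lower and upper bound of the matching range)."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGT"
    sa = build_suffix_array(dna)
    print(find_exact_matches(dna, sa, "CGT"))
```

Once the array is built, each lookup needs only a logarithmic number of string comparisons, which is why construction time is the part worth accelerating.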
This paper proposes an approach to minimize service latency in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to limited bandwidth available at both top-of-the-rack and aggregation switches, network bandwidth must be apportioned among different intra- and inter-rack data flows in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form outer bound of service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint latency optimization over three entangled "control knobs": the bandwidth allocation at top-of-the-rack and aggregation switches, the probabilities for scheduling file requests, and the placement of encoded file chunks, which affects data locality. The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm that decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source distributed file system, Tahoe, and evaluated on a cloud testbed with 16 separate physical hosts in an OpenStack cluster. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insight on designing low-latency data center networks with erasure-coded storage.
{"title":"Taming Latency in Data Center Networking with Erasure Coded Files","authors":"Yu Xiang, V. Aggarwal, Y. Chen, Tian Lan","doi":"10.1109/CCGrid.2015.142","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.142","url":null,"abstract":"This paper proposes an approach to minimize service latency in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to limited bandwidth available at both top-of-the-rack and aggregation switches, network bandwidth must be apportioned among different intra-and inter-rack data flows in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form outer-bound of service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint latency optimization over three entangled \"control knobs\": the bandwidth allocation at top-of-the-rack and aggregation switches, the probabilities for scheduling file requests, and the placement of encoded file chunks, which affects data locality. The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm which decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source, distributed file system, Tahoe, and evaluated on a cloud tested with 16 separate physical hosts in an Open Stack cluster. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. 
The results provide valuable insight on designing low-latency data center networks with erasure-coded storage.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"45 1","pages":"241-250"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88296218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workflows are important computational tools in many branches of science, and because of the dependencies among their tasks and their widely different characteristics, scheduling them is a difficult problem. Most research on scheduling workflows has focused on the offline problem of minimizing the makespan of single workflows with known task runtimes. The problem of scheduling multiple workflows has been addressed either in an offline fashion, or still with the assumption of known task runtimes. In this paper, we study the problem of scheduling workloads consisting of an arrival stream of workflows without task runtime estimates. The resource requirements of a workflow can fluctuate significantly during its execution. Thus, we present four scheduling policies for workloads of workflows, whose main distinguishing feature is the extent to which they reserve processors for workflows to deal with these fluctuations. We perform simulations with realistic synthetic workloads and show that any form of processor reservation only decreases the overall system performance and that a greedy backfilling-like policy performs best.
{"title":"Scheduling Workloads of Workflows with Unknown Task Runtimes","authors":"A. Ilyushkin, Bogdan Ghit, D. Epema","doi":"10.1109/CCGrid.2015.27","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.27","url":null,"abstract":"Workflows are important computational tools in many branches of science, and because of the dependencies among their tasks and their widely different characteristics, scheduling them is a difficult problem. Most research on scheduling workflows has focused on the offline problem of minimizing the make span of single workflows with known task runtimes. The problem of scheduling multiple workflows has been addressed either in an offline fashion, or still with the assumption of known task runtimes. In this paper, we study the problem of scheduling workloads consisting of an arrival stream of workflows without task runtime estimates. The resource requirements of a workflow can significantly fluctuate during its execution. Thus, we present four scheduling policies for workloads of workflows with as their main feature the extent to which they reserve processors to workflows to deal with these fluctuations. We perform simulations with realistic synthetic workloads and we show that any form of processor reservation only decreases the overall system performance and that a greedy backfilling-like policy performs best.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"114 1","pages":"606-616"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79884728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
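The setting in the workflow scheduling abstract above can be sketched as a small event-driven simulation. This is a simplified stand-in, not one of the four policies from the paper: it assumes single-processor tasks, a first-come-first-served queue of ready tasks, and no reservation of any kind, and it uses a hypothetical function name. Task runtimes are visible to the simulator only; the scheduling decision never looks at them.

```python
import heapq
from collections import deque


def greedy_schedule(workflows, num_procs):
    """Simulate greedy scheduling of an arrival stream of workflows.

    workflows: list of (arrival_time, tasks), where tasks maps a task name to
    (runtime, [names of tasks it depends on]).
    Returns {workflow index: completion time}.
    """
    arrivals = deque(sorted((arr, wid) for wid, (arr, _) in enumerate(workflows)))
    done = [set() for _ in workflows]     # finished tasks per workflow
    queued = [set() for _ in workflows]   # tasks already released as ready
    ready, running, finished = deque(), [], {}
    now, free = 0.0, num_procs

    def release_ready(wid):
        # A task becomes ready once all of its dependencies have finished.
        for name, (_, deps) in workflows[wid][1].items():
            if name not in queued[wid] and all(d in done[wid] for d in deps):
                queued[wid].add(name)
                ready.append((wid, name))

    while arrivals or ready or running:
        while arrivals and arrivals[0][0] <= now:
            release_ready(arrivals.popleft()[1])
        while free and ready:             # greedy: never leave a processor idle
            wid, name = ready.popleft()
            heapq.heappush(running, (now + workflows[wid][1][name][0], wid, name))
            free -= 1
        next_finish = running[0][0] if running else float("inf")
        next_arrival = arrivals[0][0] if arrivals else float("inf")
        if next_arrival < next_finish:    # next event is a workflow arrival
            now = next_arrival
            continue
        now, wid, name = heapq.heappop(running)  # next event is a task finishing
        free += 1
        done[wid].add(name)
        if len(done[wid]) == len(workflows[wid][1]):
            finished[wid] = now
        release_ready(wid)
    return finished


if __name__ == "__main__":
    stream = [
        (0, {"a": (2, []), "b": (3, ["a"]), "c": (1, ["a"])}),
        (1, {"x": (4, [])}),
    ]
    print(greedy_schedule(stream, num_procs=2))
```

In this example the second workflow starts on the processor the first one leaves idle while its only ready task runs, which is the kind of utilization a reservation-based policy would give up.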
Min Si, Antonio J. Peña, J. Hammond, P. Balaji, Y. Ishikawa
NWChem is one of the most widely used computational chemistry application suites for chemical and biological systems. Despite its vast success, the computational efficiency of NWChem is still low. This is especially true in higher accuracy methods such as the CCSD(T) coupled cluster method, where it currently achieves a mere 50% computational efficiency when run at large scales. In this paper, we demonstrate the most computationally efficient scaling of NWChem CCSD(T) to date, and use it to solve large water clusters. We use our recently proposed process-based asynchronous progress framework for MPI RMA, called Casper, to scale the computation on water clusters at near-100% computational efficiency on up to 12288 cores.
{"title":"Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA","authors":"Min Si, Antonio J. Peña, J. Hammond, P. Balaji, Y. Ishikawa","doi":"10.1109/CCGrid.2015.48","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.48","url":null,"abstract":"NWChem is one of the most widely used computational chemistry application suites for chemical and biological systems. Despite its vast success, the computational efficiency of NWChem is still low. This is especially true in higher accuracy methods such as the CCSD(T) coupled cluster method, where it currently achieves a mere 50% computational efficiency when run at large scales. In this paper, we demonstrate the most computationally efficient scaling of NWChem CCSD(T) to date, and use it to solve large water clusters. We use our recently proposed process-based asynchronous progress framework for MPI RMA, called Casper, to scale the computation on water clusters at near-100% computational efficiency on up to 12288 cores.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"3 1","pages":"811-816"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90186568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}