With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA) as an online public repository of availability traces taken from diverse parallel and distributed systems. Our main contributions in this study are the following. First, we describe the design of the archive, in particular the rationale of the standard FTA format, and the design of a toolbox that facilitates automated analysis of trace data sets. Second, applying the toolbox, we present a uniform comparative analysis with statistics and models of failures in nine distributed systems. Third, we show how different interpretations of these data sets can result in different conclusions. This emphasizes the critical need for the public availability of trace data and methods for their analysis.
{"title":"The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems","authors":"Derrick Kondo, Bahman Javadi, A. Iosup, D. Epema","doi":"10.1109/CCGRID.2010.71","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.71","url":null,"abstract":"With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA) as an online public repository of availability traces taken from diverse parallel and distributed systems. Our main contributions in this study are the following. First, we describe the design of the archive, in particular the rationale of the standard FTA format, and the design of a toolbox that facilitates automated analysis of trace data sets. Second, applying the toolbox, we present a uniform comparative analysis with statistics and models of failures in nine distributed systems. Third, we show how different interpretations of these data sets can result in different conclusions. 
This emphasizes the critical need for the public availability of trace data and methods for their analysis.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117175771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
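The archive's value comes from making basic availability statistics reproducible across traces. As an illustrative sketch only (this abstract does not specify the actual FTA format or toolbox), the following computes mean time between failures (MTBF) and mean time to repair (MTTR) from hypothetical per-node availability intervals:

```python
from statistics import mean

def failure_stats(intervals):
    """Compute (MTBF, MTTR) from per-node availability intervals.

    `intervals` maps node_id -> sorted list of (start, end) tuples,
    each a period during which the node was available. This record
    shape is an assumption for illustration, not the FTA format.
    """
    uptimes, downtimes = [], []
    for spans in intervals.values():
        for (s0, e0), (s1, _) in zip(spans, spans[1:]):
            uptimes.append(e0 - s0)    # length of an availability span
            downtimes.append(s1 - e0)  # gap until the node returned
        if spans:                      # last span has no following gap
            uptimes.append(spans[-1][1] - spans[-1][0])
    return (mean(uptimes), mean(downtimes) if downtimes else 0.0)
```

For one node available during [0, 10) and [15, 30), this yields uptimes of 10 and 15 (MTBF 12.5) and a single repair gap of 5.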
Our on-going project, Unibus, aims to facilitate the provisioning and aggregation of multifaceted resources from both resource providers' and end-users' perspectives. To achieve that, Unibus proposes (1) the Capability Model and mediators (resource drivers) to virtualize access to diverse resources, and (2) soft and successive conditioning to enable automatic and user-transparent resource provisioning. In this paper we examine the Unibus concepts and prototype in a real scenario: the aggregation of two commercial clouds and the execution of benchmarks on the aggregated resources. We also present and discuss the benchmark results.
{"title":"Unibus-managed Execution of Scientific Applications on Aggregated Clouds","authors":"Jaroslaw Slawinski, M. Slawinska, V. Sunderam","doi":"10.1109/CCGRID.2010.53","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.53","url":null,"abstract":"Our on-going project, Unibus, aims to facilitate provisioning and aggregation of multifaceted resources from resource providers and end-users’ perspectives. To achieve that, Unibus proposes (1) the Capability Model and mediators (resource drivers) to virtualize access to diverse resources, and (2) soft and successive conditioning to enable automatic and user-transparent resource provisioning. In this paper we examine the Unibus concepts and prototype in a real situation of aggregation of two commercial clouds and execution of benchmarks on aggregated resources. We also present and discuss benchmarks’ results.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121160905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We are developing an efficient resource management system with aggressive virtual machine (VM) relocation among physical nodes in a data center. Existing live migration technology, however, requires a long time to change the execution host of a VM, making it difficult to dynamically optimize VM packing on physical nodes in response to ever-changing resource usage. In this paper, we propose an advanced live migration mechanism enabling instantaneous relocation of VMs. To minimize the time needed for switching the execution host, memory pages are transferred after a VM resumes at the destination host. A special character device driver allows transparent memory page retrieval from the source host for the VM running at the destination. In comparison with related work, the proposed mechanism supports guest operating systems without any modifications to them (i.e., no special device drivers or programs are needed in VMs). It is implemented as a lightweight extension to KVM (Kernel-based Virtual Machine), and no modifications to critical parts of the VMM code are required. Experiments were conducted using the SPECweb2005 benchmark. A running VM with heavily loaded web servers was successfully relocated to a destination host within one second. Temporary performance degradation after relocation was resolved by means of a precaching mechanism for memory pages. In addition, for memory-intensive workloads, our migration mechanism moved the entire state of a VM faster than existing migration technology.
{"title":"Enabling Instantaneous Relocation of Virtual Machines with a Lightweight VMM Extension","authors":"Takahiro Hirofuchi, H. Nakada, S. Itoh, S. Sekiguchi","doi":"10.1109/CCGRID.2010.42","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.42","url":null,"abstract":"We are developing an efficient resource management system with aggressive virtual machine (VM) relocation among physical nodes in a data center. Existing live migration technology, however, requires a long time to change the execution host of a VM, it is difficult to optimize VM packing on physical nodes dynamically, corresponding to ever-changing resource usage. In this paper, we propose an advanced live migration mechanism enabling instantaneous relocation of VMs. To minimize the time needed for switching the execution host, memory pages are transferred after a VM resumes at a destination host. A special character device driver allows transparent memory page retrievals from a source host for the running VM at the destination. In comparison with related work, the proposed mechanism supports guest operating systems without any modifications to them (i.e, no special device drivers and programs are needed in VMs). It is implemented as a lightweight extension to KVM (Kernel-based Virtual Machine Monitor). It is not required to modify critical parts of the VMM code. Experiments were conducted using the SPECweb2005 benchmark. A running VM with heavily-loaded web servers was successfully relocated to a destination within one second. Temporal performance degradation after relocation was resolved by means of a precaching mechanism for memory pages. 
In addition, for memory intensive workloads, our migration mechanism moved all the states of a VM faster than existing migration technology.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122403188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
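The core idea, resuming the VM first and transferring memory afterwards, can be sketched as a toy model. The page store, on-demand fetch, and precaching below are illustrative stand-ins for the paper's character-device mechanism, not its implementation:

```python
class PostCopyMemory:
    """Toy model of post-copy relocation: the VM resumes at the
    destination with an empty local page store, fetches pages from
    the source on first access, and a background precache loop pulls
    the remainder to avoid repeated on-demand stalls."""

    def __init__(self, source_pages):
        self.source = dict(source_pages)  # pages still on source host
        self.local = {}                   # pages present at destination

    def read(self, page_no):
        if page_no not in self.local:     # "page fault": fetch on demand
            self.local[page_no] = self.source.pop(page_no)
        return self.local[page_no]

    def precache(self, batch=2):
        """Proactively transfer up to `batch` remaining pages."""
        for page_no in list(self.source)[:batch]:
            self.local[page_no] = self.source.pop(page_no)
```

The host switch is instantaneous because only control state moves before resume; memory follows lazily, which is why precaching matters for the post-relocation dip the authors observed.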
We propose several algorithms for topology aggregation (TA) to effectively summarize large-scale networks. These TA techniques are shown to perform significantly better for path requests in e-Science, which may consist of simultaneous reservation of multiple paths and/or simultaneous reservation for multiple requests. Our extensive simulations demonstrate the benefits of our algorithms in terms of both accuracy and performance.
{"title":"Topology Aggregation for E-science Networks","authors":"Eun-Sung Jung, S. Ranka, S. Sahni","doi":"10.1109/CCGRID.2010.113","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.113","url":null,"abstract":"We propose several algorithms for topology aggregation (TA) to effectively summarize large-scale networks. These TA techniques are shown to significantly better for path requests in e-Science that may consist of simultaneous reservation of multiple paths and/or simultaneous reservation for multiple requests. Our extensive simulation demonstrates the benefits of our algorithms both in terms of accuracy and performance.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122547108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, improving the energy efficiency of high-performance PC clusters has become important. To reduce the energy consumption of the microprocessor, many high-performance microprocessors provide a Dynamic Voltage and Frequency Scaling (DVFS) mechanism. This paper proposes a new DVFS method called the Code-Instrumented Runtime (CI-Runtime) DVFS method, in which a combination of voltage and frequency, called a P-State, is managed by instrumented code at runtime. The proposed CI-Runtime DVFS method achieves better energy savings than the interrupt-based runtime DVFS method, since it selects the appropriate P-State in each defined region based on the characteristics of program execution. Moreover, the proposed CI-Runtime DVFS method is more practical than the static DVFS method, since it does not require exhaustive profiles for each P-State. The method consists of two parts. In the first part, the instrumented code is inserted by defining regions that have almost the same characteristics. The instrumented code must be inserted at appropriate points, because application performance degrades greatly if the instrumented code is called too many times in a short period. A method for automatically defining regions is proposed in this paper. The second part is the energy adaptation algorithm used at runtime. Two types of DVFS control algorithms, energy adaptation with estimated energy consumption and energy adaptation with only performance information, are compared. The proposed CI-Runtime DVFS method was implemented on a power-scalable PC cluster. The results show that CI-Runtime with energy adaptation using estimated energy consumption achieves an energy saving of 14.2%, which is close to the optimal value, without obtaining exhaustive profiles for every available P-State setting.
{"title":"Runtime Energy Adaptation with Low-Impact Instrumented Code in a Power-Scalable Cluster System","authors":"Hideaki Kimura, Takayuki Imada, M. Sato","doi":"10.1109/CCGRID.2010.70","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.70","url":null,"abstract":"Recently, improving the energy efficiency of high performance PC clusters has become important. In order to reduce the energy consumption of the microprocessor, many high performance microprocessors have a Dynamic Voltage and Frequency Scaling (DVFS) mechanism. This paper proposes a new DVFS method called the Code-Instrumented Runtime (CIRuntime) DVFS method, in which a combination of voltage and frequency, which is called a P-State, is managed in the instrumented code at runtime. The proposed CI-Runtime DVFS method achieves better energy saving than the Interrupt based Runtime DVFS method, since it selects the appropriate P-State in each defined region based on the characteristics of program execution. Moreover, the proposed CI-Runtime DVFS method is more useful than the Static DVFS method, since it does not acquire exhaustive profiles for each P-State. The method consists of two parts. In the first part of the proposed CI-Runtime DVFS method, the instrumented codes are inserted by defining regions that have almost the same characteristics. The instrumented code must be inserted at the appropriate point, because the performance of the application decreases greatly if the instrumented code is called too many times in a short period. A method for automatically defining regions is proposed in this paper. The second part of the proposed method is the energy adaptation algorithm which is used at runtime. Two types of DVFS control algorithms energy adaptation with estimated energy consumption and energy adaptation with only performance information, are compared. The proposed CIRuntime DVFS method was implemented on a power-scalable PC cluster. 
The results show that the proposed CI-Runtime with energy adaptation using estimated energy consumption could achieve an energy saving of 14.2% which is close to the optimal value, without obtaining exhaustive profiles for every available P-State setting.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117108626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
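Per-region P-State selection can be illustrated with a common analytic energy model in which only the CPU-bound fraction of a region's runtime stretches as frequency drops. This estimator is a hypothetical stand-in, not necessarily the one the paper uses:

```python
def pick_pstate(region_profile, pstates):
    """Choose the P-State minimizing estimated energy for one
    instrumented region.

    `region_profile` is (base_runtime_at_fmax, cpu_bound_fraction);
    each P-State is (freq_ghz, power_watts). Both shapes are
    illustrative assumptions."""
    base_t, cpu_fraction = region_profile
    f_max = max(f for f, _ in pstates)
    best = None
    for f, p in pstates:
        # CPU-bound work stretches as f drops; memory-bound work does not
        t = base_t * (cpu_fraction * f_max / f + (1 - cpu_fraction))
        energy = p * t
        if best is None or energy < best[0]:
            best = (energy, f)
    return best[1]
```

Under this model a memory-bound region (cpu_fraction 0.2) is driven to the low-power P-State, while a fully CPU-bound region stays at top frequency because the runtime stretch outweighs the power saving.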
Key/value stores built on structured overlay networks often lack support for atomic transactions and strong data consistency among replicas. This is unfortunate, because consistency guarantees and transactions would allow a wide range of additional application domains to benefit from the inherent scalability and fault tolerance of DHTs. The Scalaris key/value store supports strong data consistency and atomic transactions. It uses an enhanced Paxos Commit protocol with only four communication steps rather than six. This improvement was made possible by exploiting information about the replica distribution in the DHT. Scalaris enables the implementation of more reliable and scalable infrastructures for collaborative Web services that require strong consistency and atomic changes across multiple items.
{"title":"Enhanced Paxos Commit for Transactions on DHTs","authors":"F. Schintke, A. Reinefeld, Seif Haridi, T. Schütt","doi":"10.1109/CCGRID.2010.41","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.41","url":null,"abstract":"Key/value stores which are built on structured overlay networks often lack support for atomic transactions and strong data consistency among replicas. This is unfortunate, because consistency guarantees and transactions would allow a wide range of additional application domains to benefit from the inherent scalability and fault-tolerance of DHTs. The Scalaris key/value store supports strong data consistency and atomic transactions. It uses an enhanced Paxos Commit protocol with only four communication steps rather than six. This improvement was possible by exploiting information from the replica distribution in the DHT. Scalaris enables implementation of more reliable and scalable infrastructure for collaborative Web services that require strong consistency and atomic changes across multiple items.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"308 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115551565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although shared-memory programming models offer good programmability compared to message-passing programming models, their implementation by page-based software distributed shared memory systems usually suffers from high memory consistency costs. The major part of these costs is inter-node data transfer for keeping the virtual shared memory consistent. A good prefetch strategy can reduce this cost. We develop two prefetch techniques, TReP and HReP, which are based on the execution history of each parallel region. These techniques are evaluated using offline simulations with the NAS Parallel Benchmarks and the LINPACK benchmark. On average, TReP achieves an efficiency (the ratio of prefetched pages that were subsequently accessed) of 96% and a coverage (the ratio of access faults avoided by prefetches) of 65%. HReP achieves an efficiency of 91% but a coverage of 79%. Treating the cost of an incorrectly prefetched page as equivalent to that of a miss, these techniques have effective page miss rates of 63% and 71%, respectively. Additionally, these two techniques are compared with two well-known software distributed shared memory (sDSM) prefetch techniques, Adaptive++ and TODFCM. TReP effectively reduces the page miss rate by 53% and 34% more, and HReP by 62% and 43% more, compared to Adaptive++ and TODFCM, respectively. Like Adaptive++, these techniques also permit bulk prefetching for pages predicted using temporal locality, amortizing network communication costs and permitting bandwidth improvements from multi-rail network interfaces.
{"title":"Region-Based Prefetch Techniques for Software Distributed Shared Memory Systems","authors":"Jie Cai, P. Strazdins, Alistair P. Rendell","doi":"10.1109/CCGRID.2010.16","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.16","url":null,"abstract":"Although shared memory programming models show good programmability compared to message passing programming models, their implementation by page-based software distributed shared memory systems usually suffers from high memory consistency costs. The major part of these costs is inter-node data transfer for keeping virtual shared memory consistent. A good prefetch strategy can reduce this cost. We develop two prefetch techniques, TReP and HReP, which are based on the execution history of each parallel region. These techniques are evaluated using offline simulations with the NAS Parallel Benchmarks and the LINPACK benchmark. On average, TReP achieves an efficiency (ratio of pages prefetched that were subsequently accessed) of 96% and a coverage (ratio of access faults avoided by prefetches) of 65%. HReP achieves an efficiency of 91% but has a coverage of 79%. Treating the cost of an incorrectly prefetched page to be equivalent to that of a miss, these techniques have an effective page miss rate of 63% and 71% respectively. Additionally, these two techniques are compared with two well-known software distributed shared memory (sDSM) prefetch techniques, Adaptive++ and TODFCM. TReP effectively reduces page miss rate by 53% and 34% more, and HReP effectively reduces page miss rate by 62% and 43% more, compared to Adaptive++ and TODFCM respectively. 
As for Adaptive++, these techniques also permit bulk prefetching for pages predicted using temporal locality, amortizing network communication costs and permitting bandwidth improvement from multi-rail network interfaces.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115945758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
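The efficiency and coverage metrics quoted above have straightforward set-based definitions, sketched here for concreteness:

```python
def prefetch_metrics(prefetched, accessed):
    """Efficiency = fraction of prefetched pages later accessed;
    coverage = fraction of would-be page faults avoided by a
    prefetch. These are the two metrics quoted in the abstract."""
    prefetched, accessed = set(prefetched), set(accessed)
    useful = prefetched & accessed
    efficiency = len(useful) / len(prefetched) if prefetched else 0.0
    coverage = len(useful) / len(accessed) if accessed else 0.0
    return efficiency, coverage
```

Prefetching pages {1, 2, 3, 4} against actual accesses {1, 2, 3, 5, 6} gives an efficiency of 0.75 (one wasted transfer) and a coverage of 0.6 (two faults not avoided), showing how the two metrics pull in different directions.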
In the last decade or so, clusters have seen a tremendous rise in popularity due to their excellent price-to-performance ratio. A variety of interconnects have been proposed during this period, with InfiniBand leading the way due to its high performance and open standard. At the same time, multiple programming models have emerged to meet the requirements of various applications. To support these requirements, InfiniBand provides multiple transport semantics, ranging from unreliable connectionless to reliable connected. Among them, the reliable connection (RC) semantics is widely used due to its high performance and support for features like Remote Direct Memory Access (RDMA), hardware atomics, and network fault tolerance. However, the pairwise, connection-oriented nature of the RC transport limits its scalability and usability at increasing processor counts. In this paper, we design and implement on-demand connection management approaches in the context of Partitioned Global Address Space (PGAS) programming models, which provide a shared-memory abstraction and one-sided communication semantics and have led to the development of multiple languages (UPC, X10, Chapel) and libraries (Global Arrays, MPI-RMA). Using Global Arrays as the research vehicle, we implement this approach in the Aggregate Remote Memory Copy Interface (ARMCI), the runtime system of Global Arrays. We evaluate our approach, ARMCI On-Demand Connection Management (ARMCI-ODCM), using various micro-benchmarks, benchmarks (LU factorization, Random Access, and Lennard-Jones simulation), and an application (Subsurface Transport Over Multiple Phases, STOMP). With a performance evaluation on up to 4096 processors, we achieve a multi-fold reduction in connection memory with negligible performance degradation. With STOMP on 4096 processors, ARMCI-ODCM reduces overall connection memory by a factor of 66 with no performance degradation. To the best of our knowledge, this is the first design, implementation, and evaluation of on-demand connection management with InfiniBand using PGAS models.
{"title":"Efficient On-Demand Connection Management Mechanisms with PGAS Models over InfiniBand","authors":"Abhinav Vishnu, M. Krishnan","doi":"10.1109/CCGRID.2010.58","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.58","url":null,"abstract":"In the last decade or so, clusters have observed a tremendous rise in popularity due to the excellent price to performance ratio. A variety of Interconnects have been proposed during this period, with InfiniBand leading the way due to its high performance and open standard. At the same time, multiple programming models have emerged in order to meet the requirements of various applications and their programming models. To support requirements of multiple programming models, InfiniBand provides multiple transport semantics, ranging from unreliable connectionless to reliable connected characteristics. Among them, the reliable connection (RC) semantics is being widely used due to its high performance and support for novel features like Remote Direct Memory Acesss (RDMA), hardware atomics and Network Fault Tolerance. However, the pair wise connection oriented nature of the RC transport semantics limits its scalability and usage at the increasing processor counts. In this paper, we design and implement on-demand connection management approaches in the context of Partitioned Global Address Space (PGAS) programming models, which provided shared memory abstraction and one-sided communication semantics, leading to the development of multiple languages (UPC, X10, Chapel) and libraries (Global Arrays, MPI-RMA). Using Global Arrays as the research vehicle, we implement this approach with Aggregate Remote Memory Copy Interface (ARMCI), the runtime system of Global Arrays. We evaluate our approach, ARMCI-On Demand Connection Management (ARMCI-ODCM) using various micro benchmarks and benchmarks (LU Factorization, Random-Access and Lennard Jones simulation) and application (Subsurface transport over multiple phases (STOMP)). 
With the performance evaluation for up to 4096 processors, we are able to have a multi-fold reduction in connection memory with a negligible degradation in performance. Using STOMP at 4096 processors, reduces the overall connection memory by 66 times with no performance degradation. To the best of our knowledge, this is the first design, implementation and evaluation of on-demand connection management with InfiniBand using PGAS models.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126636427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
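The essence of on-demand connection management is to defer connection setup until two processes actually communicate, so connection memory scales with the communication graph rather than with the full N x N mesh. A minimal sketch, with a hypothetical connect function standing in for RC queue-pair setup:

```python
class OnDemandConnections:
    """Lazily open a connection to a peer on first communication
    instead of eagerly building the all-pairs mesh. `connect_fn`
    is a hypothetical stand-in for expensive transport setup
    (e.g. creating an RC queue pair)."""

    def __init__(self, connect_fn):
        self.connect_fn = connect_fn
        self.conns = {}               # peer -> established connection

    def send(self, peer, payload):
        conn = self.conns.get(peer)
        if conn is None:              # first message to this peer
            conn = self.conns[peer] = self.connect_fn(peer)
        return conn                   # a real runtime would transmit payload
```

After any number of sends, `len(self.conns)` equals the number of distinct peers actually contacted, which is the memory saving the paper quantifies (66x for STOMP at 4096 processors).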
Grid users may experience inconsistent performance due to specific characteristics of grids, such as fluctuating workloads, high failure rates, and high resource heterogeneity. Although extensive research has been done on grids, providing consistent performance remains largely an unsolved problem. In this study we use overdimensioning, a simple but cost-ineffective solution, to address the performance inconsistency problem in grids. To this end, we propose several overdimensioning strategies, and we evaluate these strategies through simulations with workloads consisting of Bags-of-Tasks. We find that although overdimensioning is a simple solution, it is a viable one for providing consistent performance in grids.
{"title":"Overdimensioning for Consistent Performance in Grids","authors":"N. Yigitbasi, D. Epema","doi":"10.1109/CCGRID.2010.44","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.44","url":null,"abstract":"Grid users may experience inconsistent performance due to specific characteristics of grids, such as fluctuating workloads, high failure rates, and high resource heterogeneity. Although extensive research has been done in grids, providing consistent performance remains largely an unsolved problem. In this study we use overdimensioning, a simple but cost-ineffective solution, to solve the performance inconsistency problem in grids. To this end, we propose several overdimensioning strategies, and we evaluate these strategies through simulations with workloads consisting of Bag-of-Tasks. We find that although overdimensioning is a simple solution, it is a viable solution to provide consistent performance in grids.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128154372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is a critical need to develop new programming paradigms for grid middleware tools and applications to harness the opportunities presented by emerging multi-core processors. Implementations of grid middleware and applications that do not adapt their programming paradigm when executing on these processors can severely impact overall performance. We focus on the utilization of the L2 cache, a critical shared resource on chip multiprocessors. The access pattern of the shared L2 cache, which depends on how the application schedules and assigns processing work to each thread, can either enhance or undermine the ability to hide memory latency on a multi-core processor. None of the current grid simulators and emulators provides the feedback and fine-grained performance data that are essential for a detailed analysis. Using the feedback from an emulation framework, we present a performance analysis and provide recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of grid applications that requires processing of large-scale XML data. In particular, we discuss the gains associated with our adaptations of the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time.
{"title":"Cache Performance Optimization for Processing XML-Based Application Data on Multi-core Processors","authors":"Rajdeep Bhowmik, M. Govindaraju","doi":"10.1109/CCGRID.2010.122","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.122","url":null,"abstract":"There is a critical need to develop new programming paradigms for grid middleware tools and applications to harness the opportunities presented by emerging multi-core processors. Implementations of grid middleware and applications that do not adapt to the programming paradigm when executing on emerging processors can severely impact the overall performance. We focus on the utilization of the L2 cache, which is a critical shared resource on Chip Multiprocessors. The access pattern of the shared L2 cache, which is dependent on how the application schedules and assigns processing work to each thread, can either enhance or undermine the ability to hide memory latency on a multi-core processor. None of the current grid simulators and emulators provides feedback and fine-grained performance data that is essential for a detailed analysis. Using the feedback from an emulation framework, we present performance analysis and provide recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of grid applications that requires processing of large-scale XML data. 
In particular, we discuss the gains associated with the use of the adaptations we have made to the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123986294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
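Cache-affinity scheduling, in its simplest form, reassigns a thread to the core where it last ran so its working set may still be warm in the L2 cache. The sketch below shows only that base idea; the paper's adaptations of Cache-Affinity and Balanced-Set scheduling are not detailed in the abstract:

```python
def schedule(tasks, cores, last_core):
    """Assign each task to the core it last ran on when that core is
    free (warm cache), falling back to any idle core otherwise.
    `last_core` maps task -> core of its previous run (may be absent
    for new tasks). Ties are broken by lowest core id for determinism."""
    free = set(cores)
    placement, deferred = {}, []
    for t in tasks:
        c = last_core.get(t)
        if c in free:
            placement[t] = c          # reuse the warm L2 cache
            free.discard(c)
        else:
            deferred.append(t)        # affinity core busy or unknown
    for t in deferred:                # cold placement on idle cores
        if free:
            c = min(free)
            placement[t] = c
            free.discard(c)
    return placement
```

With tasks a, b both affine to core 1, only a keeps its warm core; b and the new task c are placed cold, which is the situation where a balanced-set style policy would instead group tasks by working-set size.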