We propose Cloud-based machine learning tools for enhanced Big Data applications. The main idea is to predict the "next" workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combines the effectiveness of several well-known classifiers in order to improve the overall accuracy of the final classification, a goal that is highly relevant in the current Big Data context. The so-called workload categorization problem plays a critical role in improving the efficiency and reliability of Cloud-based big data applications. Implementation-wise, our method deploys the Cloud entities that participate in the distributed classification approach on top of virtual machines, which represent classical "commodity" settings for Cloud-based big data applications. A preliminary experimental assessment and analysis clearly confirms the benefits deriving from our classification framework.
{"title":"Cloud-Based Machine Learning Tools for Enhanced Big Data Applications","authors":"A. Cuzzocrea, E. Mumolo, P. Corona","doi":"10.1109/CCGrid.2015.170","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.170","url":null,"abstract":"We propose Cloud-based machine learning tools for enhanced Big Data applications, where the main idea is that of predicting the \"next\" workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combine the effectiveness of different well-known classifiers in order to enhance the whole accuracy of the final classification, which is very relevant at now in the specific context of Big Data. So-called workload categorization problem plays a critical role towards improving the efficiency and the reliability of Cloud-based big data applications. Implementation-wise, our method proposes deploying Cloud entities that participate to the distributed classification approach on top of virtual machines, which represent classical \"commodity\" settings for Cloud-based big data applications. Preliminary experimental assessment and analysis clearly confirm the benefits deriving from our classification framework.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"60 1","pages":"908-914"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78854926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. Kulikov, I. Chernykh, B. Glinsky, D. Weins, A. Shmelev
The AstroPhi code is designed to simulate the dynamics of astrophysical objects on hybrid supercomputers equipped with Intel Xeon Phi accelerators. The new RSC PetaStream massively parallel architecture is used for the simulations. This paper presents acceleration results for the AstroPhi code in the Intel Xeon Phi native and offload execution modes. The RSC PetaStream architecture makes it possible to simulate astrophysical problems at high resolution. The AGNES simulation tool was used to study the scalability of the AstroPhi code. Several gravitational collapse problems are presented as a demonstration of the AstroPhi code.
{"title":"Astrophysics Simulation on RSC Massively Parallel Architecture","authors":"I. Kulikov, I. Chernykh, B. Glinsky, D. Weins, A. Shmelev","doi":"10.1109/CCGrid.2015.102","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.102","url":null,"abstract":"AstroPhi code is designed for simulation of astrophysical objects dynamics on hybrid supercomputers equipped with Intel Xenon Phi computation accelerators. New RSC PetaStream massively parallel architecture used for simulation. The results of AstroPhi acceleration for Intel Xeon Phi native and offload execution modes are presented in this paper. RSC PetaStream architecture gives possibility of astrophysical problems simulation in high resolution. AGNES simulation tool was used for scalability simulation of AstroPhi code. The are some gravitational collapse problems presented as demonstration of AstroPhi code.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"25 1","pages":"1131-1134"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77476869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Zhang, Xiaoyi Lu, Mark Daniel Arnold, D. Panda
Cloud Computing with Virtualization offers attractive flexibility and elasticity to deliver resources by providing a platform for consolidating complex IT resources in a scalable manner. However, efficiently running HPC applications on Cloud Computing systems is still full of challenges. One of the biggest hurdles in building efficient HPC clouds is the unsatisfactory performance offered by the underlying virtualized environments, more specifically, virtualized I/O devices. Recently, Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high-performance interconnects such as InfiniBand and 10GigE. Due to its near-native performance for inter-node communication, many cloud systems such as Amazon EC2 have been using SR-IOV in their production environments. Nevertheless, recent studies have shown that the SR-IOV scheme lacks locality-aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. In this paper, we propose an efficient approach to build HPC clouds based on MVAPICH2 over OpenStack with SR-IOV. We first propose an extension to the OpenStack Nova system to enable the IVShmem channel in deployed virtual machines. We further present and discuss our high-performance design of a virtual-machine-aware MVAPICH2 library for OpenStack-based HPC clouds. Our design can fully take advantage of high-performance SR-IOV communication for inter-node communication as well as Inter-VM Shmem (IVShmem) for intra-node communication. A comprehensive performance evaluation with micro-benchmarks and HPC applications has been conducted on an experimental OpenStack-based HPC cloud and on Amazon EC2. The evaluation results on the experimental HPC cloud show that our design and extension can deliver near bare-metal performance for SR-IOV-based HPC clouds with virtualization. Further, compared with the performance on EC2, our experimental HPC cloud exhibits up to 160X, 65X, and 12X improvement potential in point-to-point, collective, and application performance, respectively, for future HPC clouds.
{"title":"MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds","authors":"Jie Zhang, Xiaoyi Lu, Mark Daniel Arnold, D. Panda","doi":"10.1109/CCGrid.2015.166","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.166","url":null,"abstract":"Cloud Computing with Virtualization offers attractive flexibility and elasticity to deliver resources by providing a platform for consolidating complex IT resources in a scalable manner. However, efficiently running HPC applications on Cloud Computing systems is still full of challenges. One of the biggest hurdles in building efficient HPC clouds is the unsatisfactory performance offered by underlying virtualized environments, more specifically, virtualized I/O devices. Recently, Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high-performance interconnects such as InfiniBand and 10GigE. Due to its near native performance for inter-node communication, many cloud systems such as Amazon EC2 have been using SR-IOV in their production environments. Nevertheless, recent studies have shown that the SR-IOV scheme lacks locality aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. In this paper, we propose an efficient approach to build HPC clouds based on MVAPICH2 over Open Stack with SR-IOV. We first propose an extension for Open Stack Nova system to enable the IV Shmem channel in deployed virtual machines. We further present and discuss our high-performance design of virtual machine aware MVAPICH2 library over Open Stack-based HPC Clouds. Our design can fully take advantage of high-performance SR-IOV communication for inter-node communication as well as Inter-VM Shmem (IVShmem) for intra-node communication. A comprehensive performance evaluation with micro-benchmarks and HPC applications has been conducted on an experimental Open Stack-based HPC cloud and Amazon EC2. The evaluation results on the experimental HPC cloud show that our design and extension can deliver near bare-metal performance for implementing SR-IOV-based HPC clouds with virtualization. Further, compared with the performance on EC2, our experimental HPC cloud can exhibit up to 160X, 65X, 12X improvement potential in terms of point-to-point, collective and application for future HPC clouds.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"31 1","pages":"71-80"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79333798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Today's big-data analysis systems achieve performance and scalability by requiring end users to embrace a novel programming model. This approach is highly effective when the objective is to compute relatively simple functions on colossal amounts of data, but it is not a good match for a scientific computing environment that depends on complex applications written for the conventional POSIX environment. To address this gap, we introduce Confuga, a scalable data-intensive computing system that is largely compatible with the POSIX environment. Confuga brings together the workflow model of scientific computing with the storage architecture of other big data systems. Confuga accepts large workflows of standard POSIX applications arranged into graphs and then executes them in a cluster, exploiting both parallelism and data locality. By making use of the workload structure, Confuga is able to avoid the long-standing problems of metadata scalability and load instability found in many large-scale computing and storage systems. We show that Confuga's approach to load control offers improvements of up to 228% in cluster network utilization and 23% reductions in workflow execution time.
{"title":"Confuga: Scalable Data Intensive Computing for POSIX Workflows","authors":"P. Donnelly, Nicholas L. Hazekamp, D. Thain","doi":"10.1109/CCGrid.2015.95","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.95","url":null,"abstract":"Today's big-data analysis systems achieve performance and scalability by requiring end users to embrace a novel programming model. This approach is highly effective whose the objective is to compute relatively simple functions on colossal amounts of data, but it is not a good match for a scientific computing environment which depends on complex applications written for the conventional POSIX environment. To address this gap, we introduce Conjugal, a scalable data-intensive computing system that is largely compatible with the POSIX environment. Conjugal brings together the workflow model of scientific computing with the storage architecture of other big data systems. Conjugal accepts large workflows of standard POSIX applications arranged into graphs, and then executes them in a cluster, exploiting both parallelism and data-locality. By making use of the workload structure, Conjugal is able to avoid the long-standing problems of metadata scalability and load instability found in many large scale computing and storage systems. We show that CompUSA's approach to load control offers improvements of up to 228% in cluster network utilization and 23% reductions in workflow execution time.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"392-401"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75366089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Gomez-Folgar, A. García-Loureiro, T. F. Pena, J. I. Zablah, N. Seoane
Nowadays, there are several open-source solutions for building private, public, and even hybrid clouds, such as Eucalyptus, Apache CloudStack, and OpenStack. KVM is one of the supported hypervisors for these cloud platforms. Different KVM configurations are supplied by these platforms and, in some cases, only a subset of CPU features is presented to guest systems, providing a basic abstraction of the underlying CPU. One of the reasons for limiting the features of the virtual CPU is to guarantee guest compatibility with different hardware in heterogeneous environments. However, in a large number of situations, the cloud is deployed on a homogeneous set of hosts. In these cases, this limitation can affect the performance of applications executed in guest systems. In this paper, we analyze the architecture, the KVM setup, and the performance of the virtual machines deployed by three popular cloud management platforms (Eucalyptus, Apache CloudStack, and OpenStack), employing a representative set of applications.
{"title":"Study of the KVM CPU Performance of Open-Source Cloud Management Platforms","authors":"F. Gomez-Folgar, A. García-Loureiro, T. F. Pena, J. I. Zablah, N. Seoane","doi":"10.1109/CCGrid.2015.103","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.103","url":null,"abstract":"Nowadays, there are several open-source solutions for building private, public and even hybrid clouds such as Eucalyptus, Apache Cloud Stack and Open Stack. KVM is one of the supported hypervisors for these cloud platforms. Different KVM configurations are being supplied by these platforms and, in some cases, a subset of CPU features are being presented to guest systems, providing a basic abstraction of the underlying CPU. One of the reasons for limiting the features of the Virtual CPU is to guarantee the guest compatibility with different hardware in heterogeneous environments. However, in a large number of situations, the cloud is deployed on an homogeneous set of hosts. In these cases, this limitation can affect the performance of applications being executed in guest systems. In this paper, we have analyzed the architecture, the KVM setup, and the performance of the Virtual Machines deployed by three popular cloud management platforms: Eucalyptus, Apache Cloud Stack and Open Stack, employing a representative set of applications.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"12 2","pages":"1225-1228"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72614819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Rajachandrasekar, Akshay Venkatesh, Khaled Hamidouche, D. Panda
Checkpoint-restart is the predominant reactive fault-tolerance mechanism for applications running on HPC systems. While innumerable studies in the literature have analyzed, and optimized for, the performance and scalability of a variety of checkpointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during checkpointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is important not only to understand the behavior of checkpointing systems under such an envelope but also to adopt techniques that can leverage power-capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits and significant performance degradation caused by naive application of power capping around checkpointing phases by proposing a novel power-aware checkpointing framework, Power-Check. Through data-funneling mechanisms and selective core power capping, Power-Check makes efficient use of the I/O and CPU subsystems. Evaluations with application kernels show that Power-Check can yield as much as a 48% reduction in the energy consumed during a checkpoint, while improving checkpointing performance by 14%.
{"title":"Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters","authors":"R. Rajachandrasekar, Akshay Venkatesh, Khaled Hamidouche, D. Panda","doi":"10.1109/CCGrid.2015.169","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.169","url":null,"abstract":"Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in literature that have analyzed, and optimized for, the performance and scalability of a variety of check pointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during check pointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is not only important to understand the behavior of check pointing systems under such an envelope but to also adopt techniques that can leverage power capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around check pointing phases by proposing a novel power-aware check pointing framework -- Power-Check. By use of data funnelling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystem. Evaluations with application kernels show that Power-Check can yield as much as 48% reduction in the amount of energy consumed during a checkpoint, while improving the check pointing performance by 14%.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"256 1","pages":"261-270"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73125767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many-core architectures provide a massively parallel environment with dozens of cores and hundreds of hardware threads. Scientific application programmers are increasingly looking at ways to utilize such large numbers of lightweight cores for various programming models. Efficiently executing these models on massively parallel many-core environments is not easy, however, and performance may be degraded in various ways. The first author's doctoral research focuses on exploiting the capabilities of many-core architectures in widely used MPI implementations. While application programmers have studied several approaches to achieve better parallelism and resource sharing, many of those approaches still face communication problems that degrade performance. In the thesis, we investigate the characteristics of MPI on such massively threaded architectures and propose two efficient strategies, a multi-threaded MPI approach and a process-based asynchronous model, to optimize MPI communication for modern scientific applications.
{"title":"Techniques for Enabling Highly Efficient Message Passing on Many-Core Architectures","authors":"Min Si, P. Balaji, Y. Ishikawa","doi":"10.1109/CCGrid.2015.68","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.68","url":null,"abstract":"Many-core architecture provides a massively parallel environment with dozens of cores and hundreds of hardware threads. Scientific application programmers are increasingly looking at ways to utilize such large numbers of lightweight cores for various programming models. Efficiently executing these models on massively parallel many-core environments is not easy, however and performance may be degraded in various ways. The first author's doctoral research focuses on exploiting the capabilities of many-core architectures on widely used MPI implementations. While application programmers have studied several approaches to achieve better parallelism and resource sharing, many of those approaches still face communication problems that degrade performance. In the thesis, we investigate the characteristics of MPI on such massively threaded architectures and propose two efficient strategies -- a multi-threaded MPI approach and a process-based asynchronous model -- to optimize MPI communication for modern scientific applications.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"43 1","pages":"697-700"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86551749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Entity resolution is a basic operation of data quality management and a key step in extracting value from data. Parallel data processing frameworks based on MapReduce can deal with the challenges brought by big data. However, two important issues remain: avoiding the redundant pairs introduced by multi-pass blocking methods, and optimizing candidate pairs based on the transitive relations of similarity. In this paper, we propose a multi-signature-based parallel entity resolution method, called Multi-Sig-ER, which supports both unstructured and structured data. Two redundancy elimination strategies are adopted to prune the candidate pairs and reduce the number of similarity computations without affecting the resolution accuracy. Experimental results on real-world datasets show that our method scales to large datasets and is more suitable for complex similarity computation than simple object matching.
{"title":"Eliminating the Redundancy in MapReduce-Based Entity Resolution","authors":"Cairong Yan, Yalong Song, Jian Wang, Wenjing Guo","doi":"10.1109/CCGrid.2015.24","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.24","url":null,"abstract":"Entity resolution is the basic operation of data quality management, and the key step to find the value of data. The parallel data processing framework based on MapReduce can deal with the challenge brought by big data. However, there exist two important issues, avoiding redundant pairs led by the multi-pass blocking method and optimizing candidate pairs based on the transitive relations of similarity. In this paper, we propose a multi-signature based parallel entity resolution method, called multi-sig-er, which supports unstructured data and structured data. Two redundancy elimination strategies are adopted to prune the candidate pairs and reduce the number of similarity computation without affecting the resolution accuracy. Experimental results on real-world datasets show that our method tends to handle large datasets and it is more suitable for complex similarity computation than simple object matching.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"6 1","pages":"1233-1236"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90429897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Portals 4 network specification is a low-level API for high-performance networks developed by Sandia National Laboratories, Intel Corporation, and the University of New Mexico. Portals 4 is specifically designed to support both the MPI and PGAS programming models efficiently by providing building blocks upon which to implement their particular features. In this paper we discuss our ongoing efforts to add efficient and robust support for Portals 4 networks inside MPICH, and we describe how the API semantics influenced our design. In particular, we found the lack of reliability guarantees from the Portals 4 layer challenging to address. To tackle this situation, we implemented an intermediate layer, Rportals (reliable Portals), which modularizes the reliability functionality within our Portals network module for MPICH. We present the Rportals design and its performance impact.
{"title":"Toward Implementing Robust Support for Portals 4 Networks in MPICH","authors":"Kenneth Raffenetti, Antonio J. Peña, P. Balaji","doi":"10.1109/CCGrid.2015.79","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.79","url":null,"abstract":"The Portals 4 network specification is a low-levelAPI for high-performance networks developed by Sandia National Laboratories, Intel Corporation, and the University of NewMexico. Portals 4 is specifically designed to support both the MPIand PGAS programming models efficiently by providing building blocks upon which to implement their particular features. In this paper we discuss our ongoing efforts to add efficient and robust support for Portals 4 networks inside MPICH, and we describe how the API semantics influenced our design. In particular, we found the lack of reliability guarantees from the Portals4 layer challenging to address. To tackle this situation, we implemented an intermediate layer - Rportals (reliable Portals), which modularizes the reliability functionality within our Portals network module for MPICH. In this paper we present theRportals design and its performance impact.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"76 1","pages":"1173-1176"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90587044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Partitioned Global Address Space (PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers from handling the details of data movement in a distributed-memory environment by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires address mapping to the system's virtual address space, and as such, this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyze the extent of this overhead by implementing a microbenchmark that tests the different types of memory accesses that can be observed in Chapel. We show that, as locality is exploited, speedup gains of up to 35x can be achieved. This was demonstrated through hand tuning, however; more productive means should be provided to deliver such performance improvements without excessively burdening programmers. Therefore, we also discuss possibilities for increasing Chapel's performance through standard libraries, compiler, runtime, and/or hardware support to handle different types of memory accesses more efficiently.
{"title":"Assessing Memory Access Performance of Chapel through Synthetic Benchmarks","authors":"Engin Kayraklioglu, T. El-Ghazawi","doi":"10.1109/CCGrid.2015.157","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.157","url":null,"abstract":"The Partitioned Global Address Space(PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers from handling details of data movement in a distributed memory environment, by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires address mapping to the system virtual address space, and as such, this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyzed the extent of this overhead by implementing a micro benchmark to test different types of memory accesses that can be observed in Chapel. We showed that, as the locality gets exploited speedup gains up to 35x can be achieved. This was demonstrated through hand tuning, however. More productive means should be provided to deliver such performance improvement without excessively burdening programmers. Therefore, we also discuss possibilities to increase Chapel's performance through standard libraries, compiler, runtime and/or hardware support to handle different types of memory accesses more efficiently.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"7 1","pages":"1147-1150"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78436529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}