
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing — Latest Publications

Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.48
Min Si, Antonio J. Peña, J. Hammond, P. Balaji, Y. Ishikawa
NWChem is one of the most widely used computational chemistry application suites for chemical and biological systems. Despite its vast success, the computational efficiency of NWChem is still low. This is especially true in higher accuracy methods such as the CCSD(T) coupled cluster method, where it currently achieves a mere 50% computational efficiency when run at large scales. In this paper, we demonstrate the most computationally efficient scaling of NWChem CCSD(T) to date, and use it to solve large water clusters. We use our recently proposed process-based asynchronous progress framework for MPI RMA, called Casper, to scale the computation on water clusters at near-100% computational efficiency on up to 12288 cores.
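For readers unfamiliar with the communication pattern at issue, the sketch below shows a minimal MPI RMA passive-target put in C. It is an illustration of the programming model only, not NWChem or Casper code; Casper's point is that passive-target operations like this make progress without any change to the application.

```c
/* Minimal MPI RMA passive-target example (illustrative sketch only; the
 * actual NWChem/Casper code paths are far more involved). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf;
    MPI_Win win;
    /* Each process exposes one double through an RMA window. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1.0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* Passive-target access: the target takes no explicit action,
         * which is exactly where asynchronous progress matters. */
        double val = 42.0;
        int target = size - 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == size - 1)
        printf("rank %d received %.1f\n", rank, *buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```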
{"title":"Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA","authors":"Min Si, Antonio J. Peña, J. Hammond, P. Balaji, Y. Ishikawa","doi":"10.1109/CCGrid.2015.48","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.48","url":null,"abstract":"NWChem is one of the most widely used computational chemistry application suites for chemical and biological systems. Despite its vast success, the computational efficiency of NWChem is still low. This is especially true in higher accuracy methods such as the CCSD(T) coupled cluster method, where it currently achieves a mere 50% computational efficiency when run at large scales. In this paper, we demonstrate the most computationally efficient scaling of NWChem CCSD(T) to date, and use it to solve large water clusters. We use our recently proposed process-based asynchronous progress framework for MPI RMA, called Casper, to scale the computation on water clusters at near-100% computational efficiency on up to 12288 cores.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"3 1","pages":"811-816"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90186568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Taming Latency in Data Center Networking with Erasure Coded Files
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.142
Yu Xiang, V. Aggarwal, Y. Chen, Tian Lan
This paper proposes an approach to minimize service latency in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to limited bandwidth available at both top-of-the-rack and aggregation switches, network bandwidth must be apportioned among different intra- and inter-rack data flows in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form outer bound of service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint latency optimization over three entangled "control knobs": the bandwidth allocation at top-of-the-rack and aggregation switches, the probabilities for scheduling file requests, and the placement of encoded file chunks, which affects data locality. The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm that decouples the joint optimization into three sub-problems, each either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source distributed file system, Tahoe, and evaluated on a cloud testbed with 16 separate physical hosts in an OpenStack cluster. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insight on designing low-latency data center networks with erasure-coded storage.
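As a rough illustration of the "probabilities for scheduling file requests" knob, the C sketch below draws k of n storage nodes for an erasure-coded read according to configurable weights. The weights and the (7,4) code are hypothetical placeholders; the paper's closed-form latency bound and optimization are not reproduced here.

```c
/* Hedged sketch of probabilistic request scheduling for (n,k)
 * erasure-coded reads: pick k distinct nodes with given probabilities. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Draw k distinct node indices from n, proportional to weights w[]. */
void sample_nodes(int n, int k, const double *w, int *chosen) {
    double *tmp = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) tmp[i] = w[i];
    for (int j = 0; j < k; j++) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += tmp[i];
        double r = (double)rand() / RAND_MAX * total;
        int pick = 0;
        for (double acc = tmp[0]; acc < r && pick < n - 1; acc += tmp[++pick])
            ;
        chosen[j] = pick;
        tmp[pick] = 0.0;  /* sample without replacement */
    }
    free(tmp);
}

int main(void) {
    srand((unsigned)time(NULL));
    /* Hypothetical (7,4) code: weights favor lightly loaded racks. */
    double pi[7] = {0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.1};
    int chosen[4];
    sample_nodes(7, 4, pi, chosen);
    printf("read chunks from nodes:");
    for (int j = 0; j < 4; j++) printf(" %d", chosen[j]);
    printf("\n");
    return 0;
}
```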
{"title":"Taming Latency in Data Center Networking with Erasure Coded Files","authors":"Yu Xiang, V. Aggarwal, Y. Chen, Tian Lan","doi":"10.1109/CCGrid.2015.142","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.142","url":null,"abstract":"This paper proposes an approach to minimize service latency in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to limited bandwidth available at both top-of-the-rack and aggregation switches, network bandwidth must be apportioned among different intra-and inter-rack data flows in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form outer-bound of service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint latency optimization over three entangled \"control knobs\": the bandwidth allocation at top-of-the-rack and aggregation switches, the probabilities for scheduling file requests, and the placement of encoded file chunks, which affects data locality. The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm which decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source, distributed file system, Tahoe, and evaluated on a cloud tested with 16 separate physical hosts in an Open Stack cluster. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insight on designing low-latency data center networks with erasure-coded storage.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"45 1","pages":"241-250"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88296218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Cloud-Based Machine Learning Tools for Enhanced Big Data Applications
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.170
A. Cuzzocrea, E. Mumolo, P. Corona
We propose Cloud-based machine learning tools for enhanced Big Data applications. The main idea is to predict the "next" workload arriving at the target Cloud infrastructure via an innovative ensemble-based approach that combines the effectiveness of several well-known classifiers in order to enhance the overall accuracy of the final classification, a concern that is highly relevant in the specific context of Big Data. The so-called workload categorization problem plays a critical role in improving the efficiency and reliability of Cloud-based big data applications. Implementation-wise, our method deploys the Cloud entities that participate in the distributed classification approach on top of virtual machines, which represent the classical "commodity" setting for Cloud-based big data applications. Preliminary experimental assessment and analysis clearly confirm the benefits deriving from our classification framework.
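The combination step of such an ensemble can be as simple as majority voting. Below is a minimal C sketch under that assumption; the three base classifiers are invented stubs standing in for the paper's well-known classifiers.

```c
/* Minimal majority-vote ensemble, sketching only the combination step. */
#include <stdio.h>

typedef int (*classifier_fn)(const double *features);

/* Hypothetical base classifiers: each maps a feature vector to a
 * workload class (0 = batch, 1 = interactive, 2 = streaming). */
static int clf_threshold(const double *f) { return f[0] > 0.5 ? 1 : 0; }
static int clf_cpu_bound(const double *f) { return f[1] > 0.7 ? 0 : 2; }
static int clf_io_bound(const double *f)  { return f[2] > 0.6 ? 2 : 1; }

int ensemble_predict(classifier_fn *clfs, int n_clf, const double *f,
                     int n_classes) {
    int votes[16] = {0};  /* assumes n_classes <= 16 */
    for (int i = 0; i < n_clf; i++) votes[clfs[i](f)]++;
    int best = 0;
    for (int c = 1; c < n_classes; c++)
        if (votes[c] > votes[best]) best = c;
    return best;
}

int main(void) {
    classifier_fn clfs[] = {clf_threshold, clf_cpu_bound, clf_io_bound};
    double features[] = {0.8, 0.3, 0.9}; /* e.g., normalized CPU/IO stats */
    printf("predicted workload class: %d\n",
           ensemble_predict(clfs, 3, features, 3));
    return 0;
}
```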
{"title":"Cloud-Based Machine Learning Tools for Enhanced Big Data Applications","authors":"A. Cuzzocrea, E. Mumolo, P. Corona","doi":"10.1109/CCGrid.2015.170","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.170","url":null,"abstract":"We propose Cloud-based machine learning tools for enhanced Big Data applications, where the main idea is that of predicting the \"next\" workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combine the effectiveness of different well-known classifiers in order to enhance the whole accuracy of the final classification, which is very relevant at now in the specific context of Big Data. So-called workload categorization problem plays a critical role towards improving the efficiency and the reliability of Cloud-based big data applications. Implementation-wise, our method proposes deploying Cloud entities that participate to the distributed classification approach on top of virtual machines, which represent classical \"commodity\" settings for Cloud-based big data applications. Preliminary experimental assessment and analysis clearly confirm the benefits deriving from our classification framework.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"60 1","pages":"908-914"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78854926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Study of the KVM CPU Performance of Open-Source Cloud Management Platforms
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.103
F. Gomez-Folgar, A. García-Loureiro, T. F. Pena, J. I. Zablah, N. Seoane
Nowadays, there are several open-source solutions for building private, public, and even hybrid clouds, such as Eucalyptus, Apache CloudStack, and OpenStack. KVM is one of the hypervisors supported by these cloud platforms. The platforms supply different KVM configurations and, in some cases, present only a subset of CPU features to guest systems, providing a basic abstraction of the underlying CPU. One reason for limiting the features of the virtual CPU is to guarantee guest compatibility with different hardware in heterogeneous environments. However, in a large number of situations the cloud is deployed on a homogeneous set of hosts; in these cases, this limitation can affect the performance of applications executed in guest systems. In this paper, we analyze the architecture, the KVM setup, and the performance of the virtual machines deployed by three popular cloud management platforms, Eucalyptus, Apache CloudStack, and OpenStack, employing a representative set of applications.
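One way to observe this effect from inside a guest is to query CPUID directly. The C probe below (using GCC's <cpuid.h>; a generic x86 check, not specific to any of the platforms studied) reports a few feature bits whose absence often indicates a restricted virtual CPU model.

```c
/* Inspect which CPU features a KVM guest is actually given. */
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 unsupported\n");
        return 1;
    }
    /* Feature bits per the Intel SDM for CPUID.1:ECX. */
    printf("SSE4.2: %s\n", (ecx & (1u << 20)) ? "yes" : "no");
    printf("AES-NI: %s\n", (ecx & (1u << 25)) ? "yes" : "no");
    printf("AVX:    %s\n", (ecx & (1u << 28)) ? "yes" : "no");
    return 0;
}
```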
{"title":"Study of the KVM CPU Performance of Open-Source Cloud Management Platforms","authors":"F. Gomez-Folgar, A. García-Loureiro, T. F. Pena, J. I. Zablah, N. Seoane","doi":"10.1109/CCGrid.2015.103","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.103","url":null,"abstract":"Nowadays, there are several open-source solutions for building private, public and even hybrid clouds such as Eucalyptus, Apache Cloud Stack and Open Stack. KVM is one of the supported hypervisors for these cloud platforms. Different KVM configurations are being supplied by these platforms and, in some cases, a subset of CPU features are being presented to guest systems, providing a basic abstraction of the underlying CPU. One of the reasons for limiting the features of the Virtual CPU is to guarantee the guest compatibility with different hardware in heterogeneous environments. However, in a large number of situations, the cloud is deployed on an homogeneous set of hosts. In these cases, this limitation can affect the performance of applications being executed in guest systems. In this paper, we have analyzed the architecture, the KVM setup, and the performance of the Virtual Machines deployed by three popular cloud management platforms: Eucalyptus, Apache Cloud Stack and Open Stack, employing a representative set of applications.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"12 2","pages":"1225-1228"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72614819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.131
Xiaomin Zhu, Junchao Zhang, Kazutomo Yoshii, Shigang Li, Yunquan Zhang, P. Balaji
The recently released MPI-3.0 standard introduced a process-level shared-memory interface which enables processes within the same node to have direct load/store access to each other's memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabilities and performance implications of using MPI-3.0 shared memory in the context of a five-point stencil computation. Our analysis reveals that the use of MPI-3.0 shared memory has several unforeseen performance implications, including disrupting certain compiler optimizations and incorrectly using suboptimal page sizes inside the OS. Based on this analysis, we propose several methodologies for working around these issues, improving communication performance by 40-85% compared to the current MPI-1.0 based approach.
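The interface in question is small: split off a per-node communicator, allocate a shared window, and load/store directly. A minimal C sketch follows; it shows the standard MPI-3.0 calls, not the paper's stencil code.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Communicator containing only the ranks on this node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int nrank, nsize;
    MPI_Comm_rank(node_comm, &nrank);
    MPI_Comm_size(node_comm, &nsize);

    /* Each rank contributes one double to a node-wide shared window. */
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &mine, &win);

    MPI_Win_lock_all(0, win);
    *mine = (double)nrank;   /* plain store into the shared segment */
    MPI_Win_sync(win);       /* make the store visible to peers */
    MPI_Barrier(node_comm);

    if (nrank == 0 && nsize > 1) {
        /* Query rank 1's segment and read it with an ordinary load. */
        MPI_Aint sz;
        int disp;
        double *peer;
        MPI_Win_shared_query(win, 1, &sz, &disp, &peer);
        printf("rank 0 reads rank 1's value: %.1f\n", *peer);
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```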
{"title":"Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations","authors":"Xiaomin Zhu, Junchao Zhang, Kazutomo Yoshii, Shigang Li, Yunquan Zhang, P. Balaji","doi":"10.1109/CCGrid.2015.131","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.131","url":null,"abstract":"The recently released MPI-3.0 standard introduced a process-level shared-memory interface which enables processes within the same node to have direct load/store access to each others' memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabilities and performance implications of using MPI-3.0 shared memory, in the context of a five-point stencil computation. Our analysis reveals that the use of MPI-3.0 shared memory has several unforeseen performance implications including disrupting certain compiler optimizations and incorrectly using suboptimal page sizes inside the OS. Based on this analysis, we propose several methodologies for working around these issues and improving communication performance by 40-85% compared to the current MPI-1.0 based approach.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"1099-1106"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90791599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Parallel DC3 Algorithm for Suffix Array Construction on Many-Core Accelerators
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.56
Gang Liao, Longfei Ma, Guangming Zang, L. Tang
In bioinformatics applications, suffix arrays are widely used for DNA sequence alignment in the initial exact-match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, like GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array construction on the GPU. As a result, suffix array construction on the GPU achieves around a 10x speedup on standard large data sets containing more than 100 million characters. The idea is simple, fast, and scalable, and can easily be scaled to multi-core processors and even heterogeneous architectures.
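To make the data structure concrete, here is a deliberately naive serial suffix-array construction in C. It sorts suffixes with qsort and is nothing like the parallel DC3 algorithm the paper implements on GPUs; it only shows what the output array is.

```c
/* Naive O(n^2 log n) suffix-array construction -- a demo, not DC3. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;  /* shared by the comparator */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

int main(void) {
    text = "banana";
    int n = (int)strlen(text);
    int *sa = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) sa[i] = i;  /* suffix start offsets */

    /* Sort offsets by the lexicographic order of their suffixes. */
    qsort(sa, n, sizeof(int), cmp_suffix);

    for (int i = 0; i < n; i++)
        printf("sa[%d] = %d  %s\n", i, sa[i], text + sa[i]);
    free(sa);
    return 0;
}
```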
{"title":"Parallel DC3 Algorithm for Suffix Array Construction on Many-Core Accelerators","authors":"Gang Liao, Longfei Ma, Guangming Zang, L. Tang","doi":"10.1109/CCGrid.2015.56","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.56","url":null,"abstract":"In bioinformatics applications, suffix arrays are widely used to DNA sequence alignments in the initial exact match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, like GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array on GPU. As a result, suffix array construction on GPU achieves around 10x speedup on standard large data sets, which contain more than 100 million characters. The idea is simple, fast and scalable that can be easily scale to multi-core processors and even heterogeneous architectures.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"1 1","pages":"1155-1158"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89703154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.166
Jie Zhang, Xiaoyi Lu, Mark Daniel Arnold, D. Panda
Cloud Computing with Virtualization offers attractive flexibility and elasticity to deliver resources by providing a platform for consolidating complex IT resources in a scalable manner. However, efficiently running HPC applications on Cloud Computing systems is still full of challenges. One of the biggest hurdles in building efficient HPC clouds is the unsatisfactory performance offered by underlying virtualized environments, more specifically, virtualized I/O devices. Recently, Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high-performance interconnects such as InfiniBand and 10GigE. Due to its near-native performance for inter-node communication, many cloud systems such as Amazon EC2 have been using SR-IOV in their production environments. Nevertheless, recent studies have shown that the SR-IOV scheme lacks locality-aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. In this paper, we propose an efficient approach to build HPC clouds based on MVAPICH2 over OpenStack with SR-IOV. We first propose an extension to the OpenStack Nova system to enable the IVShmem channel in deployed virtual machines. We further present and discuss our high-performance design of a virtual-machine-aware MVAPICH2 library over OpenStack-based HPC Clouds. Our design can fully take advantage of high-performance SR-IOV communication for inter-node communication as well as Inter-VM Shmem (IVShmem) for intra-node communication. A comprehensive performance evaluation with micro-benchmarks and HPC applications has been conducted on an experimental OpenStack-based HPC cloud and Amazon EC2. The evaluation results on the experimental HPC cloud show that our design and extension can deliver near bare-metal performance for implementing SR-IOV-based HPC clouds with virtualization. Further, compared with the performance on EC2, our experimental HPC cloud exhibits up to 160X, 65X, and 12X improvement potential in terms of point-to-point, collective, and application performance for future HPC clouds.
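The locality detection underlying the intra-/inter-node channel split can be expressed in a few lines of MPI. The C sketch below merely discovers which peers share a node; the actual IVShmem and SR-IOV channels are of course not implemented here.

```c
/* Locality discovery behind "shared memory intra-node, network
 * inter-node": each rank learns which peers share its physical node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int wrank, wsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    /* Sub-communicator of ranks that can share memory (same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int nsize;
    MPI_Comm_size(node_comm, &nsize);

    if (wrank == 0)
        printf("%d ranks total: %d on my node (shared-memory channel), "
               "%d remote (network channel, e.g. SR-IOV)\n",
               wsize, nsize, wsize - nsize);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```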
{"title":"MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds","authors":"Jie Zhang, Xiaoyi Lu, Mark Daniel Arnold, D. Panda","doi":"10.1109/CCGrid.2015.166","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.166","url":null,"abstract":"Cloud Computing with Virtualization offers attractive flexibility and elasticity to deliver resources by providing a platform for consolidating complex IT resources in a scalable manner. However, efficiently running HPC applications on Cloud Computing systems is still full of challenges. One of the biggest hurdles in building efficient HPC clouds is the unsatisfactory performance offered by underlying virtualized environments, more specifically, virtualized I/O devices. Recently, Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high-performance interconnects such as InfiniBand and 10GigE. Due to its near native performance for inter-node communication, many cloud systems such as Amazon EC2 have been using SR-IOV in their production environments. Nevertheless, recent studies have shown that the SR-IOV scheme lacks locality aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. In this paper, we propose an efficient approach to build HPC clouds based on MVAPICH2 over Open Stack with SR-IOV. We first propose an extension for Open Stack Nova system to enable the IV Shmem channel in deployed virtual machines. We further present and discuss our high-performance design of virtual machine aware MVAPICH2 library over Open Stack-based HPC Clouds. Our design can fully take advantage of high-performance SR-IOV communication for inter-node communication as well as Inter-VM Shmem (IVShmem) for intra-node communication. A comprehensive performance evaluation with micro-benchmarks and HPC applications has been conducted on an experimental Open Stack-based HPC cloud and Amazon EC2. The evaluation results on the experimental HPC cloud show that our design and extension can deliver near bare-metal performance for implementing SR-IOV-based HPC clouds with virtualization. Further, compared with the performance on EC2, our experimental HPC cloud can exhibit up to 160X, 65X, 12X improvement potential in terms of point-to-point, collective and application for future HPC clouds.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"31 1","pages":"71-80"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79333798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Eliminating the Redundancy in MapReduce-Based Entity Resolution
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.24
Cairong Yan, Yalong Song, Jian Wang, Wenjing Guo
Entity resolution is a basic operation in data quality management and a key step in extracting value from data. A parallel data-processing framework based on MapReduce can address the challenges brought by big data. However, two important issues remain: avoiding the redundant pairs produced by multi-pass blocking, and optimizing candidate pairs based on the transitive relations of similarity. In this paper, we propose a multi-signature-based parallel entity resolution method, called multi-sig-er, which supports both unstructured and structured data. Two redundancy-elimination strategies are adopted to prune the candidate pairs and reduce the number of similarity computations without affecting resolution accuracy. Experimental results on real-world datasets show that our method scales to large datasets and is better suited to complex similarity computation than to simple object matching.
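One classic redundancy-elimination rule for multi-pass blocking is to compare a pair only under the first signature the two records share. The C sketch below illustrates that rule with invented signatures; the paper's actual strategies may differ.

```c
/* A pair sharing k blocking signatures would otherwise be compared k
 * times; emit it only in the pass of the first shared signature. */
#include <stdio.h>
#include <string.h>

#define NSIG 3

typedef struct {
    int id;
    const char *sig[NSIG];  /* e.g., phonetic key, prefix, sorted tokens */
} record_t;

/* True iff this pass owns the first signature shared by a and b. */
int should_compare(const record_t *a, const record_t *b, int pass) {
    for (int s = 0; s < NSIG; s++)
        if (strcmp(a->sig[s], b->sig[s]) == 0)
            return s == pass;  /* first shared signature wins */
    return 0;
}

int main(void) {
    record_t r1 = {1, {"SMT", "joh", "john smith"}};
    record_t r2 = {2, {"SMT", "joh", "john smyth"}};
    /* r1 and r2 collide in passes 0 and 1, but are compared only once. */
    for (int pass = 0; pass < NSIG; pass++)
        printf("pass %d: compare(1,2) = %d\n", pass,
               should_compare(&r1, &r2, pass));
    return 0;
}
```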
{"title":"Eliminating the Redundancy in MapReduce-Based Entity Resolution","authors":"Cairong Yan, Yalong Song, Jian Wang, Wenjing Guo","doi":"10.1109/CCGrid.2015.24","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.24","url":null,"abstract":"Entity resolution is the basic operation of data quality management, and the key step to find the value of data. The parallel data processing framework based on MapReduce can deal with the challenge brought by big data. However, there exist two important issues, avoiding redundant pairs led by the multi-pass blocking method and optimizing candidate pairs based on the transitive relations of similarity. In this paper, we propose a multi-signature based parallel entity resolution method, called multi-sig-er, which supports unstructured data and structured data. Two redundancy elimination strategies are adopted to prune the candidate pairs and reduce the number of similarity computation without affecting the resolution accuracy. Experimental results on real-world datasets show that our method tends to handle large datasets and it is more suitable for complex similarity computation than simple object matching.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"6 1","pages":"1233-1236"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90429897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Toward Implementing Robust Support for Portals 4 Networks in MPICH
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.79
Kenneth Raffenetti, Antonio J. Peña, P. Balaji
The Portals 4 network specification is a low-level API for high-performance networks developed by Sandia National Laboratories, Intel Corporation, and the University of New Mexico. Portals 4 is specifically designed to support both the MPI and PGAS programming models efficiently by providing building blocks upon which to implement their particular features. In this paper we discuss our ongoing efforts to add efficient and robust support for Portals 4 networks inside MPICH, and we describe how the API semantics influenced our design. In particular, we found the lack of reliability guarantees from the Portals 4 layer challenging to address. To tackle this situation, we implemented an intermediate layer, Rportals (reliable Portals), which modularizes the reliability functionality within our Portals network module for MPICH. In this paper we present the Rportals design and its performance impact.
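The Portals 4 API itself is not shown here; instead, the following C sketch illustrates the generic ack/retransmit bookkeeping that any reliability layer such as Rportals must maintain over an unreliable transport. All names and the fixed window size are illustrative assumptions, not Rportals internals.

```c
/* Generic reliability bookkeeping: a window of in-flight messages with
 * acknowledgments and timeout-driven retransmission. */
#include <stdio.h>
#include <string.h>

#define WINDOW 8

typedef struct {
    unsigned seq;
    int in_flight;
    double send_time;     /* for retransmit timeouts */
    char payload[64];
} pending_t;

static pending_t pending[WINDOW];
static unsigned next_seq = 0;

int reliable_send(const char *msg, double now) {
    pending_t *slot = &pending[next_seq % WINDOW];
    if (slot->in_flight) return -1;      /* window full: caller must wait */
    slot->seq = next_seq++;
    slot->in_flight = 1;
    slot->send_time = now;
    strncpy(slot->payload, msg, sizeof slot->payload - 1);
    /* ...hand payload to the unreliable transport here... */
    return 0;
}

void on_ack(unsigned seq) {
    pending_t *slot = &pending[seq % WINDOW];
    if (slot->in_flight && slot->seq == seq)
        slot->in_flight = 0;             /* delivery confirmed */
}

void check_timeouts(double now, double rto) {
    for (int i = 0; i < WINDOW; i++)
        if (pending[i].in_flight && now - pending[i].send_time > rto) {
            pending[i].send_time = now;  /* rearm timer and resend */
            /* ...resend pending[i].payload... */
            printf("retransmit seq %u\n", pending[i].seq);
        }
}

int main(void) {
    reliable_send("hello", 0.0);
    reliable_send("world", 0.1);
    on_ack(0);                 /* first message acknowledged */
    check_timeouts(1.0, 0.5);  /* second one times out and is resent */
    return 0;
}
```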
{"title":"Toward Implementing Robust Support for Portals 4 Networks in MPICH","authors":"Kenneth Raffenetti, Antonio J. Peña, P. Balaji","doi":"10.1109/CCGrid.2015.79","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.79","url":null,"abstract":"The Portals 4 network specification is a low-levelAPI for high-performance networks developed by Sandia National Laboratories, Intel Corporation, and the University of NewMexico. Portals 4 is specifically designed to support both the MPIand PGAS programming models efficiently by providing building blocks upon which to implement their particular features. In this paper we discuss our ongoing efforts to add efficient and robust support for Portals 4 networks inside MPICH, and we describe how the API semantics influenced our design. In particular, we found the lack of reliability guarantees from the Portals4 layer challenging to address. To tackle this situation, we implemented an intermediate layer - Rportals (reliable Portals), which modularizes the reliability functionality within our Portals network module for MPICH. In this paper we present theRportals design and its performance impact.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"76 1","pages":"1173-1176"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90587044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Assessing Memory Access Performance of Chapel through Synthetic Benchmarks
Pub Date : 2015-05-04 DOI: 10.1109/CCGrid.2015.157
Engin Kayraklioglu, T. El-Ghazawi
The Partitioned Global Address Space (PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers from handling the details of data movement in a distributed-memory environment by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires address mapping to the system virtual address space, and this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyze the extent of this overhead by implementing a microbenchmark that tests the different types of memory accesses observable in Chapel. We show that, as locality is exploited, speedup gains of up to 35x can be achieved. This was demonstrated through hand tuning, however; more productive means should be provided to deliver such performance improvements without excessively burdening programmers. Therefore, we also discuss possibilities for increasing Chapel's performance through standard libraries, compiler, runtime, and/or hardware support to handle different types of memory accesses more efficiently.
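The cost being measured is essentially an extra address translation on every access. The C microbenchmark below mimics it in a single address space by routing one traversal through an index-mapping table; it is an analogy to, not a reproduction of, the paper's Chapel benchmark.

```c
/* Same traversal twice: direct indexing vs. an index-mapping table
 * standing in for PGAS address translation. Timings are machine-dependent. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

int main(void) {
    double *data = malloc(N * sizeof(double));
    size_t *map = malloc(N * sizeof(size_t));
    if (!data || !map) return 1;
    for (size_t i = 0; i < N; i++) { data[i] = 1.0; map[i] = i; }

    clock_t t0 = clock();
    double s1 = 0.0;
    for (size_t i = 0; i < N; i++) s1 += data[i];       /* direct */
    clock_t t1 = clock();
    double s2 = 0.0;
    for (size_t i = 0; i < N; i++) s2 += data[map[i]];  /* translated */
    clock_t t2 = clock();

    printf("direct:     %.3fs (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("translated: %.3fs (sum %.0f)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    free(data);
    free(map);
    return 0;
}
```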
{"title":"Assessing Memory Access Performance of Chapel through Synthetic Benchmarks","authors":"Engin Kayraklioglu, T. El-Ghazawi","doi":"10.1109/CCGrid.2015.157","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.157","url":null,"abstract":"The Partitioned Global Address Space(PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers from handling details of data movement in a distributed memory environment, by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires address mapping to the system virtual address space, and as such, this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyzed the extent of this overhead by implementing a micro benchmark to test different types of memory accesses that can be observed in Chapel. We showed that, as the locality gets exploited speedup gains up to 35x can be achieved. This was demonstrated through hand tuning, however. More productive means should be provided to deliver such performance improvement without excessively burdening programmers. Therefore, we also discuss possibilities to increase Chapel's performance through standard libraries, compiler, runtime and/or hardware support to handle different types of memory accesses more efficiently.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"7 1","pages":"1147-1150"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78436529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3