SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, Latest Papers

Fault-Tolerant Dynamic Task Graph Scheduling
Mehmet Can Kurt, S. Krishnamoorthy, Kunal Agrawal, G. Agrawal
In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. From users, we elicit the basic task graph structure in terms of successor and predecessor relationships. The work-stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and metadata associated with a task get corrupted. We use this redundancy, and knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.
DOI: https://doi.org/10.1109/SC.2014.64 | Published: 2014-11-16
Citations: 36
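The selective-recovery idea in the abstract above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `TaskGraph` class, the redundant-copy check, and the recursive evaluation are hypothetical and are not the paper's work-stealing scheduler or its actual recovery protocol. The sketch only shows the general principle that, given predecessor relationships, a corrupted task result can be recomputed without re-running tasks whose results are still intact.

```python
from typing import Callable, Dict, List


class TaskGraph:
    """Toy task graph with a redundant copy of each result (hypothetical design)."""

    def __init__(self) -> None:
        self.preds: Dict[str, List[str]] = {}
        self.funcs: Dict[str, Callable[..., int]] = {}
        self.results: Dict[str, int] = {}
        self.copies: Dict[str, int] = {}   # redundant copy used to detect corruption
        self.runs: Dict[str, int] = {}     # how many times each task executed

    def add(self, name: str, func: Callable[..., int], preds=()) -> None:
        self.preds[name] = list(preds)
        self.funcs[name] = func
        self.runs[name] = 0

    def _valid(self, name: str) -> bool:
        # A result is trusted only if it exists and matches its redundant copy.
        return name in self.results and self.results[name] == self.copies[name]

    def get(self, name: str) -> int:
        # Re-execute a task only when its result is missing or corrupted;
        # intact predecessors are reused rather than recomputed.
        if not self._valid(name):
            inputs = [self.get(p) for p in self.preds[name]]
            value = self.funcs[name](*inputs)
            self.results[name] = value
            self.copies[name] = value
            self.runs[name] += 1
        return self.results[name]


g = TaskGraph()
g.add("a", lambda: 2)
g.add("b", lambda: 3)
g.add("c", lambda a, b: a + b, ["a", "b"])
g.get("c")                 # runs a, b, c once each
g.results["c"] = 999       # simulate a soft fault corrupting c's data
g.add("e", lambda c: c + 1, ["c"])
print(g.get("e"), g.runs)  # c is re-executed; a and b are not
```

In this sketch the recovery is localized: only the corrupted task `c` runs a second time, which mirrors the "selective and localized recovery" goal stated in the abstract without claiming to reproduce its mechanism.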
A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers
C. Olschanowsky, M. Strout, S. Guzik, J. Loffeld, J. Hittinger
Structured-grid PDE solver frameworks parallelize over boxes, which are rectangular domains of cells or faces in a structured grid. In the Chombo framework, the box sizes are typically 16^3 or 32^3, but larger box sizes such as 128^3 would result in less surface area and therefore less storage, copying, and/or ghost-cell communication overhead. Unfortunately, current on-node parallelization schemes perform poorly for these larger box sizes. In this paper, we investigate 30 different inter-loop optimization strategies and demonstrate the parallel scaling advantages of some of these variants on NUMA multicore nodes. Shifted, fused, and communication-avoiding variants for 128^3 boxes result in close to ideal parallel scaling and come close to matching the performance of 16^3 boxes on three different multicore systems for a benchmark that is a proxy for program idioms found in Computational Fluid Dynamics (CFD) codes.
DOI: https://doi.org/10.1109/SC.2014.70 | Published: 2014-11-16
Citations: 36
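The surface-area argument in the abstract above can be checked with simple arithmetic. The sketch below assumes a cubic box of n cells per side surrounded by a ghost layer one cell wide; the ghost width and the function name `ghost_fraction` are assumptions for illustration and are not taken from the paper or from Chombo.

```python
def ghost_fraction(n: int, ghost: int = 1) -> float:
    """Fraction of stored cells that are ghost cells for an n^3 box.

    Assumes a ghost layer of `ghost` cells on every face (hypothetical width).
    """
    interior = n ** 3
    total = (n + 2 * ghost) ** 3
    return (total - interior) / total


for n in (16, 32, 128):
    print(f"{n}^3 box: {ghost_fraction(n):.3f} of stored cells are ghost cells")
```

Under these assumptions the ghost-cell share drops from roughly 0.298 for 16^3 boxes to about 0.166 for 32^3 and about 0.045 for 128^3, which is the storage and copying saving the abstract refers to. Wider ghost layers make the gap larger.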
Recycled Error Bits: Energy-Efficient Architectural Support for Floating Point Accuracy
Ralph Nathan, Bryan Anthonio, Shih-Lien Lu, Helia Naeimi, Daniel J. Sorin, Xiaobai Sun
In this work, we provide energy-efficient architectural support for floating point accuracy. For each floating point addition performed, we "recycle" that operation's rounding error. We make this error architecturally visible such that it can be used, whenever desired, by software. We also design a compiler pass that allows software to automatically use this feature. Experimental results on physical hardware show that software that exploits architecturally recycled error bits can (a) achieve accuracy comparable to a 64-bit FPU with performance and energy that are comparable to a 32-bit FPU, and (b) achieve accuracy comparable to an all-software scheme for 128-bit accuracy with far better performance and energy usage.
DOI: https://doi.org/10.1109/SC.2014.15 | Published: 2014-11-16
Citations: 6
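To see what "recycling" the rounding error of an addition buys, it helps to look at the well-known software analogue. The sketch below uses Knuth's TwoSum, which recovers the exact rounding error of a floating point addition with extra arithmetic, and feeds that error into a cascaded summation. This is a pure-software illustration and an assumption on our part: it is not the paper's hardware mechanism or its compiler pass, which expose the error without the extra operations.

```python
from fractions import Fraction


def two_sum(a: float, b: float):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly
    (barring overflow). This is Knuth's branch-free TwoSum."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    err = (a - a_virtual) + (b - b_virtual)
    return s, err


def compensated_sum(values):
    """Sum floats while accumulating each addition's rounding error separately."""
    total = 0.0
    comp = 0.0
    for v in values:
        total, err = two_sum(total, v)
        comp += err          # the "recycled" error bits
    return total + comp


def naive_sum(values):
    """Plain left-to-right summation, for comparison."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1.0] + [1e-16] * 10
print(naive_sum(values), compensated_sum(values))

# Exactness check of TwoSum using rational arithmetic.
s, err = two_sum(0.1, 0.2)
print(Fraction(s) + Fraction(err) == Fraction(0.1) + Fraction(0.2))
```

In the example, the naive loop never moves off 1.0 because each 1e-16 term is rounded away, while the compensated version retains their combined contribution. The software version pays several extra operations per addition, which is the cost that architecturally visible error bits are meant to avoid.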
Optimized Scheduling Strategies for Hybrid Density Functional Theory Electronic Structure Calculations
William Dawson, F. Gygi
Hybrid Density Functional Theory (DFT) has recently gained popularity as an accurate model of electronic interactions in chemistry and materials science applications. The most computationally expensive part of hybrid DFT simulations is the calculation of exchange integrals between pairs of electrons. We present strategies to achieve improved load balancing and scalability for the parallel computation of these integrals. First, we develop a cost model for the calculation, and utilize random search algorithms to optimize the data distribution and calculation schedule. Second, we further improve performance using partial data-replication to increase data availability across cores. We demonstrate these improvements using an implementation in the Qbox Density Functional Theory code on the Mira Blue Gene/Q computer at Argonne National Laboratory. We perform calculations in the range of 8k to 128k cores on two representative simulation samples from materials science and chemistry applications: liquid water and a metal-water interface.
DOI: https://doi.org/10.1109/SC.2014.61 | Published: 2014-11-16
Citations: 1
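The combination of a cost model with random search described in the abstract above can be sketched in a few lines. The cost model here (per-task costs summed per worker, with the slowest worker setting the total time) and the names `makespan` and `random_search` are hypothetical stand-ins; the paper's actual cost model for exchange integrals and its search algorithm are not given in the abstract.

```python
import random


def makespan(assignment, costs, n_workers):
    """Cost model (assumed): time is set by the most heavily loaded worker."""
    loads = [0.0] * n_workers
    for task, worker in enumerate(assignment):
        loads[worker] += costs[task]
    return max(loads)


def random_search(costs, n_workers, iterations=2000, seed=0):
    """Randomly reassign one task at a time, keeping moves that do not hurt."""
    if not costs:
        return [], 0.0
    rng = random.Random(seed)
    best = [i % n_workers for i in range(len(costs))]  # round-robin start
    best_cost = makespan(best, costs, n_workers)
    for _ in range(iterations):
        candidate = best[:]
        candidate[rng.randrange(len(costs))] = rng.randrange(n_workers)
        cost = makespan(candidate, costs, n_workers)
        if cost <= best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


costs = [9.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
start_cost = makespan([i % 2 for i in range(len(costs))], costs, 2)
schedule, cost = random_search(costs, 2)
print(start_cost, cost)
```

Accepting equal-cost moves lets the search drift across plateaus instead of stalling. A real schedule optimizer would also model communication and data placement, which this sketch ignores.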
Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing
J. Lifflander, S. Krishnamoorthy, L. Kalé
We present an approach to improving data locality across different phases of fork/join programs scheduled using work stealing. The approach consists of: (1) user-specified and automated approaches to constructing a steal tree, the schedule of steal operations, and (2) constrained work-stealing algorithms that constrain the actions of the scheduler to mirror a given steal tree. These are combined to construct work-stealing schedules that maximize data locality across computation phases while ensuring load balance within each phase. These algorithms are also used to demonstrate dynamic coarsening, an optimization that improves spatial locality and reduces sequential overheads by combining many finer-grained tasks into coarser tasks while ensuring sufficient concurrency for locality-optimized load balance. Implementation and evaluation in Cilk demonstrate performance improvements of up to 2.5x on 80 cores. We also demonstrate that dynamic coarsening can combine the performance benefits of coarse task specification with the adaptability of finer tasks.
DOI: https://doi.org/10.1109/SC.2014.75 | Published: 2014-11-16
Citations: 37
MSL: A Synthesis Enabled Language for Distributed Implementations
Zhilei Xu, S. Kamil, Armando Solar-Lezama
This paper demonstrates how ideas from generative programming and software synthesis can help support the development of bulk-synchronous distributed memory kernels. These ideas are realized in a new language called MSL, a C-like language that combines synthesis features with high level notations for array manipulation and bulk-synchronous parallelism to simplify the semantic analysis required for synthesis. The paper shows that by leveraging these high level notations, it is possible to scale the synthesis and automated bug-finding technologies that underlie MSL to realistic computational kernels. Specifically, we demonstrate the methodology through case studies implementing non-trivial distributed kernels -- both regular and irregular -- from the NAS parallel benchmarks. We show that our approach can automatically infer many challenging details from these benchmarks and can enable high level implementation ideas to be reused between similar kernels. We also demonstrate that these high level notations map easily to low level C code and show that the performance of this generated code matches that of handwritten Fortran.
DOI: https://doi.org/10.1109/SC.2014.31 | Published: 2014-11-16
Citations: 10
Fence Scoping
Changhui Lin, V. Nagarajan, Rajiv Gupta
We observe that fence instructions used by programmers are usually only intended to order memory accesses within a limited scope. Based on this observation, we propose the concept of a fence scope, which defines the scope within which a fence enforces the order of memory accesses; a fence with such a scope is called a scoped fence (S-Fence). S-Fence is a customizable fence that lets programmers express ordering demands by specifying the scope of a fence when they only want to order a subset of memory accesses. At runtime, hardware uses the scope information conveyed by programmers to execute fence instructions in a manner that imposes fewer memory ordering constraints than a traditional fence, and hence improves program performance. Our experimental results show that the benefit of S-Fence hinges on the characteristics of applications and hardware parameters. A group of lock-free algorithms achieve peak speedups ranging from 1.13x to 1.34x, while full applications achieve speedups ranging from 1.04x to 1.23x.
DOI: https://doi.org/10.1109/SC.2014.14 | Published: 2014-11-16
Citations: 23
Reciprocal Resource Fairness: Towards Cooperative Multiple-Resource Fair Sharing in IaaS Clouds
Haikun Liu, Bingsheng He
Resource sharing in virtualized environments has been shown to offer significant benefits for application performance and resource/energy efficiency. However, resource sharing, especially for multiple resource types, poses several severe and challenging problems in pay-as-you-use cloud environments, such as sharing incentive, free-riding, lying, and economic fairness. To address those problems, we propose Reciprocal Resource Fairness (RRF), a novel resource allocation mechanism to enable fair sharing of multiple types of resources among multiple tenants in new-generation cloud environments. RRF implements two complementary and hierarchical mechanisms for resource sharing: inter-tenant resource trading and intra-tenant weight adjustment. We show that RRF satisfies several highly desirable properties to ensure fairness. Experimental results show that RRF is promising for both cloud providers and tenants. Compared to existing cloud models, RRF improves virtual machine (VM) density and cloud providers' revenue by 2.2X. For tenants, RRF improves application performance by 45% and guarantees 95% economic fairness among multiple tenants.
DOI: https://doi.org/10.1109/SC.2014.84 | Published: 2014-11-16
Citations: 45
Parallel Programming with Migratable Objects: Charm++ in Practice
Bilge Acun, Abhishek K. Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael P. Robson, Yanhua Sun, E. Totoni, Lukasz Wesolowski, L. Kalé
The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in science and engineering applications of today have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to development of applications that scale irrespective of the rough landscape of supercomputing technology. Empirical evaluation presented in this paper spans many mini-applications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.
DOI: https://doi.org/10.1109/SC.2014.58 | Published: 2014-11-16
Citations: 147
Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory
S. Michalak, W. Rust, John T. Daly, Andrew J. DuBois, D. DuBois
Silent Data Corruption (SDC) can threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems. To characterize this issue, correctness field testing of HPC platforms at Los Alamos National Laboratory was performed. This work presents results for 12 platforms, including over 1,000 node-years of computation performed on over 8,750 compute nodes and over 260 petabytes of data transfers involving nearly 6,000 compute nodes, and relevant lessons learned. Incorrect results characteristic of transient errors and of intermittent errors were observed. These results are a key underpinning to resilience efforts as they provide signatures of incorrect results observed under field conditions. Five incorrect results consistent with a transient error mechanism were observed, suggesting that the effects of transient errors could be mitigated. However, the observed numbers of incorrect results consistent with an intermittent error mechanism suggest that intermittent errors could substantially affect computational correctness.
静默数据损坏(SDC)会威胁到在高性能计算(HPC)平台和其他系统上执行的科学计算的完整性。为了描述这个问题,在洛斯阿拉莫斯国家实验室对高性能计算平台进行了正确性现场测试。这项工作展示了12个平台的结果,包括在超过8,750个计算节点上执行的超过1,000个节点年的计算和涉及近6,000个计算节点的超过260 pb的数据传输,以及相关的经验教训。观察到瞬态误差和间歇误差的不正确结果。这些结果是弹性工作的关键基础,因为它们提供了在现场条件下观察到的不正确结果的标志。观察到与瞬态误差机制一致的五个错误结果,表明瞬态误差的影响可以减轻。然而,观察到的与间歇性错误机制一致的错误结果的数量表明,间歇性错误可能会严重影响计算的正确性。
{"title":"Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory","authors":"S. Michalak, W. Rust, John T. Dal, Rew J. Dubois, D. Dubois","doi":"10.1109/SC.2014.55","DOIUrl":"https://doi.org/10.1109/SC.2014.55","journal":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","publicationDate":"2014-11-16"}
Citations: 5