
Latest publications: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Two Roads to Parallelism: From Serial Code to Programming with STAPL
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00048
Lawrence Rauchwerger
Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential.
Citations: 0
Effects and Benefits of Node Sharing Strategies in HPC Batch Systems
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00016
Alvaro Frank, Tim Süß, A. Brinkmann
Processor manufacturers today scale performance by increasing the number of cores on each CPU. Unfortunately, not all HPC applications can efficiently saturate all cores of a single node, even if they successfully scale to thousands of nodes. For these applications, sharing nodes with other applications can help to stress different resources on the nodes to more efficiently use them. Previous work has shown that the performance impact of node sharing is very application dependent but very little work has studied its effects within batch systems and for complex parallel application mixes. Administrators therefore typically fear the complexity of running a batch system supporting node sharing and also fear that interference between co-allocated jobs in practice leads to worse performance. This paper focuses on sharing nodes by oversubscribing cores through hyper-threading. We introduce new node sharing strategies for batch systems by deriving extensions to the well-known backfill and first fit algorithms. These strategies have been implemented in the SLURM workload manager and the evaluation is based on NERSC Trinity scientific mini applications. The evaluation of our node sharing strategies shows no overhead when using co-allocation, but an increased computational efficiency of 19% and an increased scheduling efficiency of 25.2% compared to standard node allocation.
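As a rough illustration of the scheduling idea (a toy sketch, not the authors' SLURM extension; the function and its placement rules are hypothetical), a first-fit placement that permits bounded oversubscription of shareable jobs might look like:

```python
def first_fit_with_sharing(jobs, num_nodes, oversub=2):
    """Toy first-fit placement with node sharing.

    jobs: list of (name, shareable) pairs, in arrival order.
    A non-shareable job needs an empty node; shareable jobs may be
    co-allocated with each other, up to `oversub` per node (modeling
    hyper-threading oversubscription). Returns {name: node_index}.
    """
    occupants = [[] for _ in range(num_nodes)]  # (name, shareable) per node
    placement = {}
    for name, shareable in jobs:
        for node, occ in enumerate(occupants):
            if not occ:
                fits = True   # empty node: always usable
            elif shareable and all(s for _, s in occ) and len(occ) < oversub:
                fits = True   # co-allocate with other shareable jobs
            else:
                fits = False
            if fits:
                occ.append((name, shareable))
                placement[name] = node
                break
        else:
            raise RuntimeError(f"no node available for {name}")
    return placement
```

For example, two shareable jobs land on the same node while a non-shareable job gets a dedicated one: `first_fit_with_sharing([("A", True), ("B", True), ("C", False)], num_nodes=2)` places A and B on node 0 and C on node 1.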
Citations: 5
SprintCon: Controllable and Efficient Computational Sprinting for Data Center Servers
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00090
Wenli Zheng, Xiaorui Wang, Yue Ma, Chao Li, Hao Lin, Bin Yao, Jianfeng Zhang, M. Guo
Computational sprinting is an effective mechanism to temporarily boost the performance of data center servers. However, despite its great effect on performance improvement, how to make the sprinting process controllable and how to maximize sprinting efficiency have not yet been well discussed. These can be significant problems for a data center when computational sprinting is needed for more than a few minutes, since it requires the support of energy storage, whose capacity is limited. The control and efficiency of sprinting involve not only how fast to run servers and how to allocate resources to co-running workloads, but also the impact of power overload, and how to handle the overload with circuit breakers and energy storage to ensure power safety. Different workloads can impact sprinting in different ways, and hence efficient sprinting requires workload-specific strategies. In this paper, we propose SprintCon to realize controllable and efficient computational sprinting for data center servers. SprintCon mainly consists of a power load allocator and two different power controllers. The allocator analyzes how to divide the power load among different power sources. The server power controller adapts the CPU cores that process batch workloads, to improve efficiency in terms of computing, energy, and cost. The UPS power controller dynamically adjusts the discharge rate of UPS energy storage to satisfy the time-varying power demand of interactive workloads and ensure power safety. The experiment results show that compared to state-of-the-art solutions, SprintCon can achieve 6-56% better computing performance and up to 87% less demand of energy storage.
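The role of a power load allocator can be sketched with a simplified model (the function, units, and split policy are assumptions for illustration, not SprintCon's actual controller): the grid serves the load up to the breaker's limit, and UPS storage discharges to cover the sprint's excess while its charge lasts.

```python
def split_power_load(demand_w, breaker_limit_w, ups_charge_wh, dt_h):
    """Split an instantaneous power demand between the mains circuit and
    UPS energy storage over a control interval of dt_h hours.

    Grid supplies up to the breaker limit; the UPS covers the remainder
    while its stored energy lasts. Returns (grid_w, ups_w,
    remaining_charge_wh); any unmet demand means the sprint must be
    throttled by the server power controller.
    """
    grid_w = min(demand_w, breaker_limit_w)
    deficit_w = demand_w - grid_w
    # power the battery can sustain for the whole interval
    ups_w = min(deficit_w, ups_charge_wh / dt_h)
    remaining_wh = ups_charge_wh - ups_w * dt_h
    return grid_w, ups_w, remaining_wh
```

For instance, a 1200 W demand against a 1000 W breaker and 50 Wh of charge over a 6-minute interval draws 1000 W from the grid and 200 W from the UPS, leaving 30 Wh.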
Citations: 1
Coding the Continuum
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00011
Ian T Foster
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Citations: 2
Identifying Latent Reduced Models to Precondition Lossy Compression
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00039
Huizhang Luo, Dan Huang, Qing Liu, Zhenbo Qiao, Hong Jiang, J. Bi, Haitao Yuan, Mengchu Zhou, Jinzhen Wang, Zhenlu Qin
With the high volume and velocity of scientific data produced on high-performance computing systems, it has become increasingly critical to improve the compression performance. Leveraging the general tolerance of reduced accuracy in applications, lossy compressors can achieve much higher compression ratios with a user-prescribed error bound. However, they are still far from satisfying the reduction requirements from applications. In this paper, we propose and evaluate the idea that data need to be preconditioned prior to compression, such that they can better match the design philosophies of a compressor. In particular, we aim to identify a reduced model that can be utilized to transform the original data to a more compressible form. We begin with a case study of Heat3d as a proof of concept, in which we demonstrate that a reduced model can indeed reside in the full model output, and can be utilized to improve compression ratios. We further explore more general dimension reduction techniques to extract the reduced model, including principal component analysis, singular value decomposition, and discrete wavelet transform. After preconditioning, the reduced model in conjunction with delta is stored, which results in higher compression ratios. We evaluate the reduced models on nine scientific datasets, and the results show the effectiveness of our approaches.
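A minimal sketch of the SVD flavor of this preconditioning (illustrative only; the paper also uses PCA and discrete wavelet transforms, and its delta encoding is more involved than a plain subtraction): extract a rank-k reduced model and hand the small-magnitude residual to the lossy compressor.

```python
import numpy as np

def svd_precondition(field, k):
    """Split a 2-D field into a rank-k reduced model and a residual delta.
    The delta typically has far smaller dynamic range than the original,
    so a lossy compressor with an absolute error bound compresses it
    more effectively; model + delta reconstructs the field exactly.
    """
    u, s, vt = np.linalg.svd(field, full_matrices=False)
    model = (u[:, :k] * s[:k]) @ vt[:k, :]  # rank-k approximation
    delta = field - model                    # what the compressor sees
    return model, delta

rng = np.random.default_rng(0)
# smooth low-rank background plus small noise, mimicking simulation output
x = np.linspace(0.0, 1.0, 64)
field = (np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))
         + 1e-3 * rng.standard_normal((64, 64)))
model, delta = svd_precondition(field, k=4)
```

On this synthetic field the residual's magnitude is orders of magnitude below the original's, which is the property the preconditioning step exploits.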
Citations: 10
Understanding the Impact of Dynamic Power Capping on Application Progress
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00088
Srinivasan Ramesh, Swann Perarnau, Sridutt Bhalachandra, A. Malony, P. Beckman
Electrical power has become an important design constraint in high-performance computing (HPC) systems. On future HPC machines, power is likely to be a budgeted resource and thus managed dynamically. Power management software needs to reliably measure application performance at runtime in order to respond effectively to changes in application behavior. Execution time tells us little about how the science in the application is progressing toward an application-defined end goal. To the best of our knowledge, no study has defined or categorized online application progress in the context of power management. Based on semi-structured interviews with HPC application-specialists, we define an online notion of progress—an application-specific metric that can be monitored at runtime to provide a sense of the rate at which application science is being performed. Using instrumentation, we characterize and categorize the progress of various production scientific applications and benchmarks. We propose a model of the impact of dynamic power capping on application progress. By experimental evaluation, we show that our model accurately captures the general behavior of the progress of different classes of applications under a power cap. We believe that such a model is an important first step toward the design of more dynamic power management policies for HPC systems.
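Since the abstract does not give the model's form, here is a purely hypothetical example of what a progress-versus-power-cap relation could look like (the clipped-linear shape and all parameter names are assumptions, not the paper's model): normalized progress rate as a function of the cap, with a floor below which the node makes no progress and a ceiling beyond which extra power is wasted.

```python
def progress_rate(cap_w, p_min_w, p_max_w):
    """Hypothetical model of normalized application progress under a
    power cap: below p_min_w the application cannot make progress
    (rate 0); above p_max_w additional power buys nothing (rate 1);
    in between, progress scales linearly with the cap.
    """
    if cap_w <= p_min_w:
        return 0.0
    if cap_w >= p_max_w:
        return 1.0
    return (cap_w - p_min_w) / (p_max_w - p_min_w)
```

A power manager could invert such a model to pick the lowest cap that still meets a target progress rate for each application class.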
Citations: 11
Revisiting the I/O-Complexity of Fast Matrix Multiplication with Recomputations
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00058
Roy Nissim, O. Schwartz
Communication costs, between processors and across the memory hierarchy, often dominate the runtime of algorithms. Can we trade these costs for recomputations? Most algorithms do not utilize recomputation for this end, and most communication cost lower bounds assume no recomputation, hence do not address this fundamental question. Recently, Bilardi and De Stefani (2017), and Bilardi, Scquizzato, and Silvestri (2018) showed that recomputations cannot reduce communication costs in Strassen's fast matrix multiplication and in fast Fourier transform. We extend the former bound and show that recomputations cannot reduce communication costs for a few other fast matrix multiplication algorithms.
Citations: 6
A High-Performance Distributed Relational Database System for Scalable OLAP Processing
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00083
Jason Arnold, Boris Glavic, I. Raicu
The scalability of systems such as Hive and Spark SQL, which are built on top of big data platforms, has enabled query processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. While we also support serializable transactions, the system has not been optimized for this use case. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS's scalability is on par with Hive and Spark SQL (up to 96 nodes) while its per-node performance can compete with MPP databases like Greenplum.
Citations: 7
Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00057
I. Yamazaki, Z. Bai, Ding Lu, J. Dongarra
Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that further reduces the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. The speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.
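The basic pattern behind a matrix powers kernel can be sketched as follows (a plain dense illustration of the s-step idea, not the paper's specialized MPK, which additionally exploits the structure of the deflation matrix): s basis vectors are generated in one sweep, after which an s-step Lanczos variant would orthogonalize them in bulk.

```python
import numpy as np

def matrix_powers(a, v, s):
    """Generate the s-step Krylov basis [v, A v, A^2 v, ..., A^s v] as the
    columns of a matrix, with a single loop applying A. Batching the basis
    generation this way is what lets s-step methods amortize communication
    compared to producing one vector per iteration.
    """
    basis = [v]
    for _ in range(s):
        basis.append(a @ basis[-1])
    return np.column_stack(basis)

# tiny Hermitian example: a diagonal matrix, so A^j v is easy to verify
a = np.diag([1.0, 2.0, 3.0])
v = np.ones(3)
k = matrix_powers(a, v, 2)
```

Here column j of `k` holds A^j v, e.g. the last column is [1, 4, 9] since A is diag(1, 2, 3).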
Citations: 4
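The abstract's central computational idea — repeatedly applying the deflated operator Â = A − σQQᵀ to build s basis vectors at a time, without ever forming Â densely — can be sketched in a few lines. The following is an illustrative NumPy sketch, not the paper's kernel: the function name, the choice σ = 1, and the small dense test matrices are all assumptions for demonstration.

```python
import numpy as np

def deflated_mpk(A, Q, sigma, v, s):
    """Matrix powers kernel for an explicitly deflated operator.

    Computes the Krylov basis [v, Â v, ..., Â^s v] for
    Â = A - sigma * Q @ Q.T without forming Â densely: the
    low-rank deflation term is applied as Q @ (Q.T @ w).
    """
    basis = [v]
    w = v
    for _ in range(s):
        w = A @ w - sigma * (Q @ (Q.T @ w))
        basis.append(w)
    return np.column_stack(basis)

# Small reproducible example with a random Hermitian matrix.
rng = np.random.default_rng(0)
n, k, s = 8, 2, 3
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                                  # Hermitian test matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # "converged" eigenvectors
v = rng.standard_normal(n)
V = deflated_mpk(A, Q, 1.0, v, s)

# Check against the explicitly formed deflated matrix.
A_hat = A - Q @ Q.T
ref = v
for j in range(1, s + 1):
    ref = A_hat @ ref
    assert np.allclose(V[:, j], ref)
```

The point of the low-rank form is that the deflation term costs only O(nk) extra work per matrix product instead of densifying A with a rank-k update, which is what gives the specialized kernel room to cut both computation and communication.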
Online Live VM Migration Algorithms to Minimize Total Migration Time and Downtime
Pub Date : 2019-05-20 DOI: 10.1109/IPDPS.2019.00051
Nikos Tziritas, Thanasis Loukopoulos, S. Khan, Chengzhong Xu, Albert Y. Zomaya
Virtual machine (VM) migration is a widely used technique in cloud computing systems to increase reliability. A VM may also be migrated during its lifetime for many other reasons, such as reducing energy consumption, improving performance, or maintenance. During a live VM migration, the VM remains up until all or part of its data has been transmitted from source to destination; the remaining data are transmitted off-line by suspending the VM. The longer the off-line transmission, the worse the VM's performance, because the VM service is down while off-line data are transmitted. Because a running VM's memory is subject to change, already transmitted data pages may be dirtied and need re-transmission. Deciding when to suspend the VM is therefore not a trivial task: suspending the VM early may transmit a significant amount of data off-line, degrading the VM's performance, while waiting too long to suspend it may re-transmit a huge amount of dirty data, wasting resources. In this paper, we tackle the joint problem of minimizing both the total VM migration time (reflecting the resources spent during a migration) and the VM downtime (reflecting the performance degradation). The two objective functions are weighted according to the needs of the underlying cloud provider/user. To tackle the problem, we propose an online deterministic algorithm with a strong competitive ratio, as well as a randomized online algorithm that achieves significantly better results than the deterministic one.
Citations: 6
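The abstract states the suspend-early vs. suspend-late trade-off only informally. Below is a self-contained cost model of pre-copy live migration that makes it concrete; every parameter (page counts, dirty rate, per-round overhead, the objective weights) is hypothetical, and the brute-force round search is just an offline baseline for the weighted objective, not the paper's online algorithms.

```python
def precopy_migration(mem_pages, dirty_rate, bandwidth, overhead, rounds):
    """Simulate pre-copy live migration: `rounds` iterative copy rounds
    while the VM keeps running, then a stop-and-copy phase for the rest.

    dirty_rate and bandwidth are in pages/second; overhead is a fixed
    per-round setup cost in seconds. Pages dirtied during a round must
    be re-sent in the next one. Returns (total_time, downtime).
    """
    to_send = mem_pages
    total_time = 0.0
    for _ in range(rounds):
        t = to_send / bandwidth + overhead        # this round's duration
        total_time += t
        to_send = min(mem_pages, dirty_rate * t)  # pages dirtied meanwhile
    downtime = to_send / bandwidth                # VM suspended for final copy
    return total_time + downtime, downtime

def best_rounds(mem_pages, dirty_rate, bandwidth, overhead,
                w_total, w_down, max_rounds=30):
    """Offline brute force: the round count minimizing the weighted
    objective w_total * total_migration_time + w_down * downtime."""
    def cost(r):
        total, down = precopy_migration(mem_pages, dirty_rate, bandwidth,
                                        overhead, r)
        return w_total * total + w_down * down
    return min(range(max_rounds + 1), key=cost)
```

With a dirty rate below the link bandwidth, each extra round shrinks downtime geometrically but adds its own duration to the total migration time: for 1000 pages, 100 pages/s dirtied, 500 pages/s bandwidth and 0.5 s per-round overhead, suspending immediately gives 2 s of downtime, while two pre-copy rounds cut downtime to 0.2 s at the cost of a 3.7 s total migration.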
Journal: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)