
Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

Breaking Band: A Breakdown of High-performance Communication
Rohit Zambre, M. Grodowitz, Aparna Chandramowlishwaran, Pavel Shamis
The critical path of internode communication on large-scale systems is composed of multiple components. When a supercomputing application initiates the transfer of a message using a high-level communication routine such as an MPI_Send, the payload of the message traverses multiple software stacks, the I/O subsystem on both the host and target nodes, and network components such as the switch. In this paper, we analyze where, why, and how much time is spent on the critical path of communication by modeling the overall injection overhead and end-to-end latency of a system. We focus our analysis on the performance of small messages since fine-grained communication is becoming increasingly important with the growing trend of an increasing number of cores per node. The analytical models present an accurate and detailed breakdown of time spent in internode communication. We validate the models on Arm ThunderX2-based servers connected with Mellanox InfiniBand. This is the first work of this kind on Arm. Alongside our breakdown, we describe the methodology to measure the time spent in each component so that readers with access to precise CPU timers and a PCIe analyzer can measure breakdowns on systems of their interest. Such a breakdown is crucial for software developers, system architects, and researchers to guide their optimization efforts. As researchers ourselves, we use the breakdown to simulate the impacts and discuss the likelihoods of a set of optimizations that target the bottlenecks in today's high-performance communication.
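The abstract above models end-to-end latency as the sum of per-component costs along the critical path. A minimal sketch of that decomposition, with hypothetical component names and costs (not the paper's measurements):

```python
# Toy end-to-end latency model: overall latency is the sum of time spent
# in each component on the critical path. Component names and costs below
# are illustrative placeholders, not measurements from the paper.

def end_to_end_latency_ns(components):
    """Sum per-component costs along the critical path."""
    return sum(components.values())

def breakdown(components):
    """Report each component's share of the total, largest first."""
    total = end_to_end_latency_ns(components)
    return sorted(((name, cost / total) for name, cost in components.items()),
                  key=lambda item: item[1], reverse=True)

# Hypothetical breakdown of a small-message send (nanoseconds).
costs = {
    "MPI software stack": 150,
    "host PCIe write": 250,
    "switch traversal": 100,
    "target PCIe/DMA": 300,
    "completion handling": 200,
}

total = end_to_end_latency_ns(costs)   # 1000 ns in this toy example
shares = breakdown(costs)              # target PCIe/DMA dominates here
```

Such a ranked breakdown is exactly what lets an optimization effort target the dominant component first.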
Cited by: 7
EMBA
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337863
Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, Zhenlin Wang
EMBA 604 STRATEGIC ANALYSIS. (2) This course provides a framework of competitive analysis and competitive advantage upon which functionally oriented courses in the program may build. It provides an overall picture of the analysis activities and decision-making situations facing a company’s top management team (i.e., CEOs, general managers, division managers) focusing on top management decisions relating to the external environment and internal issues. It presents practical experience in recognizing what information is important, sifting it for relevance, and employing the knowledge for the competitive benefit of the firm. Prereq: Admission to the joint EMBA program.
Cited by: 22
JobPacker
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337880
Zhuozhao Li, Haiying Shen
Despite the many advantages of hybrid electrical/optical datacenter networks (Hybrid-DCN), current job schedulers for data-parallel frameworks are not suitable for Hybrid-DCN, since they do not aggregate data traffic to facilitate use of the optical circuit switch (OCS). In this paper, we propose JobPacker, a job scheduler for data-parallel frameworks in Hybrid-DCN that aims to take full advantage of OCS to improve job performance. JobPacker aggregates the data transfers of a job in order to use OCS to improve data transfer efficiency. It first explores the tradeoff between parallelism and traffic aggregation for each shuffle-heavy recurring job, and then generates an offline schedule specifying which racks run each job and the sequence in which recurring jobs run in each rack so as to yield the best performance. It uses a new sorting method to prioritize recurring jobs in offline scheduling, preventing high resource contention while fully utilizing cluster resources. In the real-time scheduler, JobPacker uses the offline schedule to guide data placement and schedule recurring jobs, and assigns non-recurring jobs to idle resources not reserved for recurring jobs. Trace-driven simulation and GENI-based emulation show that JobPacker reduces the makespan by up to 49% and the median completion time by up to 43%, compared to state-of-the-art schedulers in Hybrid-DCN.
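The offline-scheduling idea can be sketched as a greedy packer: place heavy shuffle traffic first, and pack each recurring job onto a single least-loaded rack so its transfers are aggregated. The sorting key and capacity model below are deliberate simplifications, not JobPacker's actual algorithm:

```python
# Simplified sketch of offline rack assignment for recurring jobs:
# sort jobs by shuffle traffic (heaviest first, so the biggest aggregation
# wins are locked in early), then greedily assign each job to the
# currently least-loaded rack. This is a hypothetical stand-in for the
# paper's offline scheduler, which also sequences jobs within each rack.

def offline_schedule(jobs, num_racks):
    """jobs: list of (name, shuffle_bytes). Returns rack -> ordered job list."""
    racks = {r: [] for r in range(num_racks)}
    load = {r: 0 for r in range(num_racks)}
    for name, traffic in sorted(jobs, key=lambda j: j[1], reverse=True):
        r = min(load, key=load.get)       # least-loaded rack so far
        racks[r].append(name)
        load[r] += traffic
    return racks

jobs = [("A", 40), ("B", 10), ("C", 30), ("D", 20)]
plan = offline_schedule(jobs, num_racks=2)   # {0: ['A', 'B'], 1: ['C', 'D']}
```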
Cited by: 1
A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337841
Ian Bogle, K. Devine, M. Perego, S. Rajamanickam, George M. Slota
We present a new, distributed-memory parallel algorithm for detecting degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation, running in parallel and taking a negligible amount of computation time, so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed-memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and we provide the additional capability of dynamic detection of degenerate features during the simulation.
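The core labeling idea can be illustrated serially: seed each unvisited vertex with its own label and let a BFS propagate that label through its component, so detached pieces (icebergs) end up with distinct labels. The distributed, hinge-vertex-aware algorithm in the paper is considerably more involved; this only shows the basic mechanism:

```python
# Minimal serial sketch of BFS-style label propagation for finding
# disconnected components in a mesh graph. Each component inherits the id
# of the first vertex discovered in it.
from collections import deque

def component_labels(adj):
    """adj: dict vertex -> iterable of neighbours. Returns vertex -> label."""
    label = {}
    for start in adj:
        if start in label:
            continue
        label[start] = start          # seed a new component with its root id
        queue = deque([start])
        while queue:                  # BFS propagates the root's label
            v = queue.popleft()
            for w in adj[v]:
                if w not in label:
                    label[w] = start
                    queue.append(w)
    return label

# Two components: {0,1,2} is the grounded sheet, {3,4} a detached iceberg.
mesh = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = component_labels(mesh)   # vertices 3 and 4 get a separate label
```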
Cited by: 3
Efficient Data-Parallel Primitives on Heterogeneous Systems
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337920
Zhuohang Lai, Qiong Luo, Xiaolong Xie
Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, it is challenging to optimize them on a system consisting of heterogeneous processors. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: the GPU, the CPU, and the Xeon Phi co-processor. Our goal is to identify the key performance factors in the implementations of data-parallel primitive operations on different architectures and to develop general strategies for implementing these primitives efficiently on various platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for individual devices. With proper tuning, our optimized primitive implementations achieve performance comparable to the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies, whereas the GPU differs from them significantly, due to hardware differences among these devices, such as the efficiency of vectorization, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.
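As a reference point for one of the primitives named above, here is a sequential exclusive scan (prefix sum). Optimized CPU/GPU/Phi versions differ mainly in memory-access pattern and vectorization, but any of them must reproduce this semantics:

```python
# Sequential reference implementation of exclusive scan:
# out[i] = sum of xs[0..i-1], with out[0] = 0.

def exclusive_scan(xs):
    out, running = [], 0
    for x in xs:
        out.append(running)   # emit the sum of everything before x
        running += x
    return out

result = exclusive_scan([3, 1, 4, 1, 5])   # [0, 3, 4, 8, 9]
```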
Cited by: 2
Modeling the Performance of Atomic Primitives on Modern Architectures
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337901
F. Hoseini, A. Atalar, P. Tsigas
Utilizing the atomic primitives of a processor to access a memory location atomically is key to the correctness and feasibility of parallel software systems. The performance of atomics plays a significant role in the scalability and overall performance of parallel software systems. In this work, we study the performance (latency, throughput, fairness, and energy consumption) of atomic primitives in the context of the two common software execution settings that result in high- and low-contention access to shared memory. We perform and present an exhaustive study of the performance of atomics in these two application contexts and propose a performance model that captures their behavior. We consider two state-of-the-art architectures: Intel Xeon E5 and Xeon Phi (KNL). We propose a model centered around the bouncing of cache lines between threads that execute atomic primitives on those shared cache lines. The model is simple to use in practice, captures the behavior of atomics accurately under these execution scenarios, and facilitates algorithmic design decisions in multi-threaded programming.
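A cache-line-bouncing model of the kind described has a simple shape: uncontended cost plus one cache-line transfer per additional contender on the expected critical path. The constants and the linear form below are illustrative guesses, not the paper's fitted model:

```python
# Toy contention model for an atomic primitive on a shared cache line:
# latency grows with the number of threads bouncing the line between
# cores. base_ns and bounce_ns are made-up illustrative constants.

def atomic_latency_ns(threads, base_ns=20, bounce_ns=60):
    """Modeled latency of one atomic op with `threads` contenders on one line."""
    # Uncontended: just the local execution cost. Each extra contender adds
    # roughly one cache-line transfer to the expected critical path.
    return base_ns + bounce_ns * (threads - 1)

low = atomic_latency_ns(1)    # uncontended: 20 ns
high = atomic_latency_ns(8)   # high contention: 440 ns
```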
Cited by: 5
Improved Unconstrained Energy Functional Method for Eigensolvers in Electronic Structure Calculations
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337914
M. D. Ben, O. Marques, A. Canning
This paper reports on the performance of a preconditioned-conjugate-gradient-based iterative eigensolver using an unconstrained energy functional minimization scheme. In contrast to standard implementations, this scheme avoids an explicit reorthogonalization of the trial eigenvectors and becomes an attractive alternative for the solution of very large problems. The unconstrained formulation is implemented in the first-principles materials and chemistry code CP2K, which performs electronic structure calculations based on a density functional theory approximation to the solution of the many-body Schrödinger equation. We study the convergence of the unconstrained formulation, as well as its parallel scaling, on a Cray XC40 at the National Energy Research Scientific Computing Center (NERSC). The systems we use in our studies are bulk liquid water, a supramolecular gold(III)-complex catalyst, a bilayer of MoS2-WSe2, and a divacancy point defect in silicon, with the number of atoms ranging from 2,247 to 12,288. We show that the unconstrained formulation with an appropriate preconditioner has good convergence properties and scales well to 230k cores, roughly 38% of the full machine.
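The eigensolver above builds on preconditioned conjugate-gradient iteration. For orientation, here is the textbook PCG kernel solving a symmetric positive-definite system Ax = b with a Jacobi (diagonal) preconditioner; this is the basic iteration only, not the paper's unconstrained energy-functional scheme:

```python
# Textbook preconditioned conjugate gradient for an SPD system Ax = b,
# with a Jacobi (diagonal) preconditioner. Plain lists keep it dependency-free.

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                 # residual b - A@x for x = 0
    z = [r[i] / A[i][i] for i in range(n)]   # apply Jacobi preconditioner
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break                            # residual small enough: done
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg(A, b)   # exact solution is (1/11, 7/11)
```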
Cited by: 1
Adaptive Routing Reconfigurations to Minimize Flow Cost in SDN-Based Data Center Networks
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337861
Akbar Majidi, Xiaofeng Gao, S. Zhu, Nazila Jahanbakhsh, Guihai Chen
Data center networks have become heavily reliant on software-defined networking to orchestrate data transmission. To maintain optimal network configurations, a controller needs to solve the multi-commodity flow problem and globally update the network under tight time constraints. In this paper, we aim to minimize flow cost, or intuitively the average transmission delay, under reconfiguration budget constraints in data centers. We formulate this optimization problem as a constrained Markov Decision Process and propose a set of algorithms to solve it in a scalable manner. We first develop a propagation algorithm to identify the flows that are most affected in terms of latency and should be configured in the next network update. We then set a limitation range for updating them, improving adaptability and scalability by updating a smaller number of flows each time, which also keeps each update fast. Further, based on the drift-plus-penalty method from Lyapunov theory, we propose a heuristic policy that requires no prior information about flow demand and carries a performance guarantee minimizing the additive optimality gap. To the best of our knowledge, this is the first paper to study the range and frequency of flow reconfigurations, which has both theoretical and practical significance in this area. Extensive emulations and numerical simulations, whose results are much better than the estimated theoretical bound, show that our proposed policy outperforms state-of-the-art algorithms in latency by over 45% while improving adaptability and scalability.
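The drift-plus-penalty rule from Lyapunov optimization, which the abstract builds on, picks at each step the action minimizing a weighted sum of penalty (cost) and queue drift. The actions, costs, and arrival values below are invented for illustration:

```python
# Generic drift-plus-penalty decision rule: choose the action minimizing
# V * cost(action) + queue_backlog * arrival(action), where V trades off
# cost against queue stability. Action names and numbers are hypothetical.

def drift_plus_penalty_choice(actions, queue, V):
    """actions: list of (name, cost, arrival). Returns the chosen name."""
    return min(actions, key=lambda a: V * a[1] + queue * a[2])[0]

actions = [("reconfigure-now", 5.0, 0.0),   # costly, but drains the backlog
           ("defer", 1.0, 2.0)]             # cheap, but backlog keeps growing

cheap = drift_plus_penalty_choice(actions, queue=0.0, V=1.0)    # "defer"
urgent = drift_plus_penalty_choice(actions, queue=10.0, V=1.0)  # "reconfigure-now"
```

As the backlog grows, the queue term dominates and the policy switches to the costlier stabilizing action, which is the mechanism behind the optimality-gap guarantee.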
Cited by: 9
Unleashing the Scalability Potential of Power-Constrained Data Center in the Microservice Era
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337857
Xiaofeng Hou, Jiacheng Liu, Chao Li, M. Guo
Recent scale-out cloud services have undergone a shift from monolithic applications to microservices by putting each functionality into lightweight software containers. Although traditional data center power optimization frameworks excel at per-server or per-rack management, they can hardly make informed decisions when facing microservices that have different QoS requirements on a per-service basis. In a power-constrained data center, blindly budgeting power usage can lead to a power imbalance: microservices on the critical path may not receive an adequate power budget. This unavoidably hinders the growth of cloud productivity. To unleash the performance potential of the cloud in the microservice era, this paper investigates microservice-aware data center resource management. We model a microservice application as a bipartite graph and propose a metric called the microservice criticality factor (MCF) to measure the overall impact of performance scaling on a microservice from the whole application's perspective. We further devise ServiceFridge, a novel system framework that leverages MCF to jointly orchestrate software containers and control hardware power demand. Our detailed case study on a practical microservice application demonstrates that ServiceFridge allows a data center to reduce its dynamic power by 25% with slight performance loss. It improves the mean response time by 25.2% and the 90th-percentile tail latency by 18.0% compared with existing schemes.
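One simple way to capture the spirit of a criticality score like MCF is to rate each service by the fraction of end-to-end request paths that traverse it, so services on every path attract the largest power budget. This scoring rule is a hypothetical stand-in for illustration, not the paper's actual MCF definition:

```python
# Illustrative criticality score over request paths: a service appearing
# on every path scores 1.0 and is the worst place to starve of power.
# This is a guessed proxy for the MCF idea, not the metric from the paper.

def path_criticality(paths):
    """paths: list of request paths (lists of service names).
    Returns service -> fraction of paths that include it."""
    counts = {}
    for path in paths:
        for svc in set(path):                 # count each service once per path
            counts[svc] = counts.get(svc, 0) + 1
    return {svc: n / len(paths) for svc, n in counts.items()}

paths = [
    ["gateway", "auth", "catalog"],
    ["gateway", "auth", "cart"],
    ["gateway", "search"],
]
scores = path_criticality(paths)   # "gateway" lies on every path
```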
Cited by: 12
OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337828
Haozhao Wang, Song Guo, Ruixuan Li
When running in a Parameter Server (PS) architecture, Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays because, after pushing their updates, computing nodes (workers) have to wait for the global model to be communicated back from the master in every iteration. In this paper, we devise a new synchronization parallel mechanism named overlap synchronization parallel (OSP), in which the waiting time is removed by conducting computation and communication in an overlapped manner. We theoretically prove that our mechanism achieves the same convergence rate as sequential SGD for non-convex problems. Evaluations show that our mechanism significantly improves performance over the state-of-the-art ones, e.g., by 4× for both AlexNet and ResNet18 in terms of convergence speed.
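As a rough illustration of the overlap idea (a sketch, not the authors' implementation), the loop below runs the push/pull round trip of iteration t in a background thread while the gradient of iteration t+1 is being computed, so the worker never idles waiting for the server; the gradient computation and the "network" here are stand-in stubs:

```python
import threading
import time

def compute_gradient(step):
    time.sleep(0.01)          # stand-in for local gradient computation
    return step               # dummy "gradient" payload

def push_pull(grad, model):
    time.sleep(0.01)          # stand-in for the push + pull network round trip
    model["version"] += 1     # pretend the master applied `grad` and replied

def train_osp(steps):
    """OSP-style loop: communication of step t overlaps computation of
    step t+1, so the worker computes on a slightly stale model instead of
    blocking on the master every iteration."""
    model = {"version": 0}
    comm = None
    for t in range(steps):
        g = compute_gradient(t)            # overlaps the in-flight round trip
        if comm is not None:
            comm.join()                    # previous round trip completes
        comm = threading.Thread(target=push_pull, args=(g, model))
        comm.start()
    if comm is not None:
        comm.join()
    return model["version"]
```

With compute and communication each taking time c, a fully synchronous loop costs roughly 2c per step, while this overlapped loop approaches c per step; the trade-off is that each gradient is computed against a model that is one update stale, which is what the paper's convergence proof has to account for.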
Cited by: 15
Proceedings of the 48th International Conference on Parallel Processing