
Latest publications: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Optimizing Decentralized Learning with Local Heterogeneity using Topology Morphing and Clustering
Waqwoya Abebe, A. Jannesari
Recently, local peer topology has been shown to influence the overall convergence of decentralized learning (DL) graphs in the presence of data heterogeneity. In this paper, we demonstrate the advantages of constructing a proxy-based locally heterogeneous DL topology to enhance convergence and maintain data privacy. In particular, we propose a novel peer clumping strategy to efficiently cluster peers before arranging them in a final training graph. By showing how locally heterogeneous graphs outperform locally homogeneous graphs of similar size and from the same global data distribution, we present a strong case for topological pre-processing. Moreover, we demonstrate the scalability of our approach by showing how the proposed topological pre-processing overhead remains small in large graphs while the performance gains get even more pronounced. Furthermore, we show the robustness of our approach in the presence of network partitions.
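As an illustration of the kind of topological pre-processing described above, here is a minimal sketch (not the paper's actual clumping algorithm; the peer label histograms and cluster count are invented) that deals peers, sorted by their dominant label, round-robin into clusters so that each cluster ends up locally heterogeneous:

```python
import numpy as np

def clump_peers(label_hists, n_clusters):
    """Toy clumping sketch: sort peers by their dominant label, then deal
    them round-robin into clusters so each cluster mixes peers holding
    different local data distributions."""
    order = sorted(range(len(label_hists)),
                   key=lambda i: int(np.argmax(label_hists[i])))
    clusters = [[] for _ in range(n_clusters)]
    for rank, peer in enumerate(order):
        clusters[rank % n_clusters].append(peer)
    return clusters

# 6 peers, each holding mostly one of 3 classes (rows: peers, cols: class counts)
hists = np.array([[90, 5, 5], [80, 10, 10], [5, 90, 5],
                  [10, 80, 10], [5, 5, 90], [10, 10, 80]])
clusters = clump_peers(hists, 2)  # each cluster sees all three dominant classes
```

Each resulting cluster contains peers dominated by every class, which is the locally heterogeneous arrangement the paper argues improves convergence.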
DOI: 10.1109/CCGrid57682.2023.00041 (published 2023-05-01)
Citations: 0
An Optical Transceiver Reliability Study based on SFP Monitoring and OS-level Metric Data
Paolo Notaro, Qiao Yu, Soroush Haeri, Jorge Cardoso, M. Gerndt
The increasing demand for cloud computing drives the expansion in scale of datacenters and their internal optical networks, in the pursuit of higher bandwidth, higher reliability, and lower latency. Optical transceivers are essential elements of optical networks, yet their reliability has not been well studied compared to other hardware components. In this paper, we leverage large quantities of monitoring data from optical transceivers and OS-level metrics to provide statistical insights into the occurrence of optical transceiver failures. We estimate transceiver failure rates and normal operating ranges for monitored attributes, correlate early-observable patterns with known failure symptoms, and finally develop failure prediction models based on our analyses. Our results enable network administrators to deploy early-warning systems and enact predictive maintenance strategies, such as replacement or traffic re-routing, reducing the number of incidents and their associated costs.
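A minimal sketch of one analysis step described above, assuming a mean ± k·σ definition of the normal operating range (the attribute values below are invented; the paper derives its ranges from large-scale SFP monitoring data):

```python
import statistics

def operating_range(samples, k=3.0):
    """Estimate a normal operating band as mean +/- k * stddev.
    Illustrative only; the paper's ranges come from fleet-scale data."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu - k * sigma, mu + k * sigma

def flag_anomalies(samples, lo, hi):
    """Return indices of samples falling outside the operating band."""
    return [i for i, x in enumerate(samples) if not lo <= x <= hi]

# Hypothetical transmit-power readings (dBm); the last healthy window
# defines the band, then a degraded reading is flagged.
tx_power_dbm = [-2.1, -2.0, -2.2, -2.1, -2.0, -7.5, -2.1]
lo, hi = operating_range(tx_power_dbm[:5])
bad = flag_anomalies(tx_power_dbm, lo, hi)  # index 5 falls outside the band
```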
DOI: 10.1109/CCGrid57682.2023.00011 (published 2023-05-01)
Citations: 1
FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks
Zhengchun Liu, R. Kettimuthu, M. Papka, Ian T. Foster
Supercomputer scheduling policies commonly result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node × time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.
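The paper formulates hole-filling as a MILP; as a simplified stand-in, the fragment-placement idea can be illustrated with a 0/1 knapsack over a single node-hour hole (fragment costs and utilities below are invented):

```python
def pack_hole(capacity, fragments):
    """Pick a subset of training fragments to fill an idle node-hour hole,
    maximizing total utility: a 0/1 knapsack, standing in for the paper's
    full MILP over dynamically changing holes."""
    best = {0: (0, [])}  # node-hours used -> (best utility, fragment indices)
    for idx, (cost, utility) in enumerate(fragments):
        # snapshot items so each fragment is placed at most once
        for used, (val, picks) in list(best.items()):
            nu = used + cost
            if nu <= capacity and (nu not in best or best[nu][0] < val + utility):
                best[nu] = (val + utility, picks + [idx])
    return max(best.values())

# fragments: (node-hours needed, utility); the hole has 8 node-hours free
frags = [(5, 10), (4, 40), (6, 30), (3, 50)]
value, chosen = pack_hole(8, frags)  # fragments 1 and 3 fit best
```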
DOI: 10.1109/CCGrid57682.2023.00036 (published 2023-05-01)
Citations: 0
Artifact Evaluation Committee Members
DOI: 10.1109/ccgrid57682.2023.00009 (published 2023-05-01)
Citations: 0
An Asynchronous Dataflow-Driven Execution Model For Distributed Accelerator Computing
Philip Salzmann, Fabian Knorr, Peter Thoman, P. Gschwandtner, Biagio Cosenza, T. Fahringer
While domain-specific HPC software packages continue to thrive and are vital to many scientific communities, a general purpose high-productivity GPU cluster programming model that facilitates experimentation for non-experts remains elusive. We demonstrate how Celerity, a high-level C++ programming model for distributed accelerator computing based on the open SYCL standard, allows for the quick development of - and experimentation with - distributed applications. To achieve scalability on large machines, we replace Celerity's existing master/worker scheduling model with a fully distributed scheme that reduces the worst-case scheduling complexity from quadratic to linear while maintaining the existing programming interface. We then show how this declarative, data-flow based API paired with a point-to-point communication model with eager data pushing can effectively expose and leverage opportunities for latency hiding and computation/communication overlapping with minimal or no manual guidance. We demonstrate how Celerity exhibits very good scalability on multiple benchmarks from several scientific domains and up to 128 GPUs.
DOI: 10.1109/CCGrid57682.2023.00018 (published 2023-05-01)
Citations: 1
Blockchain Proportional Governance Reconfiguration: Mitigating a Governance Oligarchy
Deepal Tennakoon, V. Gramoli
Blockchain governance is paramount for securely leading a large group of users toward the same decisions without disputes over the legitimacy of one blockchain instance relative to another. As of today, there is no efficient way of protecting this governance against an oligarchy. This paper aims to add a new dimension to the security of blockchains by proposing a solution known as proportional governance reconfiguration. This solution mitigates the formation of an oligarchy by (1) electing governors proportionally using a proportional multi-winner election protocol and (2) reconfiguring the governance automatically and periodically. The proportional governance reconfiguration relies on a Solidity-based implementation, making it compatible and usable with many smart-contract-supporting blockchains. We prove our solution solves the proportional governance problem, and we evaluate it on two smart-contract-supporting blockchains, Ethereum-PoA and Smart Redbelly Blockchain. Our results indicate that our proportional governance can elect 200 governors within 6–12 minutes when 1000 voters from 5 continents vote for 500 candidates.
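The abstract does not specify which proportional multi-winner protocol is used; as one illustrative example of proportional allocation, a D'Hondt-style computation (vote counts invented, and allocating seats to voting blocs rather than individual governors) looks like:

```python
import heapq

def dhondt(votes, seats):
    """D'Hondt proportional seat allocation: repeatedly award a seat to the
    bloc with the highest quotient v / (seats_won + 1). Illustrative only;
    the paper's protocol elects individual governors."""
    heap = [(-v, bloc) for bloc, v in votes.items()]  # max-heap via negation
    heapq.heapify(heap)
    won = {bloc: 0 for bloc in votes}
    for _ in range(seats):
        _, bloc = heapq.heappop(heap)
        won[bloc] += 1
        heapq.heappush(heap, (-votes[bloc] / (won[bloc] + 1), bloc))
    return won

seats = dhondt({"A": 100, "B": 60, "C": 40}, 10)  # proportional to 100:60:40
```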
DOI: 10.1109/CCGrid57682.2023.00057 (published 2023-05-01)
Citations: 0
Use of Cost Surface Analysis and Stream Order Analysis for Computing Shortest Paths
Yogesh Dasgaonkar
We find that current state-of-the-art shortest-path navigation systems have a computational bottleneck that limits their scalability. To solve this problem, our first contribution is an important result showing that two points in the environment relate to each other by more geometric criteria than just the distance between them. Our second contribution shows that the environment's geometry allows points in the environment to be uniquely distinguished based on the lengths of the shortest paths meeting at each point. Using this result, we order the points so that their ordering uniquely distinguishes the shortest path between any source and destination pair. Through these two results, we propose a system that resolves the computational bottleneck using fewer processing resources and achieves higher efficiency than the state of the art.
DOI: 10.1109/CCGrid57682.2023.00067 (published 2023-05-01)
Citations: 0
Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences
Chen-Chun Chen, Kawthar Shafie Khorassani, Goutham Kalikrishna Reddy Kuncham, Rahul Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda
As the demand for computing power from High-Performance Computing (HPC) and Deep Learning (DL) applications increases, there is a growing trend of equipping modern exascale clusters with accelerators, such as NVIDIA and AMD GPUs. GPU-aware MPI libraries allow applications to communicate between GPUs in a parallel environment with high productivity and performance. Although NVIDIA and AMD GPUs have dominated the accelerator market for top supercomputers over the past several years, Intel has recently developed and released its GPUs and associated software stack, and provided a unified programming model, referred to as oneAPI, to program them. The emergence of Intel GPUs drives the need for initial MPI-level GPU-aware support that utilizes the underlying software stack specific to these GPUs, as well as a thorough evaluation of communication. In this paper, we propose a GPU-aware MPI library for Intel GPUs using oneAPI and a SYCL backend. We delve into our experiments using Intel GPUs and the challenges to consider at the MPI layer when adding GPU-aware support using the software stack provided by Intel for their GPUs. We explore different memory allocation approaches and benchmark the memory copy performance with Intel GPUs. We propose implementations based on our experiments on Intel GPUs to support point-to-point GPU-aware MPI operations and show the high adaptability of our approach by extending the implementations to MPI collective operations, such as MPI_Bcast and MPI_Reduce. We evaluate the benefits of our implementations at the benchmark level by extending support for Intel GPU buffers in the OSU Micro-Benchmarks. Our implementations provide up to 1.8x and 2.2x speedups on point-to-point latency using device buffers at small message sizes compared to Intel MPI and a naive benchmark, respectively, and up to 1.3x and 1.5x speedups at large message sizes. At collective MPI operations, our implementations show 8x and 5x speedups for MPI_Allreduce and MPI_Allgather at large messages. At the application level, our implementations provide up to 40% improvement for 3DStencil compared to Intel MPI.
DOI: 10.1109/CCGrid57682.2023.00022 (published 2023-05-01)
Citations: 1
Accelerating Hybrid DFT Simulations Using Performance Modeling on Supercomputers
Yosuke Oyama, Takumi Honda, Atsushi Ishikawa, Koichi Shirahata
Density Functional Theory (DFT) is an electronic-structure theory that computes the electronic energy of atoms and molecules from their electron density. Among several DFT methods, one called “hybrid DFT” adds the Hartree-Fock exchange energy to the original DFT exchange energy, and it improves the accuracy of the estimation of energy. However, this introduces additional computational costs, preventing its wide application for large-scale calculations. In light of those issues, a performance model to tune the computational configurations for hybrid DFT software automatically is proposed. The proposed model makes it possible to exhaustively search for parameters to minimize computation time without having to execute actual calculations with all parameter combinations. Several techniques for optimizing hybrid DFT, specially designed for the Fugaku supercomputer, are also proposed. It is concluded that combining all approaches reduces node-time cost by 2.23x and 2.68x for a 52-atom input on Fugaku and ABCI, respectively.
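A minimal sketch of the proposed exhaustive model-driven search, with a hypothetical performance model and parameter grid standing in for the paper's hybrid-DFT-specific ones:

```python
from itertools import product

def best_config(predict, grid):
    """Score every parameter combination with the performance model instead
    of running real calculations, and return the fastest configuration."""
    configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
    return min(configs, key=predict)

# Hypothetical model: compute time shrinks with more nodes and larger
# blocks, while per-node and per-block overheads grow.
def predict(cfg):
    n, b = cfg["nodes"], cfg["block"]
    return 1000.0 / (n * b) + 0.5 * n + 2.0 * b

grid = {"nodes": [1, 2, 4, 8, 16], "block": [1, 2, 4]}
cfg = best_config(predict, grid)  # exhaustive search over 15 combinations
```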
DOI: 10.1109/ccgrid57682.2023.00055 (published 2023-05-01)
Citations: 0
Chronica: A Data-Imbalance-Aware Scheduler for Distributed Deep Learning
Sanha Maeng, G. Moon, Sungyong Park
One of the major challenges in distributed deep learning is attenuating the straggler problem. Stragglers increase synchronization latency and significantly inhibit the convergence of the deep learning model. We empirically observe that imbalanced data samples worsen the straggler problem and make the convergence of the deep learning model slower. However, existing approaches such as BOA and EP4DDL have not addressed data imbalance issues while solving the straggler problem. To overcome the straggler and data imbalance problems, we propose Chronica, a new data-imbalance-aware scheduler. Based on the size of the data samples and the configuration of each worker, Chronica elaborately predicts the training time required for each worker. Chronica then provides equivalent training time to each of the workers, alleviating both step- and epoch-level straggler problems. Furthermore, Chronica suggests a new parameter synchronization scheme to achieve fast convergence based on the weighted average of the training workload on each worker. Our extensive evaluation using four deep learning models on 32 Amazon EC2 GPU instances showed that Chronica achieves up to 3.19 times speedup over the state-of-the-art systems.
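The scheduling idea in the abstract — predict per-worker training time from sample sizes and worker configuration, balance the partitions so predicted times are equal, then synchronize parameters with a workload-weighted average — can be sketched as follows. This is an illustrative sketch only; the function names and the simple bytes-per-throughput time model are assumptions, not Chronica's actual implementation:

```python
import numpy as np

def balance_partition(sample_sizes, throughputs):
    """Greedily assign samples to workers so that predicted training
    times stay as equal as possible: take the largest samples first
    and always give the next one to the worker whose predicted finish
    time is currently lowest."""
    order = np.argsort(sample_sizes)[::-1]      # largest samples first
    load = np.zeros(len(throughputs))           # bytes assigned per worker
    assignment = [[] for _ in throughputs]
    for i in order:
        w = int(np.argmin(load / throughputs))  # worker finishing earliest
        assignment[w].append(int(i))
        load[w] += sample_sizes[i]
    return assignment, load / throughputs       # per-worker predicted times

def weighted_average_sync(params, workloads):
    """Aggregate per-worker parameter tensors, weighted by each
    worker's training workload."""
    w = np.asarray(workloads, dtype=float)
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, params))

# Two workers, one twice as fast as the other: the fast worker ends up
# with roughly twice the data, so both finish at the same predicted time.
assignment, times = balance_partition([10, 8, 6, 4, 2], [2.0, 1.0])
```

Balancing predicted *time* rather than sample *count* is what distinguishes this kind of scheduler from a plain even split, since heterogeneous sample sizes and worker speeds make equal-count partitions finish at different times.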
{"title":"Chronica: A Data-Imbalance-Aware Scheduler for Distributed Deep Learning","authors":"Sanha Maeng, G. Moon, Sungyong Park","doi":"10.1109/CCGrid57682.2023.00033","DOIUrl":"https://doi.org/10.1109/CCGrid57682.2023.00033","url":null,"abstract":"One of the major challenges in distributed deep learning is attenuating the straggler problem. Stragglers increase synchronization latency and significantly inhibit the convergence of the deep learning model. We empirically observe that imbalanced data samples worsen the straggler problem and make the convergence of the deep learning model slower. However, existing approaches such as BOA and EP4DDL have not addressed data imbalance issues while solving the straggler problem. To overcome the straggler and data imbalance problems, we propose Chronica, a new data-imbalance-aware scheduler. Based on the size of the data samples and the configuration of each worker, Chronica elaborately predicts the training time required for each worker. Chronica then provides equivalent training time to each of the workers, alleviating both step- and epoch-level straggler problems. Furthermore, Chronica suggests a new parameter synchronization scheme to achieve fast convergence based on the weighted average of the training workload on each worker. Our extensive evaluation using four deep learning models on 32 Amazon EC2 GPU instances showed that Chronica achieves up to 3.19 times speedup over the state-of-the-art systems.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131896820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Journal
2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)