{"title":"Session details: Session 2: Scientific Computing Based on Cloud","authors":"Dmitry Duplyakin","doi":"10.1145/3341817","DOIUrl":"https://doi.org/10.1145/3341817","url":null,"abstract":"","PeriodicalId":164694,"journal":{"name":"Proceedings of the 10th Workshop on Scientific Cloud Computing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123705965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Horizontal or Vertical?: A Hybrid Approach to Large-Scale Distributed Machine Learning. Jinkun Geng, Dan Li, Shuai Wang. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331461 (https://doi.org/10.1145/3322795.3331461).
Abstract: Data parallelism and model parallelism are the two typical parallel modes for distributed machine learning (DML). Traditionally, DML has relied mainly on data parallelism, which maintains one model instance on each node and synchronizes the model parameters at the end of every iteration. However, as models grow larger, communication cost and GPU memory consumption become significant; data parallelism therefore fails to work efficiently at large scale, and model-parallel solutions have been proposed in recent years. In this paper, we comprehensively discuss the benefits and drawbacks of both approaches. Based on this comparative analysis, we propose Hove, a hybrid approach that combines data parallelism and model parallelism to balance the overheads and achieve high performance for large-scale DML.
Deconstructing the 2017 Changes to AWS Spot Market Pricing. Matt Baughman, Simon Caton, C. Haas, Ryan Chard, R. Wolski, Ian T. Foster, K. Chard. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331465 (https://doi.org/10.1145/3322795.3331465).
Abstract: The Amazon Web Services spot market sells excess computing capacity at a reduced price and with reduced reliability guarantees. The low cost of the spot market has led to widespread adoption in industry and science. However, one challenge of using the spot market is that it is intentionally opaque, so users have little insight into the underlying dynamics. In late 2017, the mechanisms underlying the spot market were significantly altered: bid prices are no longer used to clear capacity, and as a result pricing is much less volatile. In this paper, we revisit prior work with the aim of analyzing the differences in market dynamics between the pre-change and post-change spot instance markets. We then use these analyses to highlight possible properties of the current and previous pricing algorithms, including artificial manipulation, dynamic algorithm adjustment, and persistent trends in market supply, demand, and pricing.
ElasticPipe: An Efficient and Dynamic Model-Parallel Solution to DNN Training. Jinkun Geng, Dan Li, Shuai Wang. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331463 (https://doi.org/10.1145/3322795.3331463).
Abstract: Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overhead and GPU memory consumption. Recent pioneering works have therefore attempted to train DNNs with model parallelism. However, model partitioning remains a major concern, and a static partition fails to adapt to the ever-changing computation environment of a cloud cluster. This paper proposes ElasticPipe, which trains the neural network with pipe-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe holds only part of the whole model, leading to much lower communication cost and GPU memory use. More importantly, ElasticPipe can dynamically tune the workload distribution among nodes, mitigating the straggler effect common in cloud environments. Our preliminary experiments show that, compared to data-parallel baselines, ElasticPipe reduces training time by up to 89.03% without stragglers and by up to 76.72% in the presence of stragglers. ElasticPipe also outperforms its static counterpart by up to 28.81% in training performance when stragglers are involved.
Towards a Smart, Internet-Scale Cache Service for Data Intensive Scientific Applications. Yubo Qin, Anthony Simonet, Philip E. Davis, Azita Nouri, Zhe Wang, M. Parashar, I. Rodero. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331464 (https://doi.org/10.1145/3322795.3331464).
Abstract: Data and services provided by shared facilities, such as large-scale observing facilities, have become important enablers of scientific insights and discoveries across many science and engineering disciplines. Ensuring satisfactory quality of service can be challenging for these facilities because of their remote locations, the distributed nature of their instruments, observatories, and users, and the rapid growth of data volumes and rates. This research explores how knowledge of a facility's usage patterns, coupled with emerging cyberinfrastructure, can be leveraged to improve its performance, usability, and scientific impact. We propose a framework built around a smart, internet-scale cache augmented with prefetching and data placement strategies to improve data delivery performance for scientific facilities. Our evaluations, based on the NSF Ocean Observatories Initiative, demonstrate that the framework can predict user requests and reduce data movement across networks by more than 56%.
{"title":"Proceedings of the 10th Workshop on Scientific Cloud Computing","authors":"","doi":"10.1145/3322795","DOIUrl":"https://doi.org/10.1145/3322795","url":null,"abstract":"","PeriodicalId":164694,"journal":{"name":"Proceedings of the 10th Workshop on Scientific Cloud Computing","volume":"409 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123814972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}