{"title":"Session details: Session 2: Scientific Computing Based on Cloud","authors":"Dmitry Duplyakin","doi":"10.1145/3341817","DOIUrl":"https://doi.org/10.1145/3341817","url":null,"abstract":"","PeriodicalId":164694,"journal":{"name":"Proceedings of the 10th Workshop on Scientific Cloud Computing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123705965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Horizontal or Vertical?: A Hybrid Approach to Large-Scale Distributed Machine Learning. Jinkun Geng, Dan Li, Shuai Wang. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331461 (https://doi.org/10.1145/3322795.3331461).
Abstract: Data parallelism and model parallelism are the two typical parallel modes for distributed machine learning (DML). Traditionally, DML has relied mainly on data parallelism, which maintains one model instance on each node and synchronizes the model parameters at the end of every iteration. However, as models grow larger, communication cost and GPU memory consumption become significant; data parallelism therefore fails to work efficiently at large scale, and model-parallel solutions have been proposed in recent years. In this paper, we comprehensively discuss the benefits and drawbacks of both approaches. Based on this comparative analysis, we propose Hove, a hybrid approach that combines data parallelism and model parallelism to balance the overheads and achieve high performance for large-scale DML.
Deconstructing the 2017 Changes to AWS Spot Market Pricing. Matt Baughman, Simon Caton, C. Haas, Ryan Chard, R. Wolski, Ian T. Foster, K. Chard. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331465 (https://doi.org/10.1145/3322795.3331465).
Abstract: The Amazon Web Services spot market sells excess computing capacity at a reduced price and with reduced reliability guarantees. The low cost of the spot market has led to widespread adoption in industry and science. However, one challenge of using the spot market is that it is intentionally opaque, so users have little insight into the underlying dynamics. In late 2017, the mechanisms underlying the spot market were significantly altered: bid prices are no longer used to clear capacity, and as a result pricing is much less volatile. In this paper, we revisit prior work with the aim of analyzing the differences in market dynamics between the pre-change and post-change spot instance markets. We then use these analyses to highlight possible properties of the current and previous pricing algorithms, including artificial manipulation, dynamic algorithm adjustment, and persistent trends in market supply, demand, and pricing.
ElasticPipe: An Efficient and Dynamic Model-Parallel Solution to DNN Training. Jinkun Geng, Dan Li, Shuai Wang. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331463 (https://doi.org/10.1145/3322795.3331463).
Abstract: Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overhead and GPU memory consumption. Recent pioneering works have therefore attempted to train DNNs with model parallelism. However, model partitioning remains a major concern, and a static partition fails to adapt to the ever-changing computation environment of a cloud cluster. This paper proposes ElasticPipe, which trains the neural network with pipe-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe holds only part of the whole model, leading to much lower communication cost and GPU memory use. More importantly, ElasticPipe can dynamically tune the workload distribution among nodes, mitigating the straggler effect common in cloud environments. Our preliminary experiments show that, compared to data-parallel baselines, ElasticPipe reduces training time by up to 89.03% without stragglers and by up to 76.72% in the presence of stragglers. ElasticPipe also outperforms its static counterpart by up to 28.81% in training performance when stragglers are involved.
Towards a Smart, Internet-Scale Cache Service for Data Intensive Scientific Applications. Yubo Qin, Anthony Simonet, Philip E. Davis, Azita Nouri, Zhe Wang, M. Parashar, I. Rodero. In: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019-06-17. DOI: 10.1145/3322795.3331464 (https://doi.org/10.1145/3322795.3331464).
Abstract: Data and services provided by shared facilities, such as large-scale observing facilities, have become important enablers of scientific insights and discoveries across many science and engineering disciplines. Ensuring satisfactory quality of service can be challenging for these facilities because of their remote locations, the distributed nature of their instruments, observatories, and users, and the rapid growth of data volumes and rates. This research explores how knowledge of a facility's usage patterns, coupled with emerging cyberinfrastructure, can be leveraged to improve its performance, usability, and scientific impact. We propose a framework built around a smart, internet-scale cache augmented with prefetching and data placement strategies to improve data delivery performance for scientific facilities. Our evaluations, based on the NSF Ocean Observatories Initiative, demonstrate that the framework can predict user requests and reduce data movement across networks by more than 56%.
{"title":"Proceedings of the 10th Workshop on Scientific Cloud Computing","authors":"","doi":"10.1145/3322795","DOIUrl":"https://doi.org/10.1145/3322795","url":null,"abstract":"","PeriodicalId":164694,"journal":{"name":"Proceedings of the 10th Workshop on Scientific Cloud Computing","volume":"409 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123814972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}