High Availability on Jetstream: Practices and Lessons Learned

John Michael Lowe, Jeremy Fischer, Sanjana Sudarshan, George W. Turner, C. Stewart, David Y. Hancock
Proceedings of the 9th Workshop on Scientific Cloud Computing, published 2018-06-11. DOI: 10.1145/3217880.3217884 (https://doi.org/10.1145/3217880.3217884)

Abstract

Research computing has traditionally relied on high performance computing (HPC) clusters, a service model that does not lend itself to high availability without doubling computational and storage capacity. System maintenance such as security patching, firmware updates, and other upgrades generally meant that the system was unavailable for the duration of the work unless redundant HPC systems and storage were in place. Although efforts were often made to limit downtime, when maintenance became necessary the window might last one to two hours or as long as an entire day. As the National Science Foundation (NSF) began funding non-traditional research systems, providing higher availability for researchers became a focus for service providers. One design element of Jetstream was geographic dispersion to maximize availability. This was the first of several design elements intended to make Jetstream exceed the NSF's availability requirements. We examine the design steps employed, the components of the system and how availability was considered for each during deployment, how maintenance is handled, and the lessons learned from the design and implementation of the Jetstream cloud.