Team-Based Message Logging: Preliminary Results

Esteban Meneses, C. Mendes, L. Kalé
{"title":"Team-Based Message Logging: Preliminary Results","authors":"Esteban Meneses, C. Mendes, L. Kalé","doi":"10.1109/CCGRID.2010.110","DOIUrl":null,"url":null,"abstract":"Fault tolerance will be a fundamental imperative in the next decade as machines containing hundreds of thousands of cores will be installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since it makes all the processors roll back to their latest checkpoint in case of a single failure in one of the processors. In-memory message logging is an alternative that avoids this global restoration process and instead replays the messages to the failed processor. However, there is a large memory overhead associated with message logging because each message must be logged so it can be played back if a failure occurs. In this paper, we introduce a technique to alleviate the demand of memory in message logging by grouping processors into teams. These teams act as a failure unit: if one team member fails, all the other members in that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The savings in memory produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit and size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores manages to reduce the memory overhead of message logging by 62%.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"144 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2010.110","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39

Abstract

Fault tolerance will be a fundamental imperative in the next decade as machines containing hundreds of thousands of cores will be installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since it makes all the processors roll back to their latest checkpoint in case of a single failure in one of the processors. In-memory message logging is an alternative that avoids this global restoration process and instead replays the messages to the failed processor. However, there is a large memory overhead associated with message logging because each message must be logged so it can be played back if a failure occurs. In this paper, we introduce a technique to alleviate the demand of memory in message logging by grouping processors into teams. These teams act as a failure unit: if one team member fails, all the other members in that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The savings in memory produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit and size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores manages to reduce the memory overhead of message logging by 62%.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于团队的消息记录:初步结果
在未来十年,容错将是一个基本的必要条件,因为包含数十万个核心的机器将被安装在不同的地方。在这种情况下,传统的检查点/重新启动模型似乎不是一个合适的选择,因为如果其中一个处理器出现单个故障,它会使所有处理器回滚到最新的检查点。内存中消息日志记录是一种替代方法,它可以避免这种全局恢复过程,而是将消息重放到故障处理器。但是,与消息日志记录相关的内存开销很大,因为必须记录每条消息,以便在发生故障时回放。在本文中,我们介绍了一种通过分组处理器来减少消息日志对内存需求的技术。这些团队充当一个故障单元:如果一个团队成员失败,该团队中的所有其他成员回滚到他们最近的检查点并开始恢复过程。这消除了在团队中记录消息内容的需要。这种方法所节省的内存取决于应用程序的特征、每个计算单元发送的消息数量以及这些消息的大小。我们在多个基准测试中展示了有希望的结果。例如,在512核上运行类D的NPB-CG代码设法将消息日志的内存开销减少62%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
In Search of Visualization Metaphors for PlanetLab Multi-criteria Content Adaptation Service Selection Broker Enabling the Next Generation of Scalable Clusters Development and Support of Platforms for Research into Rare Diseases Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1