Enhancing reliability and response times via replication in computing clusters

Z. Qiu, Juan F. Pérez
Published in: 2015 IEEE Conference on Computer Communications (INFOCOM), 2015-08-24
DOI: 10.1109/INFOCOM.2015.7218512 (https://doi.org/10.1109/INFOCOM.2015.7218512)
Citations: 17

Abstract

Computing clusters are widely deployed for scientific and engineering applications that require intensive computation and massive data operations. Because applications and resources in a cluster are subject to failures, fault-tolerance strategies are commonly adopted, sometimes at the expense of additional delays in job response times or unnecessary increases in resource usage. In this paper, we explore concurrent replication with canceling, a fault-tolerance approach in which jobs and their replicas are processed concurrently, and the successful completion of either a job or one of its replicas triggers the removal of the remaining copies. We propose a stochastic model to study how this approach affects the cluster's service level objectives (SLOs), particularly the offered response time percentiles. In addition to the expected gains in reliability, the proposed model allows us to determine the utilization regions in which introducing replication with canceling effectively reduces response times. Moreover, we show how this model can support resource provisioning decisions with reliability and response time guarantees.
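The replication-with-canceling mechanism described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's stochastic model: it assumes exponentially distributed service times, independent failures per copy, and no queueing or contention between copies, so it shows only the potential gains and none of the utilization-dependent costs the paper analyzes. The function name `simulate` and all parameter names and values are hypothetical.

```python
import random


def simulate(num_jobs, replicas, failure_prob, mean_service, seed=0):
    """Estimate (success_rate, p95_response_time) when each job runs as
    `replicas` concurrent copies and the first successful copy cancels the rest.

    Assumptions (not from the paper): exponential service times, independent
    failures per copy, and no queueing delay.
    """
    rng = random.Random(seed)
    finished = []  # response times of jobs that completed successfully
    for _ in range(num_jobs):
        ok_times = []
        for _ in range(replicas):
            service = rng.expovariate(1.0 / mean_service)
            failed = rng.random() < failure_prob
            if not failed:
                ok_times.append(service)
        if ok_times:
            # The earliest successful copy completes the job; the others are canceled.
            finished.append(min(ok_times))
    finished.sort()
    success_rate = len(finished) / num_jobs
    if finished:
        p95 = finished[int(0.95 * (len(finished) - 1))]
    else:
        p95 = float("nan")
    return success_rate, p95


if __name__ == "__main__":
    for copies in (1, 2):
        rate, p95 = simulate(20000, copies, failure_prob=0.1, mean_service=1.0)
        print(f"copies={copies} success_rate={rate:.3f} p95={p95:.3f}")
```

Because the sketch ignores the extra load that replicas place on shared servers, it cannot reproduce the paper's finding that the response time benefit holds only in certain utilization regions; capturing that would require a queueing model.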