A characteristic study on failures of production distributed data-parallel programs

Sihan Li, Hucheng Zhou, Haoxiang Lin, Tian Xiao, Haibo Lin, Wei Lin, Tao Xie
{"title":"A characteristic study on failures of production distributed data-parallel programs","authors":"Sihan Li, Hucheng Zhou, Haoxiang Lin, Tian Xiao, Haibo Lin, Wei Lin, Tao Xie","doi":"10.1109/ICSE.2013.6606646","DOIUrl":null,"url":null,"abstract":"SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in pipeline by thousands of machines. There are tens of thousands of SCOPE jobs executed on Microsoft clusters per day, while some of them fail after a long execution time and thus waste tremendous resources. Reducing SCOPE failures would save significant resources. This paper presents a comprehensive characteristic study on 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes while row-level failures (62%) are mainly caused by exceptional data; (3) 93% fixes do not change data processing logic; (4) there are 8% failures with root cause not at the failure-exposing stage, making current debugging practice insufficient in this case. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.","PeriodicalId":322423,"journal":{"name":"2013 35th International Conference on Software Engineering (ICSE)","volume":"170 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 35th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE.2013.6606646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36

Abstract

SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in pipeline by thousands of machines. There are tens of thousands of SCOPE jobs executed on Microsoft clusters per day, while some of them fail after a long execution time and thus waste tremendous resources. Reducing SCOPE failures would save significant resources. This paper presents a comprehensive characteristic study on 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes while row-level failures (62%) are mainly caused by exceptional data; (3) 93% fixes do not change data processing logic; (4) there are 8% failures with root cause not at the failure-exposing stage, making current debugging practice insufficient in this case. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
生产分布式数据并行程序故障特性研究
SCOPE被来自微软必应数十个不同产品团队的数千名开发人员采用,用于日常网络规模的数据处理,包括索引构建、搜索排名和广告显示。SCOPE作业由类似sql的声明性查询和命令式c#用户定义函数(udf)组成,它们由数千台机器在流水线中执行。每天在Microsoft集群上执行数以万计的SCOPE作业,而其中一些在执行时间较长后失败,从而浪费了大量资源。减少SCOPE失败将节省大量资源。本文结合微软Bing的调试统计数据,对200个SCOPE故障/修复和50个SCOPE故障进行了全面的特征研究,不仅调查了主要的故障类型、故障来源和修复,还调查了当前的调试实践。我们的主要发现包括:(1)大多数失败(84.5%)是由于数据处理缺陷而不是代码逻辑缺陷引起的;(2)表级故障(22.5%)主要由程序员的错误和频繁的数据模式更改引起,而行级故障(62%)主要由异常数据引起;(3) 93%的修复不改变数据处理逻辑;(4)有8%的故障根本原因不在故障暴露阶段,当前的调试实践在这种情况下是不够的。我们的研究结果为数据并行程序的未来发展提供了有价值的指导。我们认为这些指导方针并不局限于SCOPE,也可以推广到其他类似的数据并行平台。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Studios in software engineering education: Towards an evaluable model Not going to take this anymore: Multi-objective overtime planning for Software Engineering projects 3rd International workshop on collaborative teaching of globally distributed software development (CTGDSD 2013) TestEvol: A tool for analyzing test-suite evolution A characteristic study on failures of production distributed data-parallel programs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1