Reflections on the NASA MDP data sets

David Gray, David Bowes, N. Davey, Yi Sun, B. Christianson
IET Softw., pp. 549-558. Published 2012-11-16. DOI: https://doi.org/10.1049/iet-sen.2011.0132
Citation count: 69 (Semantic Scholar)

Abstract

Background: The NASA metrics data program (MDP) data sets have been heavily used in software defect prediction research. Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context. Method: A thorough exploration of all 13 original NASA data sets, followed by various experiments demonstrating the potential impact of duplicate data points in data mining. Conclusions: First, researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Second, the bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings. This is mainly because repeated/duplicate data points can cause substantial amounts of training and testing data to be identical.