MIC check: A correlation tactic for ESE data

Daryl Posnett, Premkumar T. Devanbu, V. Filkov
{"title":"MIC check: A correlation tactic for ESE data","authors":"Daryl Posnett, Premkumar T. Devanbu, V. Filkov","doi":"10.1109/MSR.2012.6224295","DOIUrl":null,"url":null,"abstract":"Empirical software engineering researchers are concerned with understanding the relationships between outcomes of interest, e.g. defects, and process and product measures. The use of correlations to uncover strong relationships is a natural precursor to multivariate modeling. Unfortunately, correlation coefficients can be difficult and/or misleading to interpret. For example, a strong correlation occurs between variables that stand in a polynomial relationship; this may lead one mistakenly, and eventually misleadingly, to model a polynomially related variable in a linear regression. Likewise, a non-monotonic functional, or even non-functional relationship might be entirely missed by a correlation coefficient. Outliers can influence standard correlation measures, tied values can unduly influence even robust non-parametric rank correlation, measures, and smaller sample sizes can cause instability in correlation measures. A new bivariate measure of association, Maximal Information Coefficient (MIC) [1], promises to simultaneously discover if two variables have: a) any association, b) a functional relationship, and c) a nonlinear relationship. The MIC is a very useful complement to standard and rank correlation measures. It separately characterizes the existence of a relationship and its precise nature; thus, it enables more informed choices in modeling non-functional and nonlinear relationships, and a more nuanced indicator of potential problems with the values reported by standard and rank correlation measures. We illustrate the use of MIC using a variety of software engineering metrics. We study and explain the distributional properties of MIC and related measures in software engineering data, and illustrate the value of these measures for the empirical software engineering researcher.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2012.6224295","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Empirical software engineering researchers are concerned with understanding the relationships between outcomes of interest, e.g. defects, and process and product measures. The use of correlations to uncover strong relationships is a natural precursor to multivariate modeling. Unfortunately, correlation coefficients can be difficult and/or misleading to interpret. For example, a strong correlation occurs between variables that stand in a polynomial relationship; this may lead one mistakenly, and eventually misleadingly, to model a polynomially related variable in a linear regression. Likewise, a non-monotonic functional, or even non-functional relationship might be entirely missed by a correlation coefficient. Outliers can influence standard correlation measures, tied values can unduly influence even robust non-parametric rank correlation, measures, and smaller sample sizes can cause instability in correlation measures. A new bivariate measure of association, Maximal Information Coefficient (MIC) [1], promises to simultaneously discover if two variables have: a) any association, b) a functional relationship, and c) a nonlinear relationship. The MIC is a very useful complement to standard and rank correlation measures. It separately characterizes the existence of a relationship and its precise nature; thus, it enables more informed choices in modeling non-functional and nonlinear relationships, and a more nuanced indicator of potential problems with the values reported by standard and rank correlation measures. We illustrate the use of MIC using a variety of software engineering metrics. We study and explain the distributional properties of MIC and related measures in software engineering data, and illustrate the value of these measures for the empirical software engineering researcher.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
MIC检查:ESE数据的相关策略
经验软件工程研究人员关心的是理解感兴趣的结果(例如缺陷)与过程和产品度量之间的关系。使用相关性来揭示强关系是多变量建模的自然先驱。不幸的是,相关系数很难解释,而且/或容易引起误解。例如,多项式关系中的变量之间存在很强的相关性;这可能导致人们错误地,并最终误导,在线性回归中建立多项式相关变量的模型。同样,非单调泛函关系,甚至非泛函关系也可能被相关系数完全忽略。异常值会影响标准相关度量,关联值甚至会过度影响稳健的非参数秩相关度量,而较小的样本量会导致相关度量不稳定。一种新的二元关联度量,最大信息系数(MIC)[1],承诺同时发现两个变量是否具有:A)任何关联,b)函数关系,c)非线性关系。MIC是对标准和等级相关度量的非常有用的补充。它分别规定了一种关系的存在及其确切的性质;因此,它可以在建模非功能和非线性关系时做出更明智的选择,并且可以更细致地指示标准和等级相关度量所报告的值的潜在问题。我们使用各种软件工程度量来说明MIC的使用。我们研究和解释了软件工程数据中MIC和相关度量的分布特性,并说明了这些度量对经验软件工程研究者的价值。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
MINCE: Mining change history of Android project Co-evolution of logical couplings and commits for defect estimation Analysis of customer satisfaction survey data Do faster releases improve software quality? An empirical case study of Mozilla Firefox Why do software packages conflict?
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1