MIC check: A correlation tactic for ESE data

2012 9th IEEE Working Conference on Mining Software Repositories (MSR) Pub Date : 2012-06-01 DOI:10.1109/MSR.2012.6224295

Daryl Posnett, Premkumar T. Devanbu, V. Filkov


Citations: 10

Abstract

Empirical software engineering researchers are concerned with understanding the relationships between outcomes of interest, e.g., defects, and process and product measures. The use of correlations to uncover strong relationships is a natural precursor to multivariate modeling. Unfortunately, correlation coefficients can be difficult to interpret and can mislead. For example, a strong correlation occurs between variables that stand in a polynomial relationship; this may mistakenly lead one to model a polynomially related variable in a linear regression, with misleading results. Likewise, a non-monotonic functional, or even non-functional, relationship might be entirely missed by a correlation coefficient. Outliers can influence standard correlation measures, tied values can unduly influence even robust non-parametric rank correlation measures, and smaller sample sizes can cause instability in correlation measures. A new bivariate measure of association, the Maximal Information Coefficient (MIC) [1], promises to simultaneously discover whether two variables have: a) any association, b) a functional relationship, and c) a nonlinear relationship. The MIC is a very useful complement to standard and rank correlation measures. It separately characterizes the existence of a relationship and its precise nature; thus, it enables more informed choices in modeling non-functional and nonlinear relationships, and it provides a more nuanced indicator of potential problems with the values reported by standard and rank correlation measures. We illustrate the use of MIC on a variety of software engineering metrics. We study and explain the distributional properties of MIC and related measures in software engineering data, and illustrate the value of these measures for the empirical software engineering researcher.
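The two failure modes named in the abstract can be reproduced on synthetic data. The sketch below is not taken from the paper: the data, the function names, and the fixed 5x5 grid are all assumptions made for illustration. It computes Pearson and Spearman coefficients by hand and adds a crude binned mutual-information score. That score is only a simplified stand-in for an information-based measure; it is not MIC, which searches over many grid resolutions and partitions rather than using a single fixed grid.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def binned_mutual_information(xs, ys, bins=5):
    """Mutual information on one fixed equal-width grid, divided by log(bins).

    Illustrative stand-in only (assumption): this is NOT the MIC algorithm.
    """
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against constant input
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = sum((c / n) * math.log(c * n / (px[i] * py[j]))
             for (i, j), c in joint.items())
    return mi / math.log(bins)


# Case 1: non-monotonic functional relationship, y = x^2 on a symmetric range.
xs_sym = [i / 10 for i in range(-50, 51)]
parabola = [x * x for x in xs_sym]

# Case 2: monotonic polynomial relationship, y = x^3 on positive x.
xs_pos = [i / 10 for i in range(1, 101)]
cubic = [x ** 3 for x in xs_pos]

print("parabola: pearson =", pearson(xs_sym, parabola),
      "spearman =", spearman(xs_sym, parabola),
      "binned MI =", binned_mutual_information(xs_sym, parabola))
print("cubic:    pearson =", pearson(xs_pos, cubic),
      "spearman =", spearman(xs_pos, cubic))
```

On the parabola, both correlation coefficients come out at essentially zero even though y is fully determined by x, while the binned score is clearly positive. On the cubic, Pearson is high despite the relationship being nonlinear, which is the situation that can tempt one into an unjustified linear model.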
