Assessing the Representativeness of Open Source Projects in Empirical Software Engineering Studies

2012 19th Asia-Pacific Software Engineering Conference
Pub Date: 2012-12-04
DOI: 10.1109/APSEC.2012.36

Hao Zhong, Ye Yang, J. Keung

Citations: 3

Abstract

BACKGROUND: Software engineering researchers have carried out many empirical studies on open source software (OSS) projects to understand the OSS phenomenon and to develop better software engineering techniques. Many of these studies use only a few successful projects as study subjects. Recently, these studies have been criticized and challenged on how well their subjects represent OSS projects in general.

AIM: First, we aim to examine to what extent data extracted from successful projects differ from data extracted from the majority. If the two differ substantially, approaches that are effective on successful projects may not be effective in general. Second, we aim to examine whether successful OSS projects are representative of the whole population of OSS. If they are not, conclusions drawn only from successful projects may reflect the OSS phenomenon only partially.

METHODOLOGY: We analyzed 11,684 OSS projects hosted on Source Forge. When researchers select subjects, they typically select successful projects that are attractive to both users and developers. Considering this preference, we clustered these projects into four categories based on their attractiveness to users and to developers, using the K-means clustering technique to produce the combined result. Furthermore, we selected eight indicators that are used in many existing studies (e.g., team sizes) and compared the indicators extracted from the different categories to investigate to what degree they differ.

RESULT: For the first research aim, the result shows that 66.1% of the projects are under-development projects, 14.7% are user-preference projects, 14.2% are developer-preference projects, and only 5.0% are considered successful. For the second research aim, the result shows that all eight analyzed indicators are highly unbalanced, with a gamma distribution. Furthermore, the result reveals that users and developers of Source Forge have different perceptions of the development status defined by Source Forge.

CONCLUSION: We conclude that successful projects are not representative of the whole population of OSS, and that data extracted from successful projects are quite different from data extracted from the majority. The result implies that conclusions drawn from only a few successful projects may be challenged. This work is important because it allows researchers to refine the conclusions of existing studies, and to better understand and more carefully select OSS project subjects for future empirical experiments.
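The clustering step in the methodology can be illustrated with a minimal sketch. The abstract does not say how attractiveness was measured or which K-means implementation was used, so everything below is an assumption for illustration only: each project is reduced to a hypothetical pair of scores (attractiveness to users, attractiveness to developers), the twelve data points are invented, and the `kmeans` function is a plain Lloyd's-algorithm implementation written for this example, not the authors' code.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm over a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialise centroids by picking k distinct points at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical (user_score, developer_score) pairs, invented for illustration.
projects = [
    (1, 1), (2, 1), (1, 2),   # low on both axes
    (9, 1), (8, 2), (9, 2),   # attractive mainly to users
    (1, 9), (2, 8), (1, 8),   # attractive mainly to developers
    (9, 9), (8, 9), (9, 8),   # attractive to both
]

labels, centroids = kmeans(projects, k=4)
for point, label in zip(projects, labels):
    print(point, "-> cluster", label)
```

K-means only returns anonymous cluster indices. Naming a cluster "successful" or "user-preference" is a separate interpretation step based on where its centroid sits on the two axes, and the result depends on the random initialisation, so a real analysis would typically repeat the run with several seeds.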
