How are software datasets constructed in Empirical Software Engineering studies? A systematic mapping study

J. A. Carruthers, J. A. D. Pace, E. Irrazábal
Published in: 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2022-08-01
DOI: 10.1109/SEAA56994.2022.00075 (https://doi.org/10.1109/SEAA56994.2022.00075)

Abstract

Context: Software projects are common inputs in Empirical Software Engineering (ESE) studies, although they are often selected with ad hoc strategies that reduce the generalizability of the results. An alternative is to use available datasets of software projects, which should be current and follow explicit rules that ensure their validity over time. Goal: In this context, it is important to assess the general state of software datasets in terms of purpose, last update, project characterization, source code metrics, and tools to extract source-code-related artifacts. Method: We conducted a systematic mapping study retrieving software datasets used in ESE studies published from January 2013 to December 2021. Results: We selected 74 datasets created mainly for software defects, software estimation, and software maintainability studies. The majority of these datasets (64%) explicitly stated the characteristics used to select the projects, and the most common programming languages were Java and C. Conclusions: Our study identified scarce efforts to keep datasets updated over time and also provides recommendations to support their construction and consumption for ESE studies.