Otávio Cury da Costa Castro, G. Avelino, P. Neto, Ricardo Britto, M. T. Valente
{"title":"识别源代码文件专家","authors":"Otávio Cury da Costa Castro, G. Avelino, P. Neto, Ricardo Britto, M. T. Valente","doi":"10.1145/3544902.3546243","DOIUrl":null,"url":null,"abstract":"Background: In software development, the identification of source code file experts is an important task. Identifying these experts helps to improve software maintenance and evolution activities, such as developing new features, code reviews, and bug fixes. Although some studies have proposed repository-mining techniques to automatically identify source code experts, there are still gaps in this area that can be explored. For example, investigating new variables related to source code knowledge and applying machine learning aiming to improve the performance of techniques to identify source code experts. Aim: The goal of this study is to investigate opportunities to improve the performance of existing techniques to recommend source code files experts. Method: We built an oracle by collecting data from the development history and surveying developers of 113 software projects. Then, we use this oracle to: (i) analyze the correlation between measures extracted from the development history and the developers’ source code knowledge and (ii) investigate the use of machine learning classifiers by evaluating their performance in identifying source code files experts. Results:First Authorship and Recency of Modification are the variables with the highest positive and negative correlations with source code knowledge, respectively. Machine learning classifiers outperformed the linear techniques (F-Measure = 71% to 73%) in the public dataset, but this advantage is not clear in the private dataset, with F-Measure ranging from 55% to 68% for the linear techniques and 58% to 67% for ML techniques. Conclusion: Overall, the linear techniques and the machine learning classifiers achieved similar performance, particularly if we analyze F-Measure. However, machine learning classifiers usually get higher precision while linear techniques obtained the highest recall values. Therefore, the choice of the best technique depends on the user’s tolerance to false positives and false negatives.","PeriodicalId":220679,"journal":{"name":"Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement","volume":"203 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Identifying Source Code File Experts\",\"authors\":\"Otávio Cury da Costa Castro, G. Avelino, P. Neto, Ricardo Britto, M. T. Valente\",\"doi\":\"10.1145/3544902.3546243\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: In software development, the identification of source code file experts is an important task. Identifying these experts helps to improve software maintenance and evolution activities, such as developing new features, code reviews, and bug fixes. Although some studies have proposed repository-mining techniques to automatically identify source code experts, there are still gaps in this area that can be explored. For example, investigating new variables related to source code knowledge and applying machine learning aiming to improve the performance of techniques to identify source code experts. Aim: The goal of this study is to investigate opportunities to improve the performance of existing techniques to recommend source code files experts. Method: We built an oracle by collecting data from the development history and surveying developers of 113 software projects. Then, we use this oracle to: (i) analyze the correlation between measures extracted from the development history and the developers’ source code knowledge and (ii) investigate the use of machine learning classifiers by evaluating their performance in identifying source code files experts. Results:First Authorship and Recency of Modification are the variables with the highest positive and negative correlations with source code knowledge, respectively. Machine learning classifiers outperformed the linear techniques (F-Measure = 71% to 73%) in the public dataset, but this advantage is not clear in the private dataset, with F-Measure ranging from 55% to 68% for the linear techniques and 58% to 67% for ML techniques. Conclusion: Overall, the linear techniques and the machine learning classifiers achieved similar performance, particularly if we analyze F-Measure. However, machine learning classifiers usually get higher precision while linear techniques obtained the highest recall values. Therefore, the choice of the best technique depends on the user’s tolerance to false positives and false negatives.\",\"PeriodicalId\":220679,\"journal\":{\"name\":\"Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement\",\"volume\":\"203 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3544902.3546243\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3544902.3546243","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
摘要
背景:在软件开发中,源代码文件专家的识别是一项重要任务。识别这些专家有助于改进软件维护和发展活动,例如开发新特性、代码审查和错误修复。尽管一些研究已经提出了自动识别源代码专家的存储库挖掘技术,但在这一领域仍然存在有待探索的空白。例如,调查与源代码知识相关的新变量,并应用旨在提高识别源代码专家的技术性能的机器学习。目的:本研究的目的是调查机会,以提高现有技术的性能,以推荐源代码文件专家。方法:通过对113个软件项目的开发人员进行调查,收集开发历史数据,构建了一个oracle数据库。然后,我们使用该oracle:(i)分析从开发历史中提取的度量与开发人员的源代码知识之间的相关性;(ii)通过评估机器学习分类器在识别源代码文件专家方面的性能来研究机器学习分类器的使用。结果:第一作者身份(First author)和修改近时性(recent of Modification)分别是与源代码知识正相关和负相关最高的变量。机器学习分类器在公共数据集中优于线性技术(F-Measure = 71%至73%),但这种优势在私有数据集中并不明显,线性技术的F-Measure范围为55%至68%,ML技术的F-Measure范围为58%至67%。结论:总体而言,线性技术和机器学习分类器实现了相似的性能,特别是如果我们分析F-Measure。然而,机器学习分类器通常获得更高的精度,而线性技术获得最高的召回值。因此,最佳技术的选择取决于用户对假阳性和假阴性的容忍度。
Background: In software development, the identification of source code file experts is an important task. Identifying these experts helps to improve software maintenance and evolution activities, such as developing new features, code reviews, and bug fixes. Although some studies have proposed repository-mining techniques to automatically identify source code experts, there are still gaps in this area that can be explored. For example, investigating new variables related to source code knowledge and applying machine learning aiming to improve the performance of techniques to identify source code experts. Aim: The goal of this study is to investigate opportunities to improve the performance of existing techniques to recommend source code files experts. Method: We built an oracle by collecting data from the development history and surveying developers of 113 software projects. Then, we use this oracle to: (i) analyze the correlation between measures extracted from the development history and the developers’ source code knowledge and (ii) investigate the use of machine learning classifiers by evaluating their performance in identifying source code files experts. Results:First Authorship and Recency of Modification are the variables with the highest positive and negative correlations with source code knowledge, respectively. Machine learning classifiers outperformed the linear techniques (F-Measure = 71% to 73%) in the public dataset, but this advantage is not clear in the private dataset, with F-Measure ranging from 55% to 68% for the linear techniques and 58% to 67% for ML techniques. Conclusion: Overall, the linear techniques and the machine learning classifiers achieved similar performance, particularly if we analyze F-Measure. However, machine learning classifiers usually get higher precision while linear techniques obtained the highest recall values. Therefore, the choice of the best technique depends on the user’s tolerance to false positives and false negatives.