Context. Empirical research consistently demonstrates that scholarly peer review is ineffective, unreliable, and prejudiced. In principle, the solution is to move from contemporary, unstructured, essay-like reviewing to more structured, checklist-like reviewing. The Task Force created models—called “empirical standards”—of the software engineering community’s expectations for different popular methodologies. Objective. This paper presents a tool for facilitating more structured reviewing by generating review checklists from the empirical standards. Design. A tool that generates pre-submission and review forms from the empirical standards for software engineering research was designed and implemented. The pre-submission and review forms can be used by authors and reviewers, respectively, to determine whether a manuscript meets the software engineering community’s expectations for the particular kind of research conducted. Evaluation. The proposed tool can be empirically evaluated using lab or field randomized experiments as well as qualitative research. Huge, impractical studies involving splitting a conference program committee are not necessary to establish the effectiveness of the standards, checklists, and structured review. Conclusions. The checklist generator enables more structured peer reviews, which in turn should improve review quality, reliability, thoroughness, and readability. Empirical research is needed to assess the effectiveness of the tool and the standards.
{"title":"Towards a More Structured Peer Review Process with Empirical Standards","authors":"Arham Arshad, Taher Ahmed Ghaleb, P. Ralph","doi":"10.1145/3463274.3463359","DOIUrl":"https://doi.org/10.1145/3463274.3463359","url":null,"abstract":"Context. Empirical research consistently demonstrates that that scholarly peer review is ineffective, unreliable, and prejudiced. In principle, the solution is to move from contemporary, unstructured, essay-like reviewing to more structured, checklist-like reviewing. The Task Force created models—called “empirical standards”—of the software engineering community’s expectations for different popular methodologies. Objective. This paper presents a tool for facilitating more structured reviewing by generating review checklists from the empirical standards. Design. A tool that generates pre-submission and review forms using the empirical standards for software engineering research was designed and implemented. The pre-submission and review forms can be used by authors and reviewers, respectively, to determine whether a manuscript meets the software engineering community’s expectations for the particular kind of research conducted. Evaluation. The proposed tool can be empirically evaluated using lab or field randomized experiments as well as qualitative research. Huge, impractical studies involving splitting a conference program committee are not necessary to establish the effectiveness of the standards, checklists and structured review. Conclusions. The checklist generator enables more structured peer reviews, which in turn should improve review quality, reliability, thoroughness, and readability. Empirical research is needed to assess the effectiveness of the tool and the standards.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130488054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","authors":"","doi":"10.1145/3463274","DOIUrl":"https://doi.org/10.1145/3463274","url":null,"abstract":"","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128983452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context: It is impossible to imagine our everyday and professional lives without software. Consequently, software products, especially socio-technical systems, have more or less obvious impacts on almost all areas of our society. To examine these impacts, a group of scientists worldwide has developed the Sustainability Awareness Framework (SusAF), which examines the impacts on five interrelated dimensions: social, individual, environmental, economic, and technical. According to this framework, we should design software to maintain or improve the Sustainability Impacts. Designing for sustainability is a major challenge that can profoundly change the field of activity – particularly for Software Engineers. Objectives: The aim of the thesis work is to analyze the current role of Software Engineers and relate it to the Sustainability Impacts of Software Products in order to contribute to this paradigm shift. This should provide a basis for follow-up work. The scientific community still owes an answer to the question of in which direction exactly the Software Engineer should develop and how exactly this path can be followed. Perhaps universities will have to adapt the curriculum in the training of Software Engineers, politics could possibly initiate support programs in the field of sustainability for software companies, or maybe software sustainability certifications could emerge. In any case, Software Engineers must adapt to the times and acquire the necessary knowledge, skills, and competencies. Results: The results of the dissertation are a better understanding of the needed paradigm shift for Software Engineers and a complement to the SusAF to better support sustainability design. The extended SusAF is intended for both training and corporate use.
{"title":"The Connection between the Sustainability Impacts of Software Products and the Role of Software Engineers","authors":"Dominic Lammert","doi":"10.1145/3463274.3463346","DOIUrl":"https://doi.org/10.1145/3463274.3463346","url":null,"abstract":"Context: It is impossible to imagine our everyday and professional lives without software. Consequently, software products, especially socio-technical systems, have more or less obvious impacts on almost all areas of our society. For this purpose, a group of scientists worldwide has developed the Sustainability Awareness Framework (SusAF) which examines the impacts on five interrelated dimensions: social, individual, environmental, economic, and technical. According to this framework, we should design software to maintain or improve the Sustainability Impacts. Designing for sustainability is a major challenge that can profoundly change the field of activity – particular for Software Engineers. Objectives: The aim of the thesis work is to analyze the current role of Software Engineers and relate it to Sustainability Impacts of Software Products in order to contribute to this paradigm shift. This should provide a basis for follow-up works. The question in which direction exactly the Software Engineer should develop and how exactly this path can be followed is still owed by the scientific community. Perhaps universities will have to adapt the curriculum in the training of Software Engineers, politics could possibly initiate support programs in the field of sustainability for software companies, or maybe software sustainability certifications could emerge. In any case, Software Engineers must adapt to the times and acquire the necessary knowledge, the skills and the competencies. Results: The results of the dissertation are a better understanding of the needed paradigm shift of Software Engineers and complement the SusAF that to better support sustainability design. The extended SusAF is intended for both training and corporate use.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122664012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to limited time, budget, or resources, a team is prone to introducing code that does not follow the best software development practices. Such code, which introduces instability into software projects, is known as Technical Debt (TD). Often, TD is intentionally admitted in source code, which is known as Self-Admitted Technical Debt (SATD). This paper presents DebtHunter, a natural language processing (NLP)- and machine learning (ML)-based approach for identifying and classifying SATD in source code comments. The proposed classification approach combines two classification phases to differentiate between the multiple debt types. Evaluations over 10 open source systems, containing more than 259k comments, showed that the approach outperformed others in the literature. The presented approach is supported by a tool that can help developers effectively manage SATD. The tool complements the analysis of Java source code by allowing developers to also examine the associated issue tracker. DebtHunter can be used in a continuous evolution environment to monitor the development process and make developers aware of how and where SATD is introduced, thus helping them to manage and resolve it.
{"title":"DebtHunter: A Machine Learning-based Approach for Detecting Self-Admitted Technical Debt","authors":"Irene Sala, Antonela Tommasel, F. Fontana","doi":"10.1145/3463274.3464455","DOIUrl":"https://doi.org/10.1145/3463274.3464455","url":null,"abstract":"Due to limited time, budget or resources, a team is prone to introduce code that does not follow the best software development practices. This code that introduces instability in the software projects is known as Technical Debt (TD). Often, TD intentionally manifests in source code, which is known as Self-Admitted Technical Debt (SATD). This paper presents DebtHunter, a natural language processing (NLP)- and machine learning (ML)- based approach for identifying and classifying SATD in source code comments. The proposed classification approach combines two classification phases for differentiating between the multiple debt types. Evaluations over 10 open source systems, containing more than 259k comments, showed that the approach was able to improve the performance of others in the literature. The presented approach is supported by a tool that can help developers to effectively manage SATD. The tool complements the analysis over Java source code by allowing developers to also examine the associated issue tracker. DebtHunter can be used in a continuous evolution environment to monitor the development process and make developers aware of how and where SATD is introduced, thus helping them to manage and resolve it.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124451878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is ongoing interest in the Software Engineering field in multivocal literature reviews that include grey literature. However, at the same time, the role of grey literature is still controversial, and the benefits of its inclusion in systematic reviews are the object of discussion. Some of these arguments concern the quality assessment of grey literature entries, which is often considered a challenging and critical task. On the one hand, apart from a few proposals, there is a lack of acknowledged methodological support for the inclusion of Software Engineering grey literature in systematic surveys. On the other hand, the unstructured shape of grey literature contents could lead to bias in the evaluation process, impacting the quality of the surveys. This work leverages an approach based on fuzzy Likert scales and proposes a methodology for managing the explicit uncertainties emerging during the assessment of entries from the grey literature. The methodology also strengthens the adoption of consensus policies that take into account the individual confidence level expressed for each of the collected scores.
{"title":"About the Assessment of Grey Literature in Software Engineering","authors":"G. D. Angelis, F. Lonetti","doi":"10.1145/3463274.3463362","DOIUrl":"https://doi.org/10.1145/3463274.3463362","url":null,"abstract":"There is an ongoing interest in the Software Engineering field for multivocal literature reviews including grey literature. However, at the same time, the role of the grey literature is still controversial, and the benefits of its inclusion in systematic reviews are object of discussion. Some of these arguments concern the quality assessment methods for grey literature entries, which is often considered a challenging and critical task. On the one hand, apart from a few proposals, there is a lack of an acknowledged methodological support for the inclusion of Software Engineering grey literature in systematic surveys. On the other hand, the unstructured shape of the grey literature contents could lead to bias in the evaluation process impacting on the quality of the surveys. This work leverages an approach on fuzzy Likert scales, and it proposes a methodology for managing the explicit uncertainties emerging during the assessment of entries from the grey literature. The methodology also strengthens the adoption of consensus policies that take into account the individual confidence level expressed for each of the collected scores.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114416693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The design of recommendation systems is based on complex information processing and big data interaction. Personalized recommendation has evolved into a hot area over the past decade, and its applications have shown promise for solving problems in the software development field. Therefore, with the evolution of Recommendation Systems in Software Engineering (RSSE), the coordination of software projects with their stakeholders is improving. This experiment examined four open source recommender systems and implemented a customized recommender engine with two industry-oriented packages: Lenskit and Mahout. Each of the main functions was examined and issues were identified during the experiment.
{"title":"Recommender Systems for Software Project Managers","authors":"Liang Wei, Luiz Fernando Capretz","doi":"10.1145/3463274.3463951","DOIUrl":"https://doi.org/10.1145/3463274.3463951","url":null,"abstract":"The design of recommendation systems is based on complex information processing and big data interaction. This personalized view has evolved into a hot area in the past decade, where applications might have been proved to help for solving problem in the software development field. Therefore, with the evolvement of Recommendation System in Software Engineering (RSSE), the coordination of software projects with their stakeholders is improving. This experiment examines four open source recommender systems and implemented a customized recommender engine with two industrial-oriented packages: Lenskit and Mahout. Each of the main functions was examined and issues were identified during the experiment.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125558233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bridging the gap between academic research and industrial application is an important issue in promoting Jackson's Problem Frames approach (PF) to the software engineering community. Various attempts have been made to tackle this problem, such as defining formal semantics of PF for software development and providing a semi-formal approach to model transformations of problem diagrams, with automated tool support. In this paper, we propose to focus exclusively on exploring and evaluating the effectiveness of Jackson's problem diagrams for modeling the context of cyber-physical systems, by developing a suite of support tools enhanced with adaptive user interfaces and empirically and comprehensively assessing its usability. This paper introduces the state of the art, the corresponding research questions, the research methodology, and the current progress of our research.
{"title":"Evaluating the Effectiveness of Problem Frames for Contextual Modeling of Cyber-Physical Systems: a Tool Suite with Adaptive User Interfaces","authors":"Waqas Junaid","doi":"10.1145/3463274.3463344","DOIUrl":"https://doi.org/10.1145/3463274.3463344","url":null,"abstract":"Bridging the gap between academic research and industrial application is an important issue to promote Jackson's Problem Frames approach (PF) to the software engineering community. Various attempts have been made to tackle this problem, such as defining formal semantics of PF for software development, and providing a semi-formal approach to model transformations of problem diagrams, with automated tool support. In this paper, we propose to exclusively focus on exploring and evaluating the effectiveness of Jackson's problem diagrams for modeling the context of cyber-physical systems, by developing a suite of support tools enhanced with adaptive user interfaces, and empirically and comprehensively assess its usability. This paper introduces the state of the art, corresponding research questions, research methodologies and current progress of our research.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116879830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context: Software development is moving towards a place where data about development is gathered in a systematic fashion in order to improve the practice, for example, in the tuning of static code analysis. However, this kind of data gathering has so far primarily happened within organizations, which is unfortunate as it tends to favor larger organizations with more resources for the maintenance of developer tools. Objective: Over the years, we have seen many benefits from open source, and recently there has been substantial development in open data. We see this as an opportunity for cross-organisation community building and wonder to what extent the views on using and sharing open source software developer tools carry across to open data-driven tuning of software development tools. Method: An exploratory study with 11 participants divided into 3 focus groups discussing the use and sharing of static code analyzers and of data about these analyzers. Results: While using and sharing open-source code (analyzers in this case) is perceived in a positive light as part of the practice of modern software development, sharing data is met with skepticism and uncertainty. Developers are concerned about threats to the company brand, exposure of intellectual property, legal liabilities, and the extent to which data is context-specific to a certain organisation. Conclusions: Sharing data in software development is different from sharing data about software development. We need to better understand how we can provide solutions for sharing software development data in a fashion that reduces risk and enables openness.
{"title":"Open Data-driven Usability Improvements of Static Code Analysis and its Challenges","authors":"Emma Söderberg, Luke Church, Martin Höst","doi":"10.1145/3463274.3463808","DOIUrl":"https://doi.org/10.1145/3463274.3463808","url":null,"abstract":"Context: Software development is moving towards a place where data about development is gathered in a systematic fashion in order to improve the practice, for example, in tuning of static code analysis. However, this kind of data gathering has so far primarily happened within organizations, which is unfortunate as it tends to favor larger organizations with more resources for maintenance of developer tools. Objective: Over the years, we have seen a lot of benefits from open source and recently there has been a lot of development in open data. We see this as an opportunity for cross-organisation community building and wonder to what extent the views on using and sharing open source software developer tools carry across to open data-driven tuning of software development tools. Method: An exploratory study with 11 participants divided into 3 focus groups discussing using and sharing of static code analyzers and data about these analyzers. Results: While using and sharing open-source code (analyzers in this case) is perceived in a positive light as part of the practice of modern software development, sharing data is met with skepticism and uncertainty. Developers are concerned about threats to the company brand, exposure of intellectual property, legal liabilities, and to what extent data is context-specific to a certain organisation. Conclusions: Sharing data in software development is different from sharing data about software development. We need to better understand how we can provide solutions for sharing of software development data in a fashion that reduces risk and enables openness.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115468378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Blogs are a source of grey literature widely adopted by software practitioners for disseminating opinion and experience. Analysing such articles can provide useful insights into the state of practice for software engineering research. However, there are challenges in identifying higher quality content from the large quantity of articles available. Credibility assessment can help in identifying quality content, though there is a lack of existing corpora. Credibility is typically measured through a series of conceptual criteria, with 'argumentation' and 'evidence' being two important criteria. Objective: We create a corpus labelled for argumentation and evidence that can aid the credibility community. The corpus consists of articles from the blog of a single software practitioner and is publicly available. Method: Three annotators label the corpus with a series of conceptual credibility criteria, reaching an agreement of 0.82 (Fleiss' Kappa). We present a preliminary analysis of the corpus by using it to investigate the identification of claim sentences (one of our ten labels). Results: We train four systems (BERT, KNN, Decision Tree, and SVM) using three feature sets (Bag of Words, Topic Modelling, and InferSent), achieving an F1 score of 0.64 using InferSent and a Linear SVM. Conclusions: Our preliminary results are promising, indicating that the corpus can help future studies in detecting the credibility of grey literature. Future research will investigate the degree to which the sentence-level annotations can infer the credibility of the overall document.
{"title":"Towards a corpus for credibility assessment in software practitioner blog articles","authors":"Ashley Williams, M. Shardlow, A. Rainer","doi":"10.1145/3463274.3463330","DOIUrl":"https://doi.org/10.1145/3463274.3463330","url":null,"abstract":"Background: Blogs are a source of grey literature which are widely adopted by software practitioners for disseminating opinion and experience. Analysing such articles can provide useful insights into the state–of–practice for software engineering research. However, there are challenges in identifying higher quality content from the large quantity of articles available. Credibility assessment can help in identifying quality content, though there is a lack of existing corpora. Credibility is typically measured through a series of conceptual criteria, with ’argumentation’ and ’evidence’ being two important criteria. Objective: We create a corpus labelled for argumentation and evidence that can aid the credibility community. The corpus consists of articles from the blog of a single software practitioner and is publicly available. Method: Three annotators label the corpus with a series of conceptual credibility criteria, reaching an agreement of 0.82 (Fleiss’ Kappa). We present preliminary analysis of the corpus by using it to investigate the identification of claim sentences (one of our ten labels). Results: We train four systems (Bert, KNN, Decision Tree and SVM) using three feature sets (Bag of Words, Topic Modelling and InferSent), achieving an F1 score of 0.64 using InferSent and a Linear SVM. Conclusions: Our preliminary results are promising, indicating that the corpus can help future studies in detecting the credibility of grey literature. Future research will investigate the degree to which the sentence level annotations can infer the credibility of the overall document.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115506553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reviewer selection in modern code review is crucial for effective code reviews. Several techniques exist for recommending reviewers appropriate for a given pull request (PR). Most code reviewer recommendation techniques in the literature build and evaluate their models based on datasets collected from real projects using open-source or industrial practices. The techniques invariably presume that these datasets reliably represent the “ground truth.” In the context of a classification problem, ground truth refers to the objectively correct labels of a class used to build models from a dataset or evaluate a model’s performance. In a project dataset used to build a code reviewer recommendation system, the code reviewer picked for a PR is usually assumed to be the best code reviewer for that PR. However, in practice, the picked code reviewer may not be the best possible code reviewer, or even a qualified one. Recent code reviewer recommendation studies suggest that the datasets used tend to suffer from systematic labeling bias, making the ground truth unreliable. Therefore, models and recommendation systems built on such datasets may perform poorly in real practice. In this study, we introduce a novel approach to automatically detect and eliminate systematic labeling bias in code reviewer recommendation systems. The bias that we remove results from selecting reviewers that do not ensure a permanently successful fix for a bug-related PR. To demonstrate the effectiveness of our approach, we evaluated it on two open-source project datasets — HIVE and QT Creator — and with five code reviewer recommendation techniques: Profile-Based, RSTrace, Naive Bayes, k-NN, and Decision Tree. Our debiasing approach appears promising, since it improved the Mean Reciprocal Rank (MRR) of the evaluated techniques by up to 26% on the datasets used.
{"title":"Detection and Elimination of Systematic Labeling Bias in Code Reviewer Recommendation Systems","authors":"K. A. Tecimer, Eray Tüzün, Hamdi Dibeklioğlu, H. Erdogmus","doi":"10.1145/3463274.3463336","DOIUrl":"https://doi.org/10.1145/3463274.3463336","url":null,"abstract":"Reviewer selection in modern code review is crucial for effective code reviews. Several techniques exist for recommending reviewers appropriate for a given pull request (PR). Most code reviewer recommendation techniques in the literature build and evaluate their models based on datasets collected from real projects using open-source or industrial practices. The techniques invariably presume that these datasets reliably represent the “ground truth.” In the context of a classification problem, ground truth refers to the objectively correct labels of a class used to build models from a dataset or evaluate a model’s performance. In a project dataset used to build a code reviewer recommendation system, the recommended code reviewer picked for a PR is usually assumed to be the best code reviewer for that PR. However, in practice, the recommended code reviewer may not be the best possible code reviewer, or even a qualified one. Recent code reviewer recommendation studies suggest that the datasets used tend to suffer from systematic labeling bias, making the ground truth unreliable. Therefore, models and recommendation systems built on such datasets may perform poorly in real practice. In this study, we introduce a novel approach to automatically detect and eliminate systematic labeling bias in code reviewer recommendation systems. The bias that we remove results from selecting reviewers that do not ensure a permanently successful fix for a bug-related PR. To demonstrate the effectiveness of our approach, we evaluated it on two open-source project datasets —HIVE and QT Creator— and with five code reviewer recommendation techniques —Profile-Based, RSTrace, Naive Bayes, k-NN, and Decision Tree. Our debiasing approach appears promising since it improved the Mean Reciprocal Rank (MRR) of the evaluated techniques up to 26% in the datasets used.","PeriodicalId":328024,"journal":{"name":"Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117175825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}