Jyoti Sheoran, Kelly Blincoe, Eirini Kalliamvakou, D. Damian, J. Ell
Users on GitHub can watch repositories to receive notifications about project activity. This introduces a new type of passive project membership. In this paper, we investigate the behavior of watchers and their contribution to the projects they watch. We find that a subset of project watchers begin contributing to the project and those contributors account for a significant percentage of contributors on the project. As contributors, watchers are more confident and contribute over a longer period of time in a more varied way than other contributors. This is likely attributable to the knowledge gained through project notifications.
{"title":"Understanding \"watchers\" on GitHub","authors":"Jyoti Sheoran, Kelly Blincoe, Eirini Kalliamvakou, D. Damian, J. Ell","doi":"10.1145/2597073.2597114","DOIUrl":"https://doi.org/10.1145/2597073.2597114","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"483 1","pages":"336-339"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80146082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thanh H. D. Nguyen, M. Nagappan, A. Hassan, Mohamed N. Nasser, P. Flora
Even the addition of a single extra field or control statement in the source code of a large-scale software system can lead to performance regressions. Such regressions can considerably degrade the user experience. Working closely with the members of a performance engineering team, we observe that they face a major challenge in identifying the cause of a performance regression given the large number of performance counters (e.g., memory and CPU usage) that must be analyzed. We propose the mining of a regression-causes repository (where the results of performance tests and causes of past regressions are stored) to assist the performance team in identifying the regression-cause of a newly-identified regression. We evaluate our approach on an open-source system, and a commercial system for which the team is responsible. The results show that our approach can accurately (up to 80% accuracy) identify performance regression-causes using a reasonably small number of historical test runs (sometimes as few as four test runs per regression-cause).
{"title":"An industrial case study of automatically identifying performance regression-causes","authors":"Thanh H. D. Nguyen, M. Nagappan, A. Hassan, Mohamed N. Nasser, P. Flora","doi":"10.1145/2597073.2597092","DOIUrl":"https://doi.org/10.1145/2597073.2597092","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"11 1","pages":"232-241"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85069386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, A. Zaidman
In recent years, GitHub has become the largest code host in the world, with more than 5M developers collaborating across 10M repositories. Numerous popular open source projects (such as Ruby on Rails, Homebrew, Bootstrap, Django or jQuery) have chosen GitHub as their host and have migrated their code base to it. GitHub offers tremendous research potential. For instance, it is a flagship for current open source development, a place for developers to showcase their expertise to peers or potential recruiters, and the platform where social coding features or pull requests emerged. However, GitHub data is, to date, largely underexplored. To facilitate studies of GitHub, we have created GHTorrent, a scalable, queryable, offline mirror of the data offered through the GitHub REST API. In this paper we present a novel feature of GHTorrent designed to offer customisable data dumps on demand. The new GHTorrent data-on-demand service lets users request, via a web form, up-to-date GHTorrent data dumps for any collection of GitHub repositories. We hope that by offering customisable GHTorrent data dumps we will not only lower the "barrier for entry" even further for researchers interested in mining GitHub data (thus encouraging them to intensify their mining efforts), but also enhance the replicability of GitHub studies (since a snapshot of the data on which the results were obtained can now easily accompany each study).
{"title":"Lean GHTorrent: GitHub data on demand","authors":"Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, A. Zaidman","doi":"10.1145/2597073.2597126","DOIUrl":"https://doi.org/10.1145/2597073.2597126","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"70 1","pages":"384-387"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79723880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern web applications consist of a significant amount of client-side code, written in JavaScript, HTML, and CSS. In this paper, we present a study of common challenges and misconceptions among web developers, by mining related questions asked on Stack Overflow. We use unsupervised learning to categorize the mined questions and define a ranking algorithm to rank all the Stack Overflow questions based on their importance. We analyze the top 50 questions qualitatively. The results indicate that (1) the overall share of web development related discussions is increasing among developers, (2) browser related discussions are prevalent; however, this share is decreasing with time, (3) form validation and other DOM related topics have been discussed consistently over time, (4) web related discussions are becoming more prevalent in mobile development, and (5) developers face implementation issues with new HTML5 features such as Canvas. We examine the implications of the results for the development, research, and standardization communities.
{"title":"Mining questions asked by web developers","authors":"Kartik Bajaj, K. Pattabiraman, A. Mesbah","doi":"10.1145/2597073.2597083","DOIUrl":"https://doi.org/10.1145/2597073.2597083","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"2001 1","pages":"112-121"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82848658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A growing number of software solutions have been proposed to address application-level energy consumption problems in the last few years. However, little is known about how much software developers are concerned about energy consumption, what aspects of energy consumption they consider important, and what solutions they have in mind for improving energy efficiency. In this paper we present the first empirical study on understanding the views of application programmers on software energy consumption problems. Using StackOverflow as our primary data source, we analyze a carefully curated sample of more than 300 questions and 550 answers from more than 800 users. With this data, we observed a number of interesting findings. Our study shows that practitioners are aware of the energy consumption problems: the questions they ask are not only diverse -- we found 5 main themes of questions -- but also often more interesting and challenging when compared to the control question set. Even though energy consumption-related questions are popular when considering a number of different popularity measures, the same cannot be said about the quality of their answers. In addition, we observed that some of these answers are often flawed or vague. We contrast the advice provided by these answers with the state-of-the-art research on energy consumption. Our summary of software energy consumption problems may help researchers focus on what matters the most to software developers and end users.
{"title":"Mining questions about software energy consumption","authors":"G. Pinto, F. C. Filho, Yu David Liu","doi":"10.1145/2597073.2597110","DOIUrl":"https://doi.org/10.1145/2597073.2597110","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"208 1","pages":"22-31"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80529852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
James R. Williams, D. D. Ruscio, N. Matragkas, Juri Di Rocco, D. Kolovos
The process of selecting open-source software (OSS) for adoption is not straightforward as it involves exploring various sources of information to determine the quality, maturity, activity, and user support of each project. In the context of the OSSMETER project, we have developed a forge-agnostic metamodel that captures the meta-information common to all OSS projects. We specialise this metamodel for popular OSS forges in order to capture forge-specific meta-information. In this paper we present a dataset conforming to these metamodels for over 500,000 OSS projects hosted on three popular OSS forges: Eclipse, SourceForge, and GitHub. The dataset enables different kinds of automatic analysis and supports objective comparisons of cross-forge OSS alternatives with respect to a user's needs and quality requirements.
{"title":"Models of OSS project meta-information: a dataset of three forges","authors":"James R. Williams, D. D. Ruscio, N. Matragkas, Juri Di Rocco, D. Kolovos","doi":"10.1145/2597073.2597132","DOIUrl":"https://doi.org/10.1145/2597073.2597132","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"1984 1","pages":"408-411"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89878331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the advent of mobile computing, the responsibility of software developers to update and ship energy efficient applications has never been more pronounced. Green mining attempts to address this responsibility by examining the impact of software change on energy consumption. One problem with green mining is that power performance data is not readily available, unlike the data used in many other forms of MSR research. Green miners have to create tests and run them across numerous versions of a software project because power performance data is either missing or never existed for that particular project. In this paper we describe multiple open green mining datasets used in prior green mining work. The datasets include numerous power traces and parallel system call and CPU/IO/Memory traces of multiple versions of multiple products. These datasets enable those more interested in data-mining and modeling to work on green mining problems as well.
{"title":"A green miner's dataset: mining the impact of software change on energy consumption","authors":"Chenlei Zhang, Abram Hindle","doi":"10.1145/2597073.2597130","DOIUrl":"https://doi.org/10.1145/2597073.2597130","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"88 1","pages":"400-403"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79460811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motahareh Bahrami Zanjani, George Swartzendruber, Huzefa H. Kagdi
The paper presents an approach to perform impact analysis (IA) of an incoming change request on source code. The approach is based on a combination of interaction (e.g., Mylyn) and commit (e.g., CVS) histories. The source code entities (i.e., files and methods) that were interacted with or changed in the resolution of past change requests (e.g., bug fixes) were used. Information retrieval, machine learning, and lightweight source code analysis techniques were employed to form a corpus from these source code entities. Additionally, the corpus was augmented with the textual descriptions of the previously resolved change requests and their associated commit messages. Given a textual description of a change request, this corpus is queried to obtain a ranked list of relevant source code entities that are most likely change prone. Such an approach that combines information from interactions and commits for IA at the change request level was not previously investigated. Furthermore, the approach requires only the entities that were interacted with and/or committed in the past, which differs from the previous solutions that require indexing of a complete snapshot (e.g., a release). An empirical study on 3272 interactions and 5093 commits from Mylyn, an open source task management tool, was conducted. The results show that the combined approach outperforms an individual approach based on commits. Moreover, it also outperformed an approach based on indexing a single, complete snapshot of a software system.
{"title":"Impact analysis of change requests on source code based on interaction and commit histories","authors":"Motahareh Bahrami Zanjani, George Swartzendruber, Huzefa H. Kagdi","doi":"10.1145/2597073.2597096","DOIUrl":"https://doi.org/10.1145/2597073.2597096","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"152 1","pages":"162-171"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79613730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
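The querying step described in the impact analysis abstract above (a change request description goes in, a ranked list of source code entities comes out) can be illustrated with a small sketch. The abstract does not say which retrieval model is used, so this example assumes plain TF-IDF weighting with cosine similarity; the entity names, corpus texts, and the `rank_entities` function are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_entities(corpus, query):
    """Rank entities by TF-IDF cosine similarity to the query.

    corpus maps an entity name to the text collected for it (for example,
    descriptions of past change requests and commit messages).
    This is an assumed retrieval model, not the paper's exact method.
    """
    docs = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    n = len(docs)
    # Document frequency: in how many entity documents each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}

    def vectorize(counts):
        # Terms unseen in the corpus carry no weight and are dropped.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(Counter(tokenize(query)))
    scores = {name: cosine(q, vectorize(counts)) for name, counts in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical corpus: one text blob per source code entity.
corpus = {
    "TaskEditor.java": "fix null pointer when opening task editor without repository",
    "SyncJob.java": "synchronize task list with remote repository in background",
    "LoginDialog.java": "store credentials for repository login dialog",
}
ranking = rank_entities(corpus, "task editor crashes with null pointer")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because only entities touched in past change requests appear in `corpus`, the sketch mirrors the abstract's point that no complete snapshot of the system needs to be indexed.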
In this paper, we present data downloaded from Maven, one of the most popular component repositories. The data includes the binaries of 186,392 components, along with source code for 161,025. We identify and organize these components into groups where each group contains all the versions of a library. In order to assess the quality of these components, we make available reports generated by the FindBugs tool on 64,574 components. The information is also made available in the form of a database which stores the total number, type, and priority of bug patterns found in each component, along with its defect density. We also describe how this dataset can be useful in software engineering research.
{"title":"A dataset for maven artifacts and bug patterns found in them","authors":"V. Saini, Hitesh Sajnani, Joel Ossher, C. Lopes","doi":"10.1145/2597073.2597134","DOIUrl":"https://doi.org/10.1145/2597073.2597134","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"201 1","pages":"416-419"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76987370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shane McIntosh, Yasutaka Kamei, Bram Adams, A. Hassan
Software code review, i.e., the practice of having third-party team members critique changes to a software system, is a well-established best practice in both open source and proprietary software domains. Prior work has shown that the formal code inspections of the past tend to improve the quality of software delivered by students and small teams. However, the formal code inspection process mandates strict review criteria (e.g., in-person meetings and reviewer checklists) to ensure a base level of review quality, while the modern, lightweight code reviewing process does not. Although recent work explores the modern code review process qualitatively, little research quantitatively explores the relationship between properties of the modern code review process and software quality. Hence, in this paper, we study the relationship between software quality and: (1) code review coverage, i.e., the proportion of changes that have been code reviewed, and (2) code review participation, i.e., the degree of reviewer involvement in the code review process. Through a case study of the Qt, VTK, and ITK projects, we find that both code review coverage and participation share a significant link with software quality. Low code review coverage and participation are estimated to produce components with up to two and five additional post-release defects respectively. Our results empirically confirm the intuition that poorly reviewed code has a negative impact on software quality in large systems using modern reviewing tools.
{"title":"The impact of code review coverage and code review participation on software quality: a case study of the qt, VTK, and ITK projects","authors":"Shane McIntosh, Yasutaka Kamei, Bram Adams, A. Hassan","doi":"10.1145/2597073.2597076","DOIUrl":"https://doi.org/10.1145/2597073.2597076","url":null,"PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"26 1","pages":"192-201"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74060626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
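The code review abstract above defines code review coverage as the proportion of changes that have been code reviewed. A minimal sketch of that proportion, computed per component, might look as follows. The `Change` record, its field names, and the sample data are assumptions made for illustration; the paper's actual data extraction and its participation metrics are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Change:
    component: str  # hypothetical: component the change belongs to
    reviewed: bool  # hypothetical: True if the change went through code review


def review_coverage(changes):
    """Return, per component, the proportion of changes that were reviewed."""
    totals = {}
    reviewed = {}
    for change in changes:
        totals[change.component] = totals.get(change.component, 0) + 1
        reviewed[change.component] = (
            reviewed.get(change.component, 0) + (1 if change.reviewed else 0)
        )
    return {comp: reviewed[comp] / totals[comp] for comp in totals}


# Made-up sample data, not drawn from the Qt, VTK, or ITK projects.
changes = [
    Change("core", True),
    Change("core", False),
    Change("gui", True),
    Change("gui", True),
]
coverage = review_coverage(changes)
print(coverage)
```

In this sketch a component with a value well below 1.0 would be a low-coverage component in the sense the abstract uses.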