Source code in software systems has been shown to have a good degree of repetitiveness at the lexical, syntactical, and API usage levels. This paper presents a large-scale study on the repetitiveness, containment, and composability of source code at the semantic level. We collected a large dataset consisting of 9,224 Java projects with 2.79M class files and 17.54M methods comprising 187M SLOC. For each method in a project, we build the program dependency graph (PDG) to represent a routine, and compare PDGs with one another as well as the subgraphs within them. We found that within a project, 12.1% of the routines are repeated, and most of them repeat 2–7 times. Taken in their entirety, the routines are quite project-specific: only 3.3% of them are repeated exactly in 1–4 other projects, at most 8 times. We also found that 26.1% and 7.27% of the routines are contained in other routine(s), i.e., implemented as part of other routine(s) elsewhere within a project and in other projects, respectively. Except for trivial routines, their repetitiveness and containment are independent of their complexity. Defining a subroutine via a per-variable slicing subgraph in a PDG, we found that 14.3% of all routines have all of their subroutines repeated. A high percentage of the subroutines in a routine can be found/reused elsewhere. We collected 8,764,971 unique subroutines (including 323,564 unique JDK subroutines) as basic units for code searching/synthesis. We also discuss practical implications of our findings for automated tools.
{"title":"A Large-Scale Study on Repetitiveness, Containment, and Composability of Routines in Open-Source Projects","authors":"A. Nguyen, H. Nguyen, T. Nguyen","doi":"10.1145/2901739.2901759","DOIUrl":"https://doi.org/10.1145/2901739.2901759","url":null,"abstract":"Source code in software systems has been shown to have a good degree of repetitiveness at the lexical, syntactical, and API usage levels. This paper presents a large-scale study on the repetitiveness, containment, and composability of source code at the semantic level. We collected a large dataset consisting of 9,224 Java projects with 2.79M class files, 17.54M methods with 187M SLOCs. For each method in a project, we build the program dependency graph (PDG) to represent a routine, and compare PDGs with one another as well as the subgraphs within them. We found that within a project, 12.1% of the routines are repeated, and most of them repeat from 2–7 times. As entirety, the routines are quite project-specific with only 3.3% of them exactly repeating in 1–4 other projects with at most 8 times. We also found that 26.1% and 7.27% of the routines are contained in other routine(s), i.e., implemented as part of other routine(s) elsewhere within a project and in other projects, respectively. Except for trivial routines, their repetitiveness and containment is independent of their complexity. Defining a subroutine via a per-variable slicing subgraph in a PDG, we found that 14.3% of all routines have all of their subroutines repeated. A high percentage of subroutines in a routine can be found/reused elsewhere. We collected 8,764,971 unique subroutines (with 323,564 unique JDK subroutines) as basic units for code searching/synthesis. We also provide practical implications of our findings to automated tools.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"21 1","pages":"362-373"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84252999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sentiment analysis has been adopted in software engineering for problems such as software usability and the sentiment of developers in open-source projects. This paper proposes a method to evaluate the sentiment contained in tickets for IT (Information Technology) support. IT tickets are broad in coverage (e.g., infrastructure, software) and involve errors, incidents, requests, etc. The main challenge is to automatically distinguish factual information, which is intrinsically negative (e.g., an error description), from the sentiment embedded in the description. Our approach is to automatically create a Domain Dictionary that contains terms with sentiment in the IT context, which is used to filter the terms in a ticket for sentiment analysis. We experiment with and evaluate three approaches for calculating the polarity of terms in tickets. Our study was developed using 34,895 tickets from five organizations, from which we randomly selected 2,333 tickets to compose a Gold Standard. Our best results display an average precision and recall of 82.83% and 88.42%, which outperforms the compared sentiment analysis solutions.
{"title":"Sentiment Analysis in Tickets for IT Support","authors":"Cássio Castaldi Araújo Blaz, Karin Becker","doi":"10.1145/2901739.2901781","DOIUrl":"https://doi.org/10.1145/2901739.2901781","url":null,"abstract":"Sentiment analysis has been adopted in software engineeringfor problems such as software usability and sentimentof developers in open-source projects. This paper proposesa method to evaluate the sentiment contained in tickets forIT (Information Technology) support.IT tickets are broadin coverage (e.g. infrastructure, software), and involve errors,incidents, requests, etc. The main challenge is to automaticallydistinguish between factual information, whichis intrinsically negative (e.g. error description), from thesentiment embedded in the description. Our approach isto automatically create a Domain Dictionary that containsterms with sentiment in the IT context, used to filter termsin ticket for sentiment analysis. We experiment and evaluatethree approaches for calculating the polarity of terms intickets. Our study was developed using 34,895 tickets fromfive organizations, from which we randomly selected 2,333tickets to compose a Gold Standard. Our best results displayan average precision and recall of 82.83% and 88.42%, whichoutperforms the compared sentiment analysis solutions.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"69 1","pages":"235-246"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90470210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Md Tajmilur Rahman, Louis-Philippe Querel, Peter C. Rigby, Bram Adams
Continuous delivery and rapid releases have led to innovative techniques for integrating new features and bug fixes into a new release faster. To reduce the probability of integration conflicts, major software companies, including Google, Facebook and Netflix, use feature toggles to incrementally integrate and test new features instead of integrating the feature only when it’s ready. Even after release, feature toggles allow operations managers to quickly disable a new feature that is behaving erratically or to enable certain features only for certain groups of customers. Since the literature on feature toggles is surprisingly slim, this paper tries to understand the prevalence and impact of feature toggles. First, we conducted a quantitative analysis of feature toggle usage across 39 releases of Google Chrome (spanning five years of release history). Then, we studied the technical debt involved with feature toggles by mining a spreadsheet used by Google developers for feature toggle maintenance. Finally, we performed a thematic analysis of videos and blog posts of release engineers at major software companies in order to further understand the strengths and drawbacks of feature toggles in practice. We also validated our findings with four Google developers. We find that toggles can reconcile rapid releases with long-term feature development and allow flexible control over which features to deploy. However, they also introduce technical debt and additional maintenance for developers.
{"title":"Feature Toggles: Practitioner Practices and a Case Study","authors":"Md Tajmilur Rahman, Louis-Philippe Querel, Peter C. Rigby, Bram Adams","doi":"10.1145/2901739.2901745","DOIUrl":"https://doi.org/10.1145/2901739.2901745","url":null,"abstract":"Continuous delivery and rapid releases have led to innovative techniques for integrating new features and bug fixes into a new release faster. To reduce the probability of integration conflicts, major software companies, including Google, Facebook and Netflix, use feature toggles to incrementally integrate and test new features instead of integrating the feature only when it’s ready. Even after release, feature toggles allow operations managers to quickly disable a new feature that is behaving erratically or to enable certain features only for certain groups of customers. Since literature on feature toggles is surprisingly slim, this paper tries to understand the prevalence and impact of feature toggles. First, we conducted a quantitative analysis of feature toggle usage across 39 releases of Google Chrome (spanning five years of release history). Then, we studied the technical debt involved with feature toggles by mining a spreadsheet used by Google developers for feature toggle maintenance. Finally, we performed thematic analysis of videos and blog posts of release engineers at major software companies in order to further understand the strengths and drawbacks of feature toggles in practice. We also validated our findings with four Google developers. We find that toggles can reconcile rapid releases with long-term feature development and allow flexible control over which features to deploy. However they also introduce technical debt and additional maintenance for developers.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"74 1","pages":"201-211"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90271724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods for predicting issue lifetime can help software project managers to prioritize issues and allocate resources accordingly. Previous studies on issue lifetime prediction have focused on models built from static features, meaning features calculated at one snapshot of the issue's lifetime based on data associated with the issue itself. However, during its lifetime, an issue typically receives comments from various stakeholders, which may carry valuable insights into its perceived priority and difficulty and may thus be exploited to update lifetime predictions. Moreover, the lifetime of an issue depends not only on characteristics of the issue itself, but also on the state of the project as a whole. Hence, issue lifetime prediction may benefit from taking into account features capturing the issue's context (contextual features). In this work, we analyze issues from more than 4,000 GitHub projects and build models to predict, at different points in an issue's lifetime, whether or not the issue will close within a given calendar period, by combining static, dynamic and contextual features. The results show that dynamic and contextual features complement the predictive power of static ones, particularly for long-term predictions.
{"title":"Using Dynamic and Contextual Features to Predict Issue Lifetime in GitHub Projects","authors":"R. Kikas, M. Dumas, Dietmar Pfahl","doi":"10.1145/2901739.2901751","DOIUrl":"https://doi.org/10.1145/2901739.2901751","url":null,"abstract":"Methods for predicting issue lifetime can help software project managers to prioritize issues and allocate resources accordingly. Previous studies on issue lifetime prediction have focused on models built from static features, meaning features calculated at one snapshot of the issue's lifetime based on data associated to the issue itself. However, during its lifetime, an issue typically receives comments from various stakeholders, which may carry valuable insights into its perceived priority and difficulty and may thus be exploited to update lifetime predictions. Moreover, the lifetime of an issue depends not only on characteristics of the issue itself, but also on the state of the project as a whole. Hence, issue lifetime prediction may benefit from taking into account features capturing the issue's context (contextual features). In this work, we analyze issues from more than 4000 GitHub projects and build models to predict, at different points in an issue's lifetime, whether or not the issue will close within a given calendric period, by combining static, dynamic and contextual features. The results show that dynamic and contextual features complement the predictive power of static ones, particularly for long-term predictions.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"118 1","pages":"291-302"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79522980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Md Ahasanuzzaman, M. Asaduzzaman, C. Roy, Kevin A. Schneider
Stack Overflow is a popular question answering site that is focused on programming problems. Despite efforts to prevent asking questions that have already been answered, the site contains duplicate questions. This may cause developers to unnecessarily wait for a question to be answered when it has already been asked and answered. The site currently depends on its moderators and users with high reputation to manually mark those questions as duplicates, which not only results in delayed responses but also requires additional effort. In this paper, we first perform a manual investigation to understand why users submit duplicate questions in Stack Overflow. Based on our manual investigation, we propose a classification technique that uses a number of carefully chosen features to identify duplicate questions. Evaluation using a large number of questions shows that our technique can detect duplicate questions with reasonable accuracy. We also compare our technique with DupPredictor, a state-of-the-art technique for detecting duplicate questions, and find that our proposed technique achieves a better recall rate.
{"title":"Mining Duplicate Questions of Stack Overflow","authors":"Md Ahasanuzzaman, M. Asaduzzaman, C. Roy, Kevin A. Schneider","doi":"10.1145/2901739.2901770","DOIUrl":"https://doi.org/10.1145/2901739.2901770","url":null,"abstract":"Stack Overflow is a popular question answering site that is focused on programming problems. Despite efforts to prevent asking questions that have already been answered, the site contains duplicate questions. This may cause developers to unnecessarily wait for a question to be answered when it has already been asked and answered. The site currently depends on its moderators and users with high reputation to manually mark those questions as duplicates, which not only results in delayed responses but also requires additional efforts. In this paper, we first perform a manual investigation to understand why users submit duplicate questions in Stack Overflow. Based on our manual investigation we propose a classification technique that uses a number of carefully chosen features to identify duplicate questions. Evaluation using a large number of questions shows that our technique can detect duplicate questions with reasonable accuracy. We also compare our technique with DupPredictor, a state-of-the-art technique for detecting duplicate questions, and we found that our proposed technique has a better recall rate than that technique.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"33 1","pages":"402-412"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79709463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon
We present a growing collection of Android applications collected from several sources, including the official Google Play app market. Our dataset, AndroZoo, currently contains more than three million apps, each of which has been analysed by tens of different AntiVirus products to know which applications are detected as malware. We provide this dataset to contribute to ongoing research efforts, as well as to enable new potential research topics on Android apps. By releasing our dataset to the research community, we also aim at encouraging our fellow researchers to engage in reproducible experiments.
{"title":"AndroZoo: Collecting Millions of Android Apps for the Research Community","authors":"Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon","doi":"10.1145/2901739.2903508","DOIUrl":"https://doi.org/10.1145/2901739.2903508","url":null,"abstract":"We present a growing collection of Android Applications col-lected from several sources, including the official GooglePlay app market. Our dataset, AndroZoo, currently contains more than three million apps, each of which has beenanalysed by tens of different AntiVirus products to knowwhich applications are detected as Malware. We provide thisdataset to contribute to ongoing research efforts, as well asto enable new potential research topics on Android Apps.By releasing our dataset to the research community, we alsoaim at encouraging our fellow researchers to engage in reproducible experiments.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"10 1","pages":"468-471"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81158236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demóstenes Sena, Roberta Coelho, U. Kulesza, R. Bonifácio
This paper presents an empirical study whose goal was to investigate the exception handling strategies adopted by Java libraries and their potential impact on client applications. In this study, exception flow analysis was used in combination with manual inspections in order: (i) to characterize the exception handling strategies of existing Java libraries from the perspective of their users; and (ii) to identify exception handling anti-patterns. We extended an existing static analysis tool to reason about exception flows and handler actions of 656 Java libraries selected from 145 categories in the Maven Central Repository. The study findings suggest a current trend towards a high number of undocumented API runtime exceptions (i.e., runtime exceptions not declared via @throws in the Javadoc) and of the Unintended Handler problem. Moreover, we could also identify a considerable number of occurrences of exception handling anti-patterns (e.g., Catch and Ignore). Finally, we also analyzed 647 bug issues of the 7 most popular libraries and identified that 20.71% of the reports are defects related to the exception handling strategy problems and anti-patterns identified in our study. The results of this study point to the need for tools to better understand and document the exception handling behavior of libraries.
{"title":"Understanding the Exception Handling Strategies of Java Libraries: An Empirical Study","authors":"Demóstenes Sena, Roberta Coelho, U. Kulesza, R. Bonifácio","doi":"10.1145/2901739.2901757","DOIUrl":"https://doi.org/10.1145/2901739.2901757","url":null,"abstract":"This paper presents an empirical study whose goal was to investigate the exception handling strategies adopted by Java libraries and their potential impact on the client applications. In this study, exception flow analysis was used in combination with manual inspections in order: (i) to characterize the exception handling strategies of existing Java libraries from the perspective of their users; and (ii) to identify exception handling anti-patterns. We extended an existing static analysis tool to reason about exception flows and handler actions of 656 Java libraries selected from 145 categories in the Maven Central Repository. The study findings suggest a current trend of a high number of undocumented API runtime exceptions (i.e., @throws in Javadoc) and Unintended Handler problem. Moreover, we could also identify a considerable number of occurrences of exception handling anti-patterns (e.g. Catch and Ignore). Finally, we have also analyzed 647 bug issues of the 7 most popular libraries and identified that 20.71% of the reports are defects related to the problems of the exception strategies and anti-patterns identified in our study. The results of this study point to the need of tools to better understand and document the exception handling behavior of libraries.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"16 1","pages":"212-222"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75372666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present a collection of Modern Code Review data for five open source projects. The collection showcases data mined from both an integrated peer review system and source code repositories. We present an easy-to-use and richer data structure to retrieve the 1) People, 2) Process, and 3) Product aspects of the peer review. This paper presents the extraction methodology, the dataset structure, and a collection of database dumps.
{"title":"Mining the Modern Code Review Repositories: A Dataset of People, Process and Product","authors":"Xin Yang, R. Kula, Norihiro Yoshida, Hajimu Iida","doi":"10.1145/2901739.2903504","DOIUrl":"https://doi.org/10.1145/2901739.2903504","url":null,"abstract":"In this paper, we present a collection of Modern Code Review data for five open source projects. The data showcases mined data from both an integrated peer review system and source code repositories. We present an easy–to–use andricher data structure to retrieve the 1.) People 2.) Process and 3.) Product aspects of the peer review. This paperpresents the extraction methodology, the dataset structure, and a collection of database dumps.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"60 1","pages":"460-463"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77668809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Organizations like Mozilla, Microsoft, and Apple are flooded with thousands of automated crash reports per day. Although crash reports contain valuable information for debugging, there are often too many for developers to examine individually. Therefore, in industry, crash reports are often automatically grouped together in buckets. Ubuntu's repository contains crashes from hundreds of software systems available with Ubuntu. A variety of crash report bucketing methods are evaluated using data collected by Ubuntu's Apport automated crash reporting system. The trade-off between precision and recall of numerous scalable crash deduplication techniques is explored. A set of criteria that a crash deduplication method must meet is presented, and several methods that meet these criteria are evaluated on a new dataset. The evaluations presented in this paper show that off-the-shelf information retrieval techniques, which were not designed to be used with crash reports, outperform other techniques that are specifically designed for the task of crash bucketing at realistic industrial scales. This research indicates that automated crash bucketing still has a lot of room for improvement, especially in terms of identifier tokenization.
{"title":"The Unreasonable Effectiveness of Traditional Information Retrieval in Crash Report Deduplication","authors":"Hazel Victoria Campbell, E. Santos, Abram Hindle","doi":"10.1145/2901739.2901766","DOIUrl":"https://doi.org/10.1145/2901739.2901766","url":null,"abstract":"Organizations like Mozilla, Microsoft, and Apple are floodedwith thousands of automated crash reports per day. Although crash reports contain valuable information for debugging, there are often too many for developers to examineindividually. Therefore, in industry, crash reports are oftenautomatically grouped together in buckets. Ubuntu’s repository contains crashes from hundreds of software systemsavailable with Ubuntu. A variety of crash report bucketing methods are evaluated using data collected by Ubuntu’sApport automated crash reporting system. The trade-off between precision and recall of numerous scalable crash deduplication techniques is explored. A set of criteria that acrash deduplication method must meet is presented and several methods that meet these criteria are evaluated on anew dataset. The evaluations presented in this paper showthat using off-the-shelf information retrieval techniques, thatwere not designed to be used with crash reports, outperformother techniques which are specifically designed for the taskof crash bucketing at realistic industrial scales. This researchindicates that automated crash bucketing still has a lot ofroom for improvement, especially in terms of identifier tokenization.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"14 1","pages":"269-280"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79954975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper presents an analysis of developer commit logs for GitHub projects. In particular, developer sentiment in commits is analyzed across 28,466 projects within a seven-year time frame. We use the Boa infrastructure’s online query system to generate commit logs as well as the files that were changed during each commit. Using a sentiment analysis tool, we analyze the commits of projects in three categories (large, medium, and small, based on the number of commits). In addition, we also group the data based on the day of the week the commit was made and map the sentiment to the file change history to determine if there was any correlation. Although a majority of the sentiment was neutral, the negative sentiment was about 10% more than the positive sentiment overall. Tuesdays seem to have the most negative sentiment overall. In addition, we find a strong correlation between the number of files changed and the sentiment expressed in the commits the files were part of. Future work and implications of these results are discussed.
{"title":"Analyzing Developer Sentiment in Commit Logs","authors":"Vinayak Sinha, A. Lazar, Bonita Sharif","doi":"10.1145/2901739.2903501","DOIUrl":"https://doi.org/10.1145/2901739.2903501","url":null,"abstract":"The paper presents an analysis of developer commit logs for GitHub projects. In particular, developer sentiment in commits is analyzed across 28,466 projects within a seven year time frame. We use the Boa infrastructure’s online query system to generate commit logs as well as files that were changed during the commit. We analyze the commits in three categories: large, medium, and small based on the number of commits using a sentiment analysis tool. In addition, we also group the data based on the day of week the commit was made and map the sentiment to the file change history to determine if there was any correlation. Although a majority of the sentiment was neutral, the negative sentiment was about 10% more than the positive sentiment overall. Tuesdays seem to have the most negative sentiment overall. In addition, we do find a strong correlation between the number of files changed and the sentiment expressed by the commits the files were part of. Future work and implications of these results are discussed.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"23 1","pages":"520-523"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89026888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}