A kernel oops is an error report that logs the status of the Linux kernel at the time of a crash. Such a report can provide valuable first-hand information for a Linux kernel maintainer to conduct postmortem debugging. Recently, a repository has been created that systematically collects kernel oopses from Linux users. However, debugging based on only the information in a kernel oops is difficult. We consider the initial problem of finding the offending line, i.e., the line of source code that incurs the crash. For this, we propose a novel algorithm based on approximate sequence matching, as used in bioinformatics, to automatically pinpoint the offending line based on information about nearby machine-code instructions, as found in a kernel oops. Our algorithm achieves 92% accuracy compared to 26% for the traditional approach of using only the oops instruction pointer.
{"title":"Oops! where did that code snippet come from?","authors":"Lisong Guo, J. Lawall, Gilles Muller","doi":"10.1145/2597073.2597094","DOIUrl":"https://doi.org/10.1145/2597073.2597094","url":null,"abstract":"A kernel oops is an error report that logs the status of the Linux kernel at the time of a crash. Such a report can provide valuable first-hand information for a Linux kernel maintainer to conduct postmortem debugging. Recently, a repository has been created that systematically collects kernel oopses from Linux users. However, debugging based on only the information in a kernel oops is difficult. We consider the initial problem of finding the offending line, i.e., the line of source code that incurs the crash. For this, we propose a novel algorithm based on approximate sequence matching, as used in bioinformatics, to automatically pinpoint the offending line based on information about nearby machine-code instructions, as found in a kernel oops. Our algorithm achieves 92% accuracy compared to 26% for the traditional approach of using only the oops instruction pointer.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"46 1","pages":"52-61"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83057963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is currently a vast array of open source projects available on the web, and although they are searchable by name or description in the search engines, there is no way to search for projects by how well they perform on a given set of quality attributes (e.g. usability or maintainability). With OpenHub, we present a scalable and extensible architecture for the static and runtime analysis of open source repositories written in Python, presenting the architecture and pinpointing future possibilities with it.
{"title":"OpenHub: a scalable architecture for the analysis of software quality attributes","authors":"G. Farah, Juan Sebastian Tejada, D. Correal","doi":"10.1145/2597073.2597135","DOIUrl":"https://doi.org/10.1145/2597073.2597135","url":null,"abstract":"There is currently a vast array of open source projects available on the web, and although they are searchable by name or description in the search engines, there is no way to search for projects by how well they perform on a given set of quality attributes (e.g. usability or maintainability). With OpenHub, we present a scalable and extensible architecture for the static and runtime analysis of open source repositories written in Python, presenting the architecture and pinpointing future possibilities with it.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"15 1","pages":"420-423"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90736034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trick question: what is Data Science? The collection and use of low-veracity data in software repositories and other operational support systems is exploding. It is, therefore, imperative to elucidate basic principles of how such data comes into being and what it means. Are there practices of constructing software data analysis tools that could raise the integrity of their results despite the problematic nature of the underlying data? The talk explores the basic nature of data in operational support systems and considers approaches to develop engineering practices for software mining tools.
{"title":"Is mining software repositories data science? (keynote)","authors":"A. Mockus","doi":"10.1145/2597073.2600728","DOIUrl":"https://doi.org/10.1145/2597073.2600728","url":null,"abstract":"Trick question: what is Data Science? The collection and use of low-veracity data in software repositories and other operational support systems is exploding. It is, therefore, imperative to elucidate basic principles of how such data comes into being and what it means. Are there practices of constructing software data analysis tools that could raise the integrity of their results despite the problematic nature of the underlying data? The talk explores the basic nature of data in operational support systems and considers approaches to develop engineering practices for software mining tools.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"24 1","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87055064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
João Brunet, G. Murphy, Ricardo Terra, J. Figueiredo, D. Guerrero
Design is often raised in the literature as important to attaining various properties and characteristics in a software system. At least for open-source projects, it can be hard to find evidence of ongoing design work in the technical artifacts produced as part of the development. Although developers usually do not produce specific design documents, they do communicate about design in different ways. In this paper, we provide quantitative evidence that developers address design through discussions in commits, issues, and pull requests. To achieve this, we built a discussions' classifier and automatically labeled 102,122 discussions from 77 projects. Based on this data, we make four observations about the projects: i) on average, 25% of the discussions in a project are about design; ii) on average, 26% of developers contribute to at least one design discussion; iii) only 1% of the developers contribute to more than 15% of the discussions in a project; and iv) these few developers who contribute to a broad range of design discussions are also the top committers in a project.
{"title":"Do developers discuss design?","authors":"João Brunet, G. Murphy, Ricardo Terra, J. Figueiredo, D. Guerrero","doi":"10.1145/2597073.2597115","DOIUrl":"https://doi.org/10.1145/2597073.2597115","url":null,"abstract":"Design is often raised in the literature as important to attaining various properties and characteristics in a software system. At least for open-source projects, it can be hard to find evidence of ongoing design work in the technical artifacts produced as part of the development. Although developers usually do not produce specific design documents, they do communicate about design in different ways. In this paper, we provide quantitative evidence that developers address design through discussions in commits, issues, and pull requests. To achieve this, we built a discussions' classifier and automatically labeled 102,122 discussions from 77 projects. Based on this data, we make four observations about the projects: i) on average, 25% of the discussions in a project are about design; ii) on average, 26% of developers contribute to at least one design discussion; iii) only 1% of the developers contribute to more than 15% of the discussions in a project; and iv) these few developers who contribute to a broad range of design discussions are also the top committers in a project.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"340-343"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82741274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mona Erfani Joorabchi, Mehdi MirzaAghaei, A. Mesbah
Bug repository systems have become an integral component of software development activities. Ideally, each bug report should help developers to find and fix a software fault. However, there is a subset of reported bugs that is not (easily) reproducible, on which developers spend considerable amounts of time and effort. We present an empirical analysis of non-reproducible bug reports to characterize their rate, nature, and root causes. We mine one industrial and five open-source bug repositories, resulting in 32K non-reproducible bug reports. We (1) compare properties of non-reproducible reports with their counterparts such as active time and number of authors, (2) investigate their life-cycle patterns, and (3) examine 120 Fixed non-reproducible reports. In addition, we qualitatively classify a set of randomly selected non-reproducible bug reports (1,643) into six common categories. Our results show that, on average, non-reproducible bug reports pertain to 17% of all bug reports, remain active three months longer than their counterparts, can be mainly (45%) classified as "Interbug Dependencies'', and 66% of Fixed non-reproducible reports were indeed reproduced and fixed.
{"title":"Works for me! characterizing non-reproducible bug reports","authors":"Mona Erfani Joorabchi, Mehdi MirzaAghaei, A. Mesbah","doi":"10.1145/2597073.2597098","DOIUrl":"https://doi.org/10.1145/2597073.2597098","url":null,"abstract":"Bug repository systems have become an integral component of software development activities. Ideally, each bug report should help developers to find and fix a software fault. However, there is a subset of reported bugs that is not (easily) reproducible, on which developers spend considerable amounts of time and effort. We present an empirical analysis of non-reproducible bug reports to characterize their rate, nature, and root causes. We mine one industrial and five open-source bug repositories, resulting in 32K non-reproducible bug reports. We (1) compare properties of non-reproducible reports with their counterparts such as active time and number of authors, (2) investigate their life-cycle patterns, and (3) examine 120 Fixed non-reproducible reports. In addition, we qualitatively classify a set of randomly selected non-reproducible bug reports (1,643) into six common categories. Our results show that, on average, non-reproducible bug reports pertain to 17% of all bug reports, remain active three months longer than their counterparts, can be mainly (45%) classified as \"Interbug Dependencies'', and 66% of Fixed non-reproducible reports were indeed reproduced and fixed.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"69 1","pages":"62-71"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85575209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Varun Tulsian, Aditya Kanade, Rahul Kumar, A. Lal, A. Nori
With the growing complexity of modern day software, software model checking has become a critical technology for ensuring correctness of software. As is true with any promising technology, there are a number of tools for software model checking. However, their respective performance trade-offs are difficult to characterize accurately – making it difficult for practitioners to select a suitable tool for the task at hand. This paper proposes a technique called MUX that addresses the problem of selecting the most suitable software model checker for a given input instance. MUX performs machine learning on a repository of software verification instances. The algorithm selector, synthesized through machine learning, uses structural features from an input instance, comprising a program-property pair, at runtime and determines which tool to use. We have implemented MUX for Windows device drivers and evaluated it on a number of drivers and model checkers. Our results are promising in that the algorithm selector not only avoids a significant number of timeouts but also improves the total runtime by a large margin, compared to any individual model checker. It also outperforms a portfolio-based algorithm selector being used in Microsoft at present. Besides, MUX identifies structural features of programs that are key factors in determining performance of model checkers.
{"title":"MUX: algorithm selection for software model checkers","authors":"Varun Tulsian, Aditya Kanade, Rahul Kumar, A. Lal, A. Nori","doi":"10.1145/2597073.2597080","DOIUrl":"https://doi.org/10.1145/2597073.2597080","url":null,"abstract":"With the growing complexity of modern day software, software model checking has become a critical technology for ensuring correctness of software. As is true with any promising technology, there are a number of tools for software model checking. However, their respective performance trade-offs are difficult to characterize accurately – making it difficult for practitioners to select a suitable tool for the task at hand. This paper proposes a technique called MUX that addresses the problem of selecting the most suitable software model checker for a given input instance. MUX performs machine learning on a repository of software verification instances. The algorithm selector, synthesized through machine learning, uses structural features from an input instance, comprising a program-property pair, at runtime and determines which tool to use. \u0000 We have implemented MUX for Windows device drivers and evaluated it on a number of drivers and model checkers. Our results are promising in that the algorithm selector not only avoids a significant number of timeouts but also improves the total runtime by a large margin, compared to any individual model checker. It also outperforms a portfolio-based algorithm selector being used in Microsoft at present. Besides, MUX identifies structural features of programs that are key factors in determining performance of model checkers.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"27 1","pages":"132-141"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72768358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the fundamental principles of open-source projects is that they foster collaboration among developers, disregarding their geographical location or personal background. When it comes to software repositories collaboration is a rather ephemeral phenomenon which lacks a clear definition, and it must therefore be mined and modeled. This throws up the question whether what is mined actually maps to reality. In this paper we investigate collaboration by modeling it using a number of diverse approaches that we then compare to a ground truth obtained by surveying a substantial set of developers of the Pharo open-source community. Our findings indicate that the notion of collaboration must be revisited, as it is undermined by a number of factors that are often tackled in imprecise ways or not taken into account at all.
{"title":"Collaboration in open-source projects: myth or reality?","authors":"Y. Tymchuk, Andrea Mocci, Michele Lanza","doi":"10.1145/2597073.2597093","DOIUrl":"https://doi.org/10.1145/2597073.2597093","url":null,"abstract":"One of the fundamental principles of open-source projects is that they foster collaboration among developers, disregarding their geographical location or personal background. When it comes to software repositories collaboration is a rather ephemeral phenomenon which lacks a clear definition, and it must therefore be mined and modeled. This throws up the question whether what is mined actually maps to reality. \u0000 In this paper we investigate collaboration by modeling it using a number of diverse approaches that we then compare to a ground truth obtained by surveying a substantial set of developers of the Pharo open-source community. Our findings indicate that the notion of collaboration must be revisited, as it is undermined by a number of factors that are often tackled in imprecise ways or not taken into account at all.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"55 1","pages":"304-307"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74582315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pull requests form a new method for collaborating in distributed software development. To study the pull request distributed development model, we constructed a dataset of almost 900 projects and 350,000 pull requests, including some of the largest users of pull requests on Github. In this paper, we describe how the project selection was done, we analyze the selected features and present a machine learning tool set for the R statistics environment.
{"title":"A dataset for pull-based development research","authors":"Georgios Gousios, A. Zaidman","doi":"10.1145/2597073.2597122","DOIUrl":"https://doi.org/10.1145/2597073.2597122","url":null,"abstract":"Pull requests form a new method for collaborating in distributed software development. To study the pull request distributed development model, we constructed a dataset of almost 900 projects and 350,000 pull requests, including some of the largest users of pull requests on Github. In this paper, we describe how the project selection was done, we analyze the selected features and present a machine learning tool set for the R statistics environment.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"254 1","pages":"368-371"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77443503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Matragkas, James R. Williams, D. Kolovos, R. Paige
In nature the diversity of species and genes in ecological communities affects the functioning of these communities. Biologists have found out that more diverse communities appear to be more productive than less diverse communities. Moreover such communities appear to be more stable in the face of perturbations. In this paper, we draw the analogy between ecological communities and Open Source Software (OSS) ecosystems, and we investigate the diversity and structure of OSS communities. To address this question we use the MSR 2014 challenge dataset, which includes data from the top-10 software projects for the top programming languages on GitHub. Our findings show that OSS communities on GitHub consist of 3 types of users (core developers, active users, passive users). Moreover, we show that the percentage of core developers and active users does not change as the project grows and that the majority of members of large projects are passive users.
{"title":"Analysing the 'biodiversity' of open source ecosystems: the GitHub case","authors":"N. Matragkas, James R. Williams, D. Kolovos, R. Paige","doi":"10.1145/2597073.2597119","DOIUrl":"https://doi.org/10.1145/2597073.2597119","url":null,"abstract":"In nature the diversity of species and genes in ecological communities affects the functioning of these communities. Biologists have found out that more diverse communities appear to be more productive than less diverse communities. Moreover such communities appear to be more stable in the face of perturbations. In this paper, we draw the analogy between ecological communities and Open Source Software (OSS) ecosystems, and we investigate the diversity and structure of OSS communities. To address this question we use the MSR 2014 challenge dataset, which includes data from the top-10 software projects for the top programming languages on GitHub. Our findings show that OSS communities on GitHub consist of 3 types of users (core developers, active users, passive users). Moreover, we show that the percentage of core developers and active users does not change as the project grows and that the majority of members of large projects are passive users.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"114 1","pages":"356-359"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77625140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Robles, Jesus M. Gonzalez-Barahona, C. Cervigón, A. Capiluppi, Daniel Izquierdo-Cortazar
Because of the distributed and collaborative nature of free / open source software (FOSS) projects, the development effort invested in a project is usually unknown, even after the software has been released. However, this information is becoming of major interest, especially ---but not only--- because of the growth in the number of companies for which FOSS has become relevant for their business strategy. In this paper we present a novel approach to estimate effort by considering data from source code management repositories. We apply our model to the OpenStack project, a FOSS project with more than 1,000 authors, in which several tens of companies cooperate. Based on data from its repositories and together with the input from a survey answered by more than 100 developers, we show that the model offers a simple, but sound way of obtaining software development estimations with bounded margins of error.
{"title":"Estimating development effort in Free/Open source software projects by mining software repositories: a case study of OpenStack","authors":"G. Robles, Jesus M. Gonzalez-Barahona, C. Cervigón, A. Capiluppi, Daniel Izquierdo-Cortazar","doi":"10.1145/2597073.2597107","DOIUrl":"https://doi.org/10.1145/2597073.2597107","url":null,"abstract":"Because of the distributed and collaborative nature of free / open source software (FOSS) projects, the development effort invested in a project is usually unknown, even after the software has been released. However, this information is becoming of major interest, especially ---but not only--- because of the growth in the number of companies for which FOSS has become relevant for their business strategy. In this paper we present a novel approach to estimate effort by considering data from source code management repositories. We apply our model to the OpenStack project, a FOSS project with more than 1,000 authors, in which several tens of companies cooperate. Based on data from its repositories and together with the input from a survey answered by more than 100 developers, we show that the model offers a simple, but sound way of obtaining software development estimations with bounded margins of error.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"137 1","pages":"222-231"},"PeriodicalIF":0.0,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89132881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}