V. Kovalenko, Egor Bogomolov, T. Bryksin, Alberto Bacchelli
One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation – an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code. In this paper, we present PathMiner – an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257].
{"title":"PathMiner: A Library for Mining of Path-Based Representations of Code","authors":"V. Kovalenko, Egor Bogomolov, T. Bryksin, Alberto Bacchelli","doi":"10.1109/MSR.2019.00013","DOIUrl":"https://doi.org/10.1109/MSR.2019.00013","url":null,"abstract":"One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation – an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code. In this paper, we present PathMiner – an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257].","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"31 1","pages":"13-17"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90108187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data produced by version control systems are widely used in software research and development. Version control data users always use the name or email address field to identify the committer or author of a modification. However, developers may use multiple names and email addresses, which brings difficulties for identification of distinct developers. In this paper, we sample 450 Git repositories from GitHub to study the multiple names and email addresses of developers. We conduct a conservative estimation of its prevalence and impact on related measurements. We merge the multiple names and email addresses of a developer through a method of high precision. With the merged identities, we obtain a number of interesting findings, e.g., about 6% of the developers used multiple names or email addresses in more than 60% of the repositories, and they contributed about half of all the commits. Our impact analysis shows that the multiple names and email addresses issue cannot be ignored for the basic related measurements, e.g., the number of developers in a repository. Our results could help researchers and practitioners have a more clear understanding of multiple names and email addresses in practice to improve the accuracy of related measurements.
{"title":"An Empirical Study of Multiple Names and Email Addresses in OSS Version Control Repositories","authors":"Jiaxin Zhu, Jun Wei","doi":"10.1109/MSR.2019.00068","DOIUrl":"https://doi.org/10.1109/MSR.2019.00068","url":null,"abstract":"Data produced by version control systems are widely used in software research and development. Version control data users always use the name or email address field to identify the committer or author of a modification. However, developers may use multiple names and email addresses, which brings difficulties for identification of distinct developers. In this paper, we sample 450 Git repositories from GitHub to study the multiple names and email addresses of developers. We conduct a conservative estimation of its prevalence and impact on related measurements. We merge the multiple names and email addresses of a developer through a method of high precision. With the merged identities, we obtain a number of interesting findings, e.g., about 6% of the developers used multiple names or email addresses in more than 60% of the repositories, and they contributed about half of all the commits. Our impact analysis shows that the multiple names and email addresses issue cannot be ignored for the basic related measurements, e.g., the number of developers in a repository. Our results could help researchers and practitioners have a more clear understanding of multiple names and email addresses in practice to improve the accuracy of related measurements.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"409-420"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89485463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nasif Imtiaz, A. Rahman, Effat Farhana, L. Williams
Static analysis tool alerts can help developers detect potential defects in the code early in the development cycle. However, developers are not always able to respond to the alerts with their preferred action and may turn away from using the tool. In this paper, we qualitatively analyze 280 Stack Overflow (SO) questions regarding static analysis tool alerts to identify the challenges developers face in understanding and responding to these alerts. We find that the most prevalent question on SO is how to ignore and filter alerts, followed by validation of false positives. Our findings confirm prior researchers' findings related to notification communication theory as 44.6% of the SO questions that we analyzed indicate developers face communication challenges.
{"title":"Challenges with Responding to Static Analysis Tool Alerts","authors":"Nasif Imtiaz, A. Rahman, Effat Farhana, L. Williams","doi":"10.1109/MSR.2019.00049","DOIUrl":"https://doi.org/10.1109/MSR.2019.00049","url":null,"abstract":"Static analysis tool alerts can help developers detect potential defects in the code early in the development cycle. However, developers are not always able to respond to the alerts with their preferred action and may turn away from using the tool. In this paper, we qualitatively analyze 280 Stack Overflow (SO) questions regarding static analysis tool alerts to identify the challenges developers face in understanding and responding to these alerts. We find that the most prevalent question on SO is how to ignore and filter alerts, followed by validation of false positives. Our findings confirm prior researchers' findings related to notification communication theory as 44.6% of the SO questions that we analyzed indicate developers face communication challenges.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"55 1","pages":"245-249"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83747370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In order to develop and train defect prediction models, researchers rely on datasets in which a defect is often attributed to a release where the defect itself is discovered. However, in many circumstances, it can happen that a defect is only discovered several releases after its introduction. This might introduce a bias in the dataset, i.e., treating the intermediate releases as defect-free and the latter as defect-prone. We call this phenomenon as "sleeping defects". We call "snoring" the phenomenon where classes are affected by sleeping defects only, that would be treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of the sleeping defects and of the snoring classes. Our results indicate that 1) on all projects, most of the defects in a project slept for more than 20% of the existing releases, and 2) in the majority of the projects the missing rate is more than 25% even if we remove 50% of releases.
{"title":"Snoring: A Noise in Defect Prediction Datasets","authors":"A. Ahluwalia, D. Falessi, M. D. Penta","doi":"10.1109/MSR.2019.00019","DOIUrl":"https://doi.org/10.1109/MSR.2019.00019","url":null,"abstract":"In order to develop and train defect prediction models, researchers rely on datasets in which a defect is often attributed to a release where the defect itself is discovered. However, in many circumstances, it can happen that a defect is only discovered several releases after its introduction. This might introduce a bias in the dataset, i.e., treating the intermediate releases as defect-free and the latter as defect-prone. We call this phenomenon as \"sleeping defects\". We call \"snoring\" the phenomenon where classes are affected by sleeping defects only, that would be treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of the sleeping defects and of the snoring classes. Our results indicate that 1) on all projects, most of the defects in a project slept for more than 20% of the existing releases, and 2) in the majority of the projects the missing rate is more than 25% even if we remove 50% of releases.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"63-67"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82795391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmed Zerouali, Valerio Cosentino, G. Robles, Jesus M. Gonzalez-Barahona, T. Mens
Deploying software packages and services into containers is a popular software engineering practice that increases portability and reusability. Docker, the most popular containerization technology, helps DevOps practitioners in their daily activities. Despite being successfully and increasingly employed, containers may include buggy and vulnerable packages that put at risk the environments in which the containers have been deployed. Existing quality and security monitoring tools provide only limited support to analyze Docker containers, thus forcing practitioners to perform additional manual work or develop adhoc scripts when the analysis goes beyond security purposes. This limitation also affects researchers desiring to empirically study the evolution dynamics of Docker containers and their contained packages. To overcome this limitation, we present ConPan, an automated tool to inspect the characteristics of packages in Docker containers, such as their outdatedness and other possible flaws (e.g., bugs and security vulnerabilities). ConPan comes with a CLI and API, and the analysis results can be presented to the user in a variety of formats.
{"title":"ConPan: A Tool to Analyze Packages in Software Containers","authors":"Ahmed Zerouali, Valerio Cosentino, G. Robles, Jesus M. Gonzalez-Barahona, T. Mens","doi":"10.1109/MSR.2019.00089","DOIUrl":"https://doi.org/10.1109/MSR.2019.00089","url":null,"abstract":"Deploying software packages and services into containers is a popular software engineering practice that increases portability and reusability. Docker, the most popular containerization technology, helps DevOps practitioners in their daily activities. Despite being successfully and increasingly employed, containers may include buggy and vulnerable packages that put at risk the environments in which the containers have been deployed. Existing quality and security monitoring tools provide only limited support to analyze Docker containers, thus forcing practitioners to perform additional manual work or develop adhoc scripts when the analysis goes beyond security purposes. This limitation also affects researchers desiring to empirically study the evolution dynamics of Docker containers and their contained packages. To overcome this limitation, we present ConPan, an automated tool to inspect the characteristics of packages in Docker containers, such as their outdatedness and other possible flaws (e.g., bugs and security vulnerabilities). ConPan comes with a CLI and API, and the analysis results can be presented to the user in a variety of formats.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"8 1","pages":"592-596"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81351279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software Engineering researchers are increasingly using Natural Language Processing (NLP) techniques to automate Software Vulnerabilities (SVs) assessment using the descriptions in public repositories. However, the existing NLP-based approaches suffer from concept drift. This problem is caused by a lack of proper treatment of new (out-of-vocabulary) terms for the evaluation of unseen SVs over time. To perform automated SVs assessment with concept drift using SVs' descriptions, we propose a systematic approach that combines both character and word features. The proposed approach is used to predict seven Vulnerability Characteristics (VCs). The optimal model of each VC is selected using our customized time-based cross-validation method from a list of eight NLP representations and six well-known Machine Learning models. We have used the proposed approach to conduct large-scale experiments on more than 100,000 SVs in the National Vulnerability Database (NVD). The results show that our approach can effectively tackle the concept drift issue of the SVs' descriptions reported from 2000 to 2018 in NVD even without retraining the model. In addition, our approach performs competitively compared to the existing word-only method. We also investigate how to build compact concept-drift-aware models with much fewer features and give some recommendations on the choice of classifiers and NLP representations for SVs assessment.
{"title":"Automated Software Vulnerability Assessment with Concept Drift","authors":"T. H. Le, Bushra Sabir, M. Babar","doi":"10.1109/MSR.2019.00063","DOIUrl":"https://doi.org/10.1109/MSR.2019.00063","url":null,"abstract":"Software Engineering researchers are increasingly using Natural Language Processing (NLP) techniques to automate Software Vulnerabilities (SVs) assessment using the descriptions in public repositories. However, the existing NLP-based approaches suffer from concept drift. This problem is caused by a lack of proper treatment of new (out-of-vocabulary) terms for the evaluation of unseen SVs over time. To perform automated SVs assessment with concept drift using SVs' descriptions, we propose a systematic approach that combines both character and word features. The proposed approach is used to predict seven Vulnerability Characteristics (VCs). The optimal model of each VC is selected using our customized time-based cross-validation method from a list of eight NLP representations and six well-known Machine Learning models. We have used the proposed approach to conduct large-scale experiments on more than 100,000 SVs in the National Vulnerability Database (NVD). The results show that our approach can effectively tackle the concept drift issue of the SVs' descriptions reported from 2000 to 2018 in NVD even without retraining the model. In addition, our approach performs competitively compared to the existing word-only method. We also investigate how to build compact concept-drift-aware models with much fewer features and give some recommendations on the choice of classifiers and NLP representations for SVs assessment.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"51 1","pages":"371-382"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90959753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Android apps must be able to deal with both stop events, which require immediately stopping the execution of the app without losing state information, and start events, which require resuming the execution of the app at the same point it was stopped. Support to these kinds of events must be explicitly implemented by developers who unfortunately often fail to implement the proper logic for saving and restoring the state of an app. As a consequence apps can lose data when moved to background and then back to foreground (e.g., to answer a call) or when the screen is simply rotated. These faults can be the cause of annoying usability issues and unexpected crashes. This paper presents a public benchmark of 110 data loss faults in Android apps that we systematically collected to facilitate research and experimentation with these problems. The benchmark is available on GitLab and includes the faulty apps, the fixed apps (when available), the test cases to automatically reproduce the problems, and additional information that may help researchers in their tasks.
{"title":"A Benchmark of Data Loss Bugs for Android Apps","authors":"O. Riganelli, M. Mobilio, D. Micucci, L. Mariani","doi":"10.1109/MSR.2019.00087","DOIUrl":"https://doi.org/10.1109/MSR.2019.00087","url":null,"abstract":"Android apps must be able to deal with both stop events, which require immediately stopping the execution of the app without losing state information, and start events, which require resuming the execution of the app at the same point it was stopped. Support to these kinds of events must be explicitly implemented by developers who unfortunately often fail to implement the proper logic for saving and restoring the state of an app. As a consequence apps can lose data when moved to background and then back to foreground (e.g., to answer a call) or when the screen is simply rotated. These faults can be the cause of annoying usability issues and unexpected crashes. This paper presents a public benchmark of 110 data loss faults in Android apps that we systematically collected to facilitate research and experimentation with these problems. The benchmark is available on GitLab and includes the faulty apps, the fixed apps (when available), the test cases to automatically reproduce the problems, and additional information that may help researchers in their tasks.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"73 1","pages":"582-586"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85844707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alan Bandeira, Carlos Alberto Medeiros, M. Paixão, P. Maia
Microservices are a new and rapidly growing architectural model aimed at developing highly scalable software solutions based on independently deployable and evolvable components. Due to its novelty, microservice-related discussions are increasing in Q&A websites, such as StackOverflow (SO). In order to understand what is being discussed by the microservice community, this work has applied mining techniques and topic modelling to a manually-curated dataset of 1,043 microservice-related posts from StackOverflow. As a result, we found that 13.68% of microservice technical posts on SO discuss a single technology: Netflix Eureka. Moreover, buzzwords in the microservice ecosystem, e.g., blue/green deployment, were not identified as relevant subjects of discussion on SO. Finally, we show how a high discussion rate on SO may not reflect the popularity of a certain subject within the microservice community.
{"title":"We Need to Talk About Microservices: an Analysis from the Discussions on StackOverflow","authors":"Alan Bandeira, Carlos Alberto Medeiros, M. Paixão, P. Maia","doi":"10.1109/MSR.2019.00051","DOIUrl":"https://doi.org/10.1109/MSR.2019.00051","url":null,"abstract":"Microservices are a new and rapidly growing architectural model aimed at developing highly scalable software solutions based on independently deployable and evolvable components. Due to its novelty, microservice-related discussions are increasing in Q&A websites, such as StackOverflow (SO). In order to understand what is being discussed by the microservice community, this work has applied mining techniques and topic modelling to a manually-curated dataset of 1,043 microservice-related posts from StackOverflow. As a result, we found that 13.68% of microservice technical posts on SO discuss a single technology: Netflix Eureka. Moreover, buzzwords in the microservice ecosystem, e.g., blue/green deployment, were not identified as relevant subjects of discussion on SO. Finally, we show how a high discussion rate on SO may not reflect the popularity of a certain subject within the microservice community.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"64 1","pages":"255-259"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86203535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}