
2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR): Latest Papers

Towards Mining Answer Edits to Extract Evolution Patterns in Stack Overflow
Themistoklis G. Diamantopoulos, Maria-Ioanna Sifaki, A. Symeonidis
The current state of practice dictates that in order to solve a problem encountered when building software, developers ask for help in online platforms, such as Stack Overflow. In this context of collaboration, answers to question posts often undergo several edits to provide the best solution to the problem stated. In this work, we explore the potential of mining Stack Overflow answer edits to extract common patterns when answering a post. In particular, we design a similarity scheme that takes into account the text and code of answer edits and cluster edits according to their semantics. Upon applying our methodology, we provide frequent edit patterns and indicate how they could be used to answer future research questions. Assessing our approach indicates that it can be effective for identifying commonly applied edits, thus illustrating the transformation path from the initial answer to the optimal solution.
DOI: 10.1109/MSR.2019.00043 (https://doi.org/10.1109/MSR.2019.00043), pp. 215-219, published 2019-05-26
Citations: 6
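As a rough illustration of the idea in the entry above (a similarity that considers both the text and the code of an answer edit, followed by grouping similar edits), the sketch below uses Jaccard token overlap and a greedy threshold clustering. The tokenizer, the equal weights, the 0.5 threshold, and every function name are assumptions made for illustration; this listing does not describe the paper's actual scheme.

```python
import re


def tokens(s):
    """Lowercased identifier-like tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[A-Za-z_]\w*", s.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_similarity(e1, e2, w_text=0.5, w_code=0.5):
    """Weighted mix of text and code similarity (the weights are assumed)."""
    text_sim = jaccard(tokens(e1["text"]), tokens(e2["text"]))
    code_sim = jaccard(tokens(e1["code"]), tokens(e2["code"]))
    return w_text * text_sim + w_code * code_sim


def cluster(edits, threshold=0.5):
    """Greedy clustering: an edit joins the first cluster whose first member is similar enough."""
    clusters = []
    for e in edits:
        for c in clusters:
            if edit_similarity(c[0], e) >= threshold:
                c.append(e)
                break
        else:
            clusters.append([e])
    return clusters


# Hypothetical edits: a comment (text) plus the changed code.
edits = [
    {"text": "fix null check", "code": "if x is None: return"},
    {"text": "fix null check bug", "code": "if x is None: return 0"},
    {"text": "update link to docs", "code": ""},
]
groups = cluster(edits)
print([len(g) for g in groups])
```

The two null-check edits land in one group and the documentation edit in another. A real pipeline would likely need a proper code tokenizer and a less order-dependent clustering algorithm.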
Boa Meets Python: A Boa Dataset of Data Science Software in Python Language
Sumon Biswas, Md Johirul Islam, Yijia Huang, Hridesh Rajan
The popularity of the Python programming language has surged in recent years due to its increasing usage in Data Science. The availability of Python repositories on GitHub presents an opportunity for mining software repository research, e.g., suggesting best practices in developing Data Science applications, identifying bug patterns, recommending code enhancements, etc. To enable this research, we have created a new dataset that includes 1,558 mature GitHub projects that develop Python software for Data Science tasks. By analyzing the metadata and code, we have included in our dataset projects that use a diverse set of machine learning libraries and are managed by a variety of users and organizations. The dataset is made publicly available through the Boa infrastructure, both as a collection of raw projects and in a processed form that can be used for performing large-scale analysis using the Boa language. We also present two initial applications to demonstrate the potential of the dataset that could be leveraged by the community.
DOI: 10.1109/MSR.2019.00086 (https://doi.org/10.1109/MSR.2019.00086), pp. 577-581, published 2019-05-26
Citations: 21
Lessons Learned from Using a Deep Tree-Based Model for Software Defect Prediction in Practice
K. Dam, Trang Pham, S. W. Ng, T. Tran, J. Grundy, A. Ghose, Taeksu Kim, Chul-Joo Kim
Defects are common in software systems and cause many problems for software users. Different methods have been developed to make early predictions about the most likely defective modules in large codebases. Most focus on designing features (e.g., complexity metrics) that correlate with potentially defective code. Those approaches, however, do not sufficiently capture the syntax and multiple levels of semantics of source code, a potentially important capability for building accurate prediction models. In this paper, we report on our experience of deploying a new deep learning tree-based defect prediction model in practice. This model is built upon the tree-structured Long Short-Term Memory network, which directly matches the Abstract Syntax Tree representation of source code. We discuss a number of lessons learned from developing the model and evaluating it on two datasets, one from open source projects contributed by our industry partner Samsung and the other from the public PROMISE repository.
DOI: 10.1109/MSR.2019.00017 (https://doi.org/10.1109/MSR.2019.00017), pp. 46-57, published 2019-05-26
Citations: 61
Exploring Word Embedding Techniques to Improve Sentiment Analysis of Software Engineering Texts
Eeshita Biswas, K. Vijay-Shanker, L. Pollock
Sentiment analysis (SA) of text-based software artifacts is increasingly used to extract information for various tasks, including providing code suggestions, improving development team productivity, giving recommendations of software packages and libraries, and recommending comments on defects in source code, code quality, and possibilities for improvement of applications. Studies of state-of-the-art sentiment analysis tools applied to software-related texts have shown varying results based on the techniques and training approaches. In this paper, we investigate the impact of two potential opportunities to improve the training for sentiment analysis of SE artifacts in the context of the use of neural networks customized using the Stack Overflow data developed by Lin et al. We customize the process of sentiment analysis to the software domain, using software domain-specific word embeddings learned from Stack Overflow (SO) posts, and study the impact of software domain-specific word embeddings on the performance of the sentiment analysis tool, as compared to generic word embeddings learned from Google News. We find that the word embeddings learned from the Google News data perform mostly similarly to, and in some cases better than, the word embeddings learned from SO posts. We also study the impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution. We find that oversampling alone, as well as the combination of oversampling and undersampling, helps in improving the performance of a sentiment classifier.
DOI: 10.1109/MSR.2019.00020 (https://doi.org/10.1109/MSR.2019.00020), pp. 68-78, published 2019-05-26
Citations: 23
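The oversampling and undersampling mentioned in the entry above can be illustrated with plain random resampling of a small, skewed label distribution. This is only a sketch under assumptions: the abstract does not say which resampling variants the authors used, and the data, labels, and function names below are hypothetical.

```python
import random
from collections import Counter, defaultdict


def by_label(samples):
    """Group (text, label) pairs by label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def oversample(samples, rng):
    """Duplicate random examples until every class matches the largest one."""
    groups = by_label(samples)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choice(g) for _ in range(target - len(g)))
    return out


def undersample(samples, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = by_label(samples)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [("works fine", "neutral")] * 6 + [("this API is awful", "negative")] * 2
over_counts = Counter(label for _, label in oversample(data, rng))
under_counts = Counter(label for _, label in undersample(data, rng))
print(over_counts, under_counts)
```

Oversampling keeps all original examples but repeats minority ones, while undersampling discards majority examples, which matters more when the dataset is already small.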
The Software Heritage Graph Dataset: Public Software Development Under One Roof
Antoine Pietri, D. Spinellis, Stefano Zacchiroli
Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
DOI: 10.1109/MSR.2019.00030 (https://doi.org/10.1109/MSR.2019.00030), pp. 138-142, published 2019-05-26
Citations: 43
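The "fully-deduplicated Merkle DAG" in the entry above rests on a simple property: every object is identified by a hash of its content, and a directory's identifier is derived from the identifiers of its children, so identical files and subtrees are stored once no matter how many projects contain them. The sketch below shows that property only. The hashing scheme, the `store` dictionary, and the function names are simplifications assumed for illustration, not the identifier scheme used by Software Heritage.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identifier of a file: a hash of its bytes (simplified scheme)."""
    return hashlib.sha1(b"content\0" + data).hexdigest()


def directory_id(entries: dict) -> str:
    """Identifier of a directory: a hash over its sorted (name, child id) pairs."""
    h = hashlib.sha1(b"directory\0")
    for name in sorted(entries):
        h.update(name.encode() + b"\0" + entries[name].encode() + b"\n")
    return h.hexdigest()


store = {}  # id -> object; identical objects collapse onto the same key


def add_tree(tree) -> str:
    """Recursively add nested dicts (directories) and bytes (files) to the store."""
    if isinstance(tree, bytes):
        oid = content_id(tree)
        store[oid] = tree
        return oid
    children = {name: add_tree(sub) for name, sub in tree.items()}
    oid = directory_id(children)
    store[oid] = children
    return oid


# Two hypothetical projects that share the same LICENSE file.
license_text = b"MIT License\n"
repo_a = {"LICENSE": license_text, "src": {"main.py": b"print('a')\n"}}
repo_b = {"LICENSE": license_text, "src": {"main.py": b"print('b')\n"}}
root_a, root_b = add_tree(repo_a), add_tree(repo_b)
print(len(store))
```

The two trees contain eight objects in total, but the shared LICENSE is stored once, so the store holds seven. Commits and repository snapshots extend the same idea by hashing over the identifiers of the objects they point to.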
RapidRelease - A Dataset of Projects and Issues on Github with Rapid Releases
Saket Joshi, S. Chimalakonda
In recent years, there has been a surge in the adoption of the agile development model and continuous integration (CI) in software development. Recent trends have reduced average release cycle lengths to as low as 1-2 weeks, leading to an extensive number of studies in release engineering. Open-source development (OSD) has also witnessed a rapid increase in release rates; however, no large dataset of open-source projects with high release rates exists. In this paper, we introduce the RapidRelease dataset, a data showcase of open-source projects with high release frequency. The dataset hosts 994 projects from GitHub, with over 2 million issue reports. To the best of our knowledge, this is the first dataset that can help researchers empirically study release engineering and agile software development in open-source projects with rapid releases.
DOI: 10.1109/MSR.2019.00088 (https://doi.org/10.1109/MSR.2019.00088), pp. 587-591, published 2019-05-26
Citations: 16
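The "release cycle length" referred to in the entry above is the time between consecutive releases of a project. A minimal sketch of that arithmetic follows; the release dates are made up and the function name is hypothetical, and the listing does not state how the dataset's authors measured or thresholded release frequency.

```python
from datetime import date
from statistics import mean


def cycle_lengths(release_dates):
    """Days between consecutive releases, after sorting the dates."""
    ordered = sorted(release_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical release dates for one project.
releases = [date(2019, 1, 1), date(2019, 1, 8), date(2019, 1, 22), date(2019, 1, 29)]
gaps = cycle_lengths(releases)
avg = mean(gaps)
print(gaps, avg)
```

Here the gaps are 7, 14, and 7 days, so the average falls inside the 1-2 week range the abstract mentions.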
Impacts of Daylight Saving Time on Software Development
J. Hayashi, Yoshiki Higo, S. Matsumoto, S. Kusumoto
Daylight saving time (DST) is observed in many countries and regions. Some software systems do not consider DST at the beginning of their development, for example, systems developed in regions where DST is not observed. However, such systems may have to consider DST at the request of their users. Until now, there has been no study of the impacts of DST on software development. In this paper, we study the impacts of DST on software development by mining repositories on GitHub. We analyze the dates when code related to DST is changed, and we analyze the regions where the developers who applied the changes live. Furthermore, we classify the changes into several patterns.
DOI: 10.1109/MSR.2019.00076 (https://doi.org/10.1109/MSR.2019.00076), pp. 502-506, published 2019-05-26
Citations: 3
Test Coverage in Python Programs
Hongyu Zhai, Casey Casalnuovo, Premkumar T. Devanbu
We study code coverage in several popular Python projects: flask, matplotlib, pandas, scikit-learn, and scrapy. Coverage data on these projects is gathered and hosted on the Codecov website, from where this data can be mined. Using this data, and a syntactic parse of the code, we examine the effect of control flow structure, statement type (e.g., if, for) and code age on test coverage. We find that coverage depends on control flow structure, with more deeply nested statements being significantly less likely to be covered. This is a clear effect, which holds up in every project, even when controlling for the age of the line (as determined by git blame). We find that the age of a line per se has a small (but statistically significant) positive effect on coverage. Finally, we find that the kind of statement (try, if, except, raise, etc.) has varying effects on coverage, with exception-handling statements being covered much less often. These results suggest that developers in Python projects have difficulty writing test sets that cover deeply nested and error-handling statements, and might need assistance covering such code.
DOI: 10.1109/MSR.2019.00027 (https://doi.org/10.1109/MSR.2019.00027), pp. 116-120, published 2019-05-26
Citations: 13
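The syntactic side of the analysis in the entry above (statement type and nesting depth per line) can be approximated with Python's standard `ast` module. The sketch below is an assumption about how such features might be extracted, not the authors' tooling; joining the rows with per-line coverage data from Codecov is left out, and the sample function is hypothetical.

```python
import ast

SOURCE = '''
def load(path):
    try:
        with open(path) as fh:
            for line in fh:
                if line.strip():
                    yield line
    except OSError:
        raise RuntimeError(path)
'''


def statement_depths(source):
    """Return sorted (line number, statement type, nesting depth) tuples."""
    rows = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                rows.append((child.lineno, type(child).__name__, depth))
                visit(child, depth + 1)  # statements inside it are one level deeper
            else:
                # Expressions, arguments and except handlers add no level themselves.
                visit(child, depth)

    visit(ast.parse(source), 0)
    return sorted(rows)


rows = statement_depths(SOURCE)
for lineno, kind, depth in rows:
    print(lineno, kind, depth)
```

In this sample, the `yield` statement sits at depth 5 and the `raise` inside the `except` block at depth 2, which are the two kinds of code (deeply nested and error-handling) the abstract reports as least covered.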
Characterizing Duplicate Code Snippets between Stack Overflow and Tutorials
Manziba Akanda Nishi, Agnieszka Ciborowska, Kostadin Damevski
Developers are usually unaware of the quality and lineage of information available on popular Web resources, leading to potential maintenance problems and license violations when reusing code snippets from these resources. In this paper, we study the duplication of code snippets between two popular sources of software development information: the Stack Overflow Q&A site and tutorials. [...] A significant number (31%) of answers that contained a duplicate code block were chosen as the accepted answer. Qualitative analysis reveals that developers commonly use Stack Overflow to ask clarifying questions about code they reused from tutorials, and copy code snippets from tutorials to provide answers to questions.
DOI: 10.1109/MSR.2019.00048 (https://doi.org/10.1109/MSR.2019.00048), pp. 240-244, published 2019-05-26
Citations: 7
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineering Tools
Preetha Chatterjee, Kostadin Damevski, L. Pollock, Vinay Augustine, Nicholas A. Kraft
Modern software development communities are increasingly social. Popular chat platforms such as Slack host public chat communities that focus on specific development topics such as Python or Ruby-on-Rails. Conversations in these public chats often follow a Q&A format, with someone seeking information and others providing answers in chat form. In this paper, we describe an exploratory study into the potential usefulness and challenges of mining developer Q&A conversations for supporting software maintenance and evolution tools. We designed the study to investigate the availability of information that has been successfully mined from other developer communications, particularly Stack Overflow. We also analyze characteristics of chat conversations that might inhibit accurate automated analysis. Our results indicate the prevalence of useful information, including API mentions and code snippets with descriptions, and several hurdles that need to be overcome to automate mining that information.
DOI: https://doi.org/10.1109/MSR.2019.00075 · pp. 490-501 · Published 2019-05-26
Citations: 50