Source code in software systems has been shown to have a good degree of repetitiveness at the lexical, syntactical, and API usage levels. This paper presents a large-scale study on the repetitiveness, containment, and composability of source code at the semantic level. We collected a large dataset consisting of 9,224 Java projects with 2.79M class files and 17.54M methods comprising 187M SLOC. For each method in a project, we build the program dependency graph (PDG) to represent a routine, and compare PDGs with one another as well as the subgraphs within them. We found that within a project, 12.1% of the routines are repeated, and most of them repeat 2–7 times. Taken in their entirety, the routines are quite project-specific: only 3.3% of them are repeated exactly in 1–4 other projects, at most 8 times. We also found that 26.1% and 7.27% of the routines are contained in other routine(s), i.e., implemented as part of other routine(s) elsewhere within a project and in other projects, respectively. Except for trivial routines, their repetitiveness and containment are independent of their complexity. Defining a subroutine via a per-variable slicing subgraph in a PDG, we found that 14.3% of all routines have all of their subroutines repeated. A high percentage of the subroutines in a routine can be found/reused elsewhere. We collected 8,764,971 unique subroutines (including 323,564 unique JDK subroutines) as basic units for code searching/synthesis. We also discuss practical implications of our findings for automated tools.
{"title":"A Large-Scale Study on Repetitiveness, Containment, and Composability of Routines in Open-Source Projects","authors":"A. Nguyen, H. Nguyen, T. Nguyen","doi":"10.1145/2901739.2901759","DOIUrl":"https://doi.org/10.1145/2901739.2901759","url":null,"abstract":"Source code in software systems has been shown to have a good degree of repetitiveness at the lexical, syntactical, and API usage levels. This paper presents a large-scale study on the repetitiveness, containment, and composability of source code at the semantic level. We collected a large dataset consisting of 9,224 Java projects with 2.79M class files, 17.54M methods with 187M SLOCs. For each method in a project, we build the program dependency graph (PDG) to represent a routine, and compare PDGs with one another as well as the subgraphs within them. We found that within a project, 12.1% of the routines are repeated, and most of them repeat from 2–7 times. As entirety, the routines are quite project-specific with only 3.3% of them exactly repeating in 1–4 other projects with at most 8 times. We also found that 26.1% and 7.27% of the routines are contained in other routine(s), i.e., implemented as part of other routine(s) elsewhere within a project and in other projects, respectively. Except for trivial routines, their repetitiveness and containment is independent of their complexity. Defining a subroutine via a per-variable slicing subgraph in a PDG, we found that 14.3% of all routines have all of their subroutines repeated. A high percentage of subroutines in a routine can be found/reused elsewhere. We collected 8,764,971 unique subroutines (with 323,564 unique JDK subroutines) as basic units for code searching/synthesis. We also provide practical implications of our findings to automated tools.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"21 1","pages":"362-373"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84252999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sentiment analysis has been adopted in software engineering for problems such as software usability and the sentiment of developers in open-source projects. This paper proposes a method to evaluate the sentiment contained in tickets for IT (Information Technology) support. IT tickets are broad in coverage (e.g., infrastructure, software) and involve errors, incidents, requests, etc. The main challenge is to automatically distinguish factual information, which is intrinsically negative (e.g., an error description), from the sentiment embedded in the description. Our approach is to automatically create a Domain Dictionary that contains terms with sentiment in the IT context, which is used to filter the terms in a ticket for sentiment analysis. We experiment with and evaluate three approaches for calculating the polarity of terms in tickets. Our study was developed using 34,895 tickets from five organizations, from which we randomly selected 2,333 tickets to compose a Gold Standard. Our best results display an average precision and recall of 82.83% and 88.42%, which outperforms the compared sentiment analysis solutions.
{"title":"Sentiment Analysis in Tickets for IT Support","authors":"Cássio Castaldi Araújo Blaz, Karin Becker","doi":"10.1145/2901739.2901781","DOIUrl":"https://doi.org/10.1145/2901739.2901781","url":null,"abstract":"Sentiment analysis has been adopted in software engineeringfor problems such as software usability and sentimentof developers in open-source projects. This paper proposesa method to evaluate the sentiment contained in tickets forIT (Information Technology) support.IT tickets are broadin coverage (e.g. infrastructure, software), and involve errors,incidents, requests, etc. The main challenge is to automaticallydistinguish between factual information, whichis intrinsically negative (e.g. error description), from thesentiment embedded in the description. Our approach isto automatically create a Domain Dictionary that containsterms with sentiment in the IT context, used to filter termsin ticket for sentiment analysis. We experiment and evaluatethree approaches for calculating the polarity of terms intickets. Our study was developed using 34,895 tickets fromfive organizations, from which we randomly selected 2,333tickets to compose a Gold Standard. Our best results displayan average precision and recall of 82.83% and 88.42%, whichoutperforms the compared sentiment analysis solutions.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"69 1","pages":"235-246"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90470210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Md Tajmilur Rahman, Louis-Philippe Querel, Peter C. Rigby, Bram Adams
Continuous delivery and rapid releases have led to innovative techniques for integrating new features and bug fixes into a new release faster. To reduce the probability of integration conflicts, major software companies, including Google, Facebook and Netflix, use feature toggles to incrementally integrate and test new features instead of integrating the feature only when it’s ready. Even after release, feature toggles allow operations managers to quickly disable a new feature that is behaving erratically or to enable certain features only for certain groups of customers. Since the literature on feature toggles is surprisingly slim, this paper tries to understand the prevalence and impact of feature toggles. First, we conducted a quantitative analysis of feature toggle usage across 39 releases of Google Chrome (spanning five years of release history). Then, we studied the technical debt involved with feature toggles by mining a spreadsheet used by Google developers for feature toggle maintenance. Finally, we performed a thematic analysis of videos and blog posts of release engineers at major software companies in order to further understand the strengths and drawbacks of feature toggles in practice. We also validated our findings with four Google developers. We find that toggles can reconcile rapid releases with long-term feature development and allow flexible control over which features to deploy. However, they also introduce technical debt and additional maintenance for developers.
{"title":"Feature Toggles: Practitioner Practices and a Case Study","authors":"Md Tajmilur Rahman, Louis-Philippe Querel, Peter C. Rigby, Bram Adams","doi":"10.1145/2901739.2901745","DOIUrl":"https://doi.org/10.1145/2901739.2901745","url":null,"abstract":"Continuous delivery and rapid releases have led to innovative techniques for integrating new features and bug fixes into a new release faster. To reduce the probability of integration conflicts, major software companies, including Google, Facebook and Netflix, use feature toggles to incrementally integrate and test new features instead of integrating the feature only when it’s ready. Even after release, feature toggles allow operations managers to quickly disable a new feature that is behaving erratically or to enable certain features only for certain groups of customers. Since literature on feature toggles is surprisingly slim, this paper tries to understand the prevalence and impact of feature toggles. First, we conducted a quantitative analysis of feature toggle usage across 39 releases of Google Chrome (spanning five years of release history). Then, we studied the technical debt involved with feature toggles by mining a spreadsheet used by Google developers for feature toggle maintenance. Finally, we performed thematic analysis of videos and blog posts of release engineers at major software companies in order to further understand the strengths and drawbacks of feature toggles in practice. We also validated our findings with four Google developers. We find that toggles can reconcile rapid releases with long-term feature development and allow flexible control over which features to deploy. However they also introduce technical debt and additional maintenance for developers.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"74 1","pages":"201-211"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90271724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods for predicting issue lifetime can help software project managers to prioritize issues and allocate resources accordingly. Previous studies on issue lifetime prediction have focused on models built from static features, meaning features calculated at one snapshot of the issue's lifetime based on data associated with the issue itself. However, during its lifetime, an issue typically receives comments from various stakeholders, which may carry valuable insights into its perceived priority and difficulty and may thus be exploited to update lifetime predictions. Moreover, the lifetime of an issue depends not only on characteristics of the issue itself, but also on the state of the project as a whole. Hence, issue lifetime prediction may benefit from taking into account features capturing the issue's context (contextual features). In this work, we analyze issues from more than 4,000 GitHub projects and build models to predict, at different points in an issue's lifetime, whether or not the issue will close within a given calendar period, by combining static, dynamic and contextual features. The results show that dynamic and contextual features complement the predictive power of static ones, particularly for long-term predictions.
{"title":"Using Dynamic and Contextual Features to Predict Issue Lifetime in GitHub Projects","authors":"R. Kikas, M. Dumas, Dietmar Pfahl","doi":"10.1145/2901739.2901751","DOIUrl":"https://doi.org/10.1145/2901739.2901751","url":null,"abstract":"Methods for predicting issue lifetime can help software project managers to prioritize issues and allocate resources accordingly. Previous studies on issue lifetime prediction have focused on models built from static features, meaning features calculated at one snapshot of the issue's lifetime based on data associated to the issue itself. However, during its lifetime, an issue typically receives comments from various stakeholders, which may carry valuable insights into its perceived priority and difficulty and may thus be exploited to update lifetime predictions. Moreover, the lifetime of an issue depends not only on characteristics of the issue itself, but also on the state of the project as a whole. Hence, issue lifetime prediction may benefit from taking into account features capturing the issue's context (contextual features). In this work, we analyze issues from more than 4000 GitHub projects and build models to predict, at different points in an issue's lifetime, whether or not the issue will close within a given calendric period, by combining static, dynamic and contextual features. The results show that dynamic and contextual features complement the predictive power of static ones, particularly for long-term predictions.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"118 1","pages":"291-302"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79522980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Md Ahasanuzzaman, M. Asaduzzaman, C. Roy, Kevin A. Schneider
Stack Overflow is a popular question answering site that is focused on programming problems. Despite efforts to prevent asking questions that have already been answered, the site contains duplicate questions. This may cause developers to unnecessarily wait for a question to be answered when it has already been asked and answered. The site currently depends on its moderators and users with high reputation to manually mark those questions as duplicates, which not only results in delayed responses but also requires additional effort. In this paper, we first perform a manual investigation to understand why users submit duplicate questions in Stack Overflow. Based on our manual investigation, we propose a classification technique that uses a number of carefully chosen features to identify duplicate questions. Evaluation using a large number of questions shows that our technique can detect duplicate questions with reasonable accuracy. We also compare our technique with DupPredictor, a state-of-the-art technique for detecting duplicate questions, and find that our proposed technique achieves a better recall rate.
{"title":"Mining Duplicate Questions of Stack Overflow","authors":"Md Ahasanuzzaman, M. Asaduzzaman, C. Roy, Kevin A. Schneider","doi":"10.1145/2901739.2901770","DOIUrl":"https://doi.org/10.1145/2901739.2901770","url":null,"abstract":"Stack Overflow is a popular question answering site that is focused on programming problems. Despite efforts to prevent asking questions that have already been answered, the site contains duplicate questions. This may cause developers to unnecessarily wait for a question to be answered when it has already been asked and answered. The site currently depends on its moderators and users with high reputation to manually mark those questions as duplicates, which not only results in delayed responses but also requires additional efforts. In this paper, we first perform a manual investigation to understand why users submit duplicate questions in Stack Overflow. Based on our manual investigation we propose a classification technique that uses a number of carefully chosen features to identify duplicate questions. Evaluation using a large number of questions shows that our technique can detect duplicate questions with reasonable accuracy. We also compare our technique with DupPredictor, a state-of-the-art technique for detecting duplicate questions, and we found that our proposed technique has a better recall rate than that technique.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"33 1","pages":"402-412"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79709463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon
We present a growing collection of Android applications collected from several sources, including the official Google Play app market. Our dataset, AndroZoo, currently contains more than three million apps, each of which has been analysed by tens of different AntiVirus products to know which applications are detected as malware. We provide this dataset to contribute to ongoing research efforts, as well as to enable new potential research topics on Android apps. By releasing our dataset to the research community, we also aim at encouraging our fellow researchers to engage in reproducible experiments.
{"title":"AndroZoo: Collecting Millions of Android Apps for the Research Community","authors":"Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon","doi":"10.1145/2901739.2903508","DOIUrl":"https://doi.org/10.1145/2901739.2903508","url":null,"abstract":"We present a growing collection of Android Applications col-lected from several sources, including the official GooglePlay app market. Our dataset, AndroZoo, currently contains more than three million apps, each of which has beenanalysed by tens of different AntiVirus products to knowwhich applications are detected as Malware. We provide thisdataset to contribute to ongoing research efforts, as well asto enable new potential research topics on Android Apps.By releasing our dataset to the research community, we alsoaim at encouraging our fellow researchers to engage in reproducible experiments.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"10 1","pages":"468-471"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81158236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demóstenes Sena, Roberta Coelho, U. Kulesza, R. Bonifácio
This paper presents an empirical study whose goal was to investigate the exception handling strategies adopted by Java libraries and their potential impact on client applications. In this study, exception flow analysis was used in combination with manual inspections in order: (i) to characterize the exception handling strategies of existing Java libraries from the perspective of their users; and (ii) to identify exception handling anti-patterns. We extended an existing static analysis tool to reason about exception flows and handler actions of 656 Java libraries selected from 145 categories in the Maven Central Repository. The study findings suggest a current trend towards a high number of undocumented API runtime exceptions (i.e., runtime exceptions not declared via @throws in the Javadoc) and of the Unintended Handler problem. Moreover, we could also identify a considerable number of occurrences of exception handling anti-patterns (e.g., Catch and Ignore). Finally, we also analyzed 647 bug issues of the 7 most popular libraries and identified that 20.71% of the reports are defects related to the exception handling strategy problems and anti-patterns identified in our study. The results of this study point to the need for tools to better understand and document the exception handling behavior of libraries.
{"title":"Understanding the Exception Handling Strategies of Java Libraries: An Empirical Study","authors":"Demóstenes Sena, Roberta Coelho, U. Kulesza, R. Bonifácio","doi":"10.1145/2901739.2901757","DOIUrl":"https://doi.org/10.1145/2901739.2901757","url":null,"abstract":"This paper presents an empirical study whose goal was to investigate the exception handling strategies adopted by Java libraries and their potential impact on the client applications. In this study, exception flow analysis was used in combination with manual inspections in order: (i) to characterize the exception handling strategies of existing Java libraries from the perspective of their users; and (ii) to identify exception handling anti-patterns. We extended an existing static analysis tool to reason about exception flows and handler actions of 656 Java libraries selected from 145 categories in the Maven Central Repository. The study findings suggest a current trend of a high number of undocumented API runtime exceptions (i.e., @throws in Javadoc) and Unintended Handler problem. Moreover, we could also identify a considerable number of occurrences of exception handling anti-patterns (e.g. Catch and Ignore). Finally, we have also analyzed 647 bug issues of the 7 most popular libraries and identified that 20.71% of the reports are defects related to the problems of the exception strategies and anti-patterns identified in our study. The results of this study point to the need of tools to better understand and document the exception handling behavior of libraries.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"16 1","pages":"212-222"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75372666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present a collection of Modern Code Review data for five open source projects. The collection showcases data mined from both an integrated peer review system and source code repositories. We present an easy-to-use and richer data structure to retrieve the 1) People, 2) Process, and 3) Product aspects of the peer review. This paper presents the extraction methodology, the dataset structure, and a collection of database dumps.
{"title":"Mining the Modern Code Review Repositories: A Dataset of People, Process and Product","authors":"Xin Yang, R. Kula, Norihiro Yoshida, Hajimu Iida","doi":"10.1145/2901739.2903504","DOIUrl":"https://doi.org/10.1145/2901739.2903504","url":null,"abstract":"In this paper, we present a collection of Modern Code Review data for five open source projects. The data showcases mined data from both an integrated peer review system and source code repositories. We present an easy–to–use andricher data structure to retrieve the 1.) People 2.) Process and 3.) Product aspects of the peer review. This paperpresents the extraction methodology, the dataset structure, and a collection of database dumps.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"60 1","pages":"460-463"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77668809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Organizations like Mozilla, Microsoft, and Apple are flooded with thousands of automated crash reports per day. Although crash reports contain valuable information for debugging, there are often too many for developers to examine individually. Therefore, in industry, crash reports are often automatically grouped together in buckets. Ubuntu's repository contains crashes from hundreds of software systems available with Ubuntu. A variety of crash report bucketing methods are evaluated using data collected by Ubuntu's Apport automated crash reporting system. The trade-off between precision and recall of numerous scalable crash deduplication techniques is explored. A set of criteria that a crash deduplication method must meet is presented, and several methods that meet these criteria are evaluated on a new dataset. The evaluations presented in this paper show that off-the-shelf information retrieval techniques, which were not designed to be used with crash reports, outperform other techniques that are specifically designed for the task of crash bucketing at realistic industrial scales. This research indicates that automated crash bucketing still has a lot of room for improvement, especially in terms of identifier tokenization.
{"title":"The Unreasonable Effectiveness of Traditional Information Retrieval in Crash Report Deduplication","authors":"Hazel Victoria Campbell, E. Santos, Abram Hindle","doi":"10.1145/2901739.2901766","DOIUrl":"https://doi.org/10.1145/2901739.2901766","url":null,"abstract":"Organizations like Mozilla, Microsoft, and Apple are floodedwith thousands of automated crash reports per day. Although crash reports contain valuable information for debugging, there are often too many for developers to examineindividually. Therefore, in industry, crash reports are oftenautomatically grouped together in buckets. Ubuntu’s repository contains crashes from hundreds of software systemsavailable with Ubuntu. A variety of crash report bucketing methods are evaluated using data collected by Ubuntu’sApport automated crash reporting system. The trade-off between precision and recall of numerous scalable crash deduplication techniques is explored. A set of criteria that acrash deduplication method must meet is presented and several methods that meet these criteria are evaluated on anew dataset. The evaluations presented in this paper showthat using off-the-shelf information retrieval techniques, thatwere not designed to be used with crash reports, outperformother techniques which are specifically designed for the taskof crash bucketing at realistic industrial scales. This researchindicates that automated crash bucketing still has a lot ofroom for improvement, especially in terms of identifier tokenization.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"14 1","pages":"269-280"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79954975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper presents an analysis of developer commit logs for GitHub projects. In particular, developer sentiment in commits is analyzed across 28,466 projects within a seven-year time frame. We use the Boa infrastructure’s online query system to generate commit logs as well as the files that were changed during each commit. Using a sentiment analysis tool, we analyze the commits of projects in three categories (large, medium, and small, based on the number of commits). In addition, we also group the data based on the day of the week the commit was made and map the sentiment to the file change history to determine if there was any correlation. Although a majority of the sentiment was neutral, the negative sentiment was about 10% more than the positive sentiment overall. Tuesdays seem to have the most negative sentiment overall. In addition, we find a strong correlation between the number of files changed and the sentiment expressed in the commits the files were part of. Future work and implications of these results are discussed.
{"title":"Analyzing Developer Sentiment in Commit Logs","authors":"Vinayak Sinha, A. Lazar, Bonita Sharif","doi":"10.1145/2901739.2903501","DOIUrl":"https://doi.org/10.1145/2901739.2903501","url":null,"abstract":"The paper presents an analysis of developer commit logs for GitHub projects. In particular, developer sentiment in commits is analyzed across 28,466 projects within a seven year time frame. We use the Boa infrastructure’s online query system to generate commit logs as well as files that were changed during the commit. We analyze the commits in three categories: large, medium, and small based on the number of commits using a sentiment analysis tool. In addition, we also group the data based on the day of week the commit was made and map the sentiment to the file change history to determine if there was any correlation. Although a majority of the sentiment was neutral, the negative sentiment was about 10% more than the positive sentiment overall. Tuesdays seem to have the most negative sentiment overall. In addition, we do find a strong correlation between the number of files changed and the sentiment expressed by the commits the files were part of. Future work and implications of these results are discussed.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"23 1","pages":"520-523"},"PeriodicalIF":0.0,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89026888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}