
Latest publications: 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

A Panel Data Set of Cryptocurrency Development Activity on GitHub
R. V. Tonder, Asher Trockman, Claire Le Goues
Cryptocurrencies are a significant development in recent years, featuring in global news, the financial sector, and academic research. They also hold a significant presence in open source development, comprising some of the most popular repositories on GitHub. Their openly developed software artifacts thus present a unique and exclusive avenue to quantitatively observe human activity, effort, and software growth for cryptocurrencies. Our data set marks the first concentrated effort toward high-fidelity panel data of cryptocurrency development for a wide range of metrics. The data set is foremost a quantitative measure of developer activity for budding open source cryptocurrency development. We collect metrics like daily commits, contributors, lines of code changes, stars, forks, and subscribers. We also include financial data for each cryptocurrency: the daily price and market capitalization. The data set includes data for 236 cryptocurrencies for 380 days (roughly January 2018 to January 2019). We discuss particularly interesting research opportunities for this combination of data, and release new tooling to enable continuing data collection for future research opportunities as development and application of cryptocurrencies mature.
{"title":"A Panel Data Set of Cryptocurrency Development Activity on GitHub","authors":"R. V. Tonder, Asher Trockman, Claire Le Goues","doi":"10.1109/MSR.2019.00037","DOIUrl":"https://doi.org/10.1109/MSR.2019.00037","url":null,"abstract":"Cryptocurrencies are a significant development in recent years, featuring in global news, the financial sector, and academic research. They also hold a significant presence in open source development, comprising some of the most popular repositories on GitHub. Their openly developed software artifacts thus present a unique and exclusive avenue to quantitatively observe human activity, effort, and software growth for cryptocurrencies. Our data set marks the first concentrated effort toward high-fidelity panel data of cryptocurrency development for a wide range of metrics. The data set is foremost a quantitative measure of developer activity for budding open source cryptocurrency development. We collect metrics like daily commits, contributors, lines of code changes, stars, forks, and subscribers. We also include financial data for each cryptocurrency: the daily price and market capitalization. The data set includes data for 236 cryptocurrencies for 380 days (roughly January 2018 to January 2019). We discuss particularly interesting research opportunities for this combination of data, and release new tooling to enable continuing data collection for future research opportunities as development and application of cryptocurrencies mature.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"90 1","pages":"186-190"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85911406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
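The collection pipeline described above maps naturally onto the GitHub REST API. The sketch below is a minimal illustration of that kind of metric pull, not the authors' released tooling; the repository slug and token are placeholders.

```python
import requests

# Placeholders: a real repository slug and a GitHub personal access token.
REPO = "bitcoin/bitcoin"
TOKEN = "ghp_..."

headers = {"Authorization": f"token {TOKEN}"}

# Repository-level metrics: stars, forks, subscribers.
repo = requests.get(f"https://api.github.com/repos/{REPO}", headers=headers).json()
metrics = {
    "stars": repo["stargazers_count"],
    "forks": repo["forks_count"],
    "subscribers": repo["subscribers_count"],
}

# Daily commit counts for the past year. Note: GitHub may answer 202 while it
# computes these statistics; retry after a short pause in that case.
activity = requests.get(
    f"https://api.github.com/repos/{REPO}/stats/commit_activity", headers=headers
).json()
daily_commits = [count for week in activity for count in week["days"]]

print(metrics, sum(daily_commits))
```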
Data-Driven Solutions to Detect API Compatibility Issues in Android: An Empirical Study
Simone Scalabrino, G. Bavota, M. Linares-Vásquez, Michele Lanza, R. Oliveto
Android apps are inextricably linked to the official Android APIs. Such a strong form of dependency implies that changes introduced in new versions of the Android APIs can severely impact the apps' code, for example because of deprecated or removed APIs. In reaction to those changes, mobile app developers are expected to adapt their code and avoid compatibility issues. To support developers, approaches have been proposed to automatically identify API compatibility issues in Android apps. The state-of-the-art approach, named CiD, is a data-driven solution learning how to detect those issues by analyzing the changes in the history of Android APIs ("API side" learning). While it can successfully identify compatibility issues, it cannot recommend coding solutions. We devised an alternative data-driven approach, named ACRYL. ACRYL learns from changes implemented in other apps in response to API changes ("client side" learning). This allows not only to detect compatibility issues, but also to suggest a fix. When empirically comparing the two tools, we found that there is no clear winner, since the two approaches are highly complementary, in that they identify almost disjointed sets of API compatibility issues. Our results point to the future possibility of combining the two approaches, trying to learn detection/fixing rules on both the API and the client side.
{"title":"Data-Driven Solutions to Detect API Compatibility Issues in Android: An Empirical Study","authors":"Simone Scalabrino, G. Bavota, M. Linares-Vásquez, Michele Lanza, R. Oliveto","doi":"10.1109/MSR.2019.00055","DOIUrl":"https://doi.org/10.1109/MSR.2019.00055","url":null,"abstract":"Android apps are inextricably linked to the official Android APIs. Such a strong form of dependency implies that changes introduced in new versions of the Android APIs can severely impact the apps' code, for example because of deprecated or removed APIs. In reaction to those changes, mobile app developers are expected to adapt their code and avoid compatibility issues. To support developers, approaches have been proposed to automatically identify API compatibility issues in Android apps. The state-of-the-art approach, named CiD, is a data-driven solution learning how to detect those issues by analyzing the changes in the history of Android APIs (\"API side\" learning). While it can successfully identify compatibility issues, it cannot recommend coding solutions. We devised an alternative data-driven approach, named ACRYL. ACRYL learns from changes implemented in other apps in response to API changes (\"client side\" learning). This allows not only to detect compatibility issues, but also to suggest a fix. When empirically comparing the two tools, we found that there is no clear winner, since the two approaches are highly complementary, in that they identify almost disjointed sets of API compatibility issues. Our results point to the future possibility of combining the two approaches, trying to learn detection/fixing rules on both the API and the client side.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"78 1","pages":"288-298"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83136920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
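At its core, API-side detection reduces to checking each API an app uses against that API's lifetime across Android versions. The following toy sketch illustrates the idea under assumed data; it is not CiD's or ACRYL's actual implementation, and the lifetime table entries are invented.

```python
# Invented API lifetime table: fully-qualified API -> (introduced, removed) levels.
API_LIFETIME = {
    "android.app.Notification.Builder.<init>(Context)": (11, 26),
    "android.os.Build.SERIAL": (1, 26),
}

def compatibility_issues(used_apis, min_sdk, target_sdk):
    """Flag APIs that are not available across the app's declared SDK range."""
    issues = []
    for api in used_apis:
        introduced, removed = API_LIFETIME.get(api, (1, None))
        if introduced > min_sdk:
            issues.append((api, f"unavailable below API {introduced}"))
        if removed is not None and removed <= target_sdk:
            issues.append((api, f"removed at API {removed}"))
    return issues

print(compatibility_issues(["android.os.Build.SERIAL"], min_sdk=19, target_sdk=28))
```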
RmvDroid: Towards A Reliable Android Malware Dataset with App Metadata
Haoyu Wang, Junjun Si, Hao Li, Yao Guo
A large number of research studies have been focused on detecting Android malware in recent years. As a result, a reliable and large-scale malware dataset is essential to build effective malware classifiers and evaluate the performance of different detection techniques. Although several Android malware benchmarks have been widely used in our research community, these benchmarks face several major limitations. First, most of the existing datasets are outdated and cannot reflect current malware evolution trends. Second, most of them only rely on VirusTotal to label the ground truth of malware, while some anti-virus engines on VirusTotal may not always report reliable results. Third, all of them only contain the apps themselves (apks), while other important app information (e.g., app description, user rating, and app installs) is missing, which greatly limits the usage scenarios of these datasets. In this paper, we have created a reliable Android malware dataset based on Google Play's app maintenance results over several years. We first created four snapshots of Google Play in 2014, 2015, 2017 and 2018 respectively. Then we use VirusTotal to label apps with possible sensitive behaviors, and monitor these apps on Google Play to see whether Google has removed them or not. Based on this approach, we have created a malware dataset containing 9,133 samples that belong to 56 malware families with high confidence. We believe this dataset will boost a series of research studies including Android malware detection and classification, mining apps for anomalies, and app store mining, etc.
{"title":"RmvDroid: Towards A Reliable Android Malware Dataset with App Metadata","authors":"Haoyu Wang, Junjun Si, Hao Li, Yao Guo","doi":"10.1109/MSR.2019.00067","DOIUrl":"https://doi.org/10.1109/MSR.2019.00067","url":null,"abstract":"A large number of research studies have been focused on detecting Android malware in recent years. As a result, a reliable and large-scale malware dataset is essential to build effective malware classifiers and evaluate the performance of different detection techniques. Although several Android malware benchmarks have been widely used in our research community, these benchmarks face several major limitations. First, most of the existing datasets are outdated and cannot reflect current malware evolution trends. Second, most of them only rely on VirusTotal to label the ground truth of malware, while some anti-virus engines on VirusTotal may not always report reliable results. Third, all of them only contain the apps themselves (apks), while other important app information (e.g., app description, user rating, and app installs) is missing, which greatly limits the usage scenarios of these datasets. In this paper, we have created a reliable Android malware dataset based on Google Play's app maintenance results over several years. We first created four snapshots of Google Play in 2014, 2015, 2017 and 2018 respectively. Then we use VirusTotal to label apps with possible sensitive behaviors, and monitor these apps on Google Play to see whether Google has removed them or not. Based on this approach, we have created a malware dataset containing 9,133 samples that belong to 56 malware families with high confidence. We believe this dataset will boost a series of research studies including Android malware detection and classification, mining apps for anomalies, and app store mining, etc.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"6 1","pages":"404-408"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81181524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 59
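The labeling step combines VirusTotal detections with Google Play removal. A minimal sketch of that combination using the public VirusTotal v3 API follows; the API key is a placeholder, and the detection threshold is our own illustrative choice, not necessarily the paper's.

```python
import requests

VT_API_KEY = "..."  # placeholder VirusTotal API key

def malicious_count(sha256: str) -> int:
    """Number of engines that flag the APK as malicious (VirusTotal v3 API)."""
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/files/{sha256}",
        headers={"x-apikey": VT_API_KEY},
    )
    resp.raise_for_status()
    return resp.json()["data"]["attributes"]["last_analysis_stats"]["malicious"]

def is_candidate_malware(sha256: str, removed_from_play: bool, threshold: int = 10) -> bool:
    """Keep samples flagged by several engines AND later removed from Google Play.
    The threshold of 10 is an illustrative choice, not necessarily the paper's."""
    return removed_from_play and malicious_count(sha256) >= threshold
```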
Can Issues Reported at Stack Overflow Questions be Reproduced? An Exploratory Study
Saikat Mondal, M. M. Rahman, C. Roy
Software developers often look for solutions to their code-level problems at Stack Overflow. Hence, they frequently submit their questions with sample code segments and issue descriptions. Unfortunately, it is not always possible to reproduce their reported issues from such code segments. This phenomenon might prevent their questions from getting prompt and appropriate solutions. In this paper, we report an exploratory study on the reproducibility of the issues discussed in 400 questions of Stack Overflow. In particular, we parse, compile, execute and even carefully examine the code segments from these questions, spending a total of 200 man-hours, and then attempt to reproduce their programming issues. The outcomes of our study are two-fold. First, we find that 68% of the code segments require minor and major modifications in order to reproduce the issues reported by the developers. On the contrary, 22% of the code segments completely fail to reproduce the issues. We also carefully investigate why these issues could not be reproduced and then provide evidence-based guidelines for writing effective code examples for Stack Overflow questions. Second, we investigate the correlation between issue reproducibility status (of questions) and corresponding answer meta-data such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least three times higher chance of receiving an accepted answer than the question with irreproducible issues.
{"title":"Can Issues Reported at Stack Overflow Questions be Reproduced? An Exploratory Study","authors":"Saikat Mondal, M. M. Rahman, C. Roy","doi":"10.1109/MSR.2019.00074","DOIUrl":"https://doi.org/10.1109/MSR.2019.00074","url":null,"abstract":"Software developers often look for solutions to their code level problems at Stack Overflow. Hence, they frequently submit their questions with sample code segments and issue descriptions. Unfortunately, it is not always possible to reproduce their reported issues from such code segments. This phenomenon might prevent their questions from getting prompt and appropriate solutions. In this paper, we report an exploratory study on the reproducibility of the issues discussed in 400 questions of Stack Overflow. In particular, we parse, compile, execute and even carefully examine the code segments from these questions, spent a total of 200 man hours, and then attempt to reproduce their programming issues. The outcomes of our study are two-fold. First, we find that 68% of the code segments require minor and major modifications in order to reproduce the issues reported by the developers. On the contrary, 22% code segments completely fail to reproduce the issues. We also carefully investigate why these issues could not be reproduced and then provide evidence-based guidelines for writing effective code examples for Stack Overflow questions. Second, we investigate the correlation between issue reproducibility status (of questions) and corresponding answer meta-data such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least three times higher chance of receiving an accepted answer than the question with irreproducible issues.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"6 1","pages":"479-489"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82777640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
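The parse/compile/execute pipeline can be approximated mechanically for a single language. Below is a minimal, Python-only sketch of the first stage (does the snippet even parse?); the study itself examined snippets in several languages and went far beyond parsing.

```python
import ast

def parse_status(snippet: str) -> str:
    """Classify a Python snippet from a question: parses as-is or needs changes."""
    try:
        ast.parse(snippet)
        return "parses as-is"
    except SyntaxError:
        return "needs modification"

print(parse_status("for i in range(3) print(i)"))  # -> needs modification
```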
What do Developers Know About Machine Learning: A Study of ML Discussions on StackOverflow
A. A. Bangash, Hareem Sahar, S. Chowdhury, A. W. Wong, Abram Hindle, Karim Ali
Machine learning, a branch of Artificial Intelligence, is now popular in the software engineering community and is successfully used for problems like bug prediction and software development effort estimation. Developers' understanding of machine learning, however, is not clear, and we require investigation to understand what educators should focus on and how different online programming discussion communities can be more helpful. We conduct a study on Stack Overflow (SO) machine learning related posts using the SOTorrent dataset. We found that some machine learning topics are significantly more discussed than others, and others need more attention. We also found that topic generation with Latent Dirichlet Allocation (LDA) can suggest more appropriate tags that can make a machine learning post more visible and thus can help in receiving immediate feedback from sites like SO.
{"title":"What do Developers Know About Machine Learning: A Study of ML Discussions on StackOverflow","authors":"A. A. Bangash, Hareem Sahar, S. Chowdhury, A. W. Wong, Abram Hindle, Karim Ali","doi":"10.1109/MSR.2019.00052","DOIUrl":"https://doi.org/10.1109/MSR.2019.00052","url":null,"abstract":"Machine learning, a branch of Artificial Intelligence, is now popular in software engineering community and is successfully used for problems like bug prediction, and software development effort estimation. Developers' understanding of machine learning, however, is not clear, and we require investigation to understand what educators should focus on, and how different online programming discussion communities can be more helpful. We conduct a study on Stack Overflow (SO) machine learning related posts using the SOTorrent dataset. We found that some machine learning topics are significantly more discussed than others, and others need more attention. We also found that topic generation with Latent Dirichlet Allocation (LDA) can suggest more appropriate tags that can make a machine learning post more visible and thus can help in receiving immediate feedback from sites like SO.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"24 1","pages":"260-264"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82548832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
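A tag-suggestion experiment like the one described can be reproduced with an off-the-shelf LDA implementation. The sketch below uses scikit-learn on a toy corpus; the example posts are invented, and the paper's exact preprocessing is not claimed here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-ins for ML-related Stack Overflow post texts.
posts = [
    "how to tune the learning rate for gradient descent in keras",
    "sklearn random forest feature importance interpretation",
    "loss not decreasing when training an lstm on text data",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# The top words of each topic could serve as candidate tags for a post.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in weights.argsort()[-5:]])
```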
How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects?
Saraj Singh Manes, Olga Baysal
Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers re-use code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine the SOTorrent dataset, which links code snippets in SO posts to software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and what popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least once), raising code maintainability and reliability concerns.
{"title":"How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects?","authors":"Saraj Singh Manes, Olga Baysal","doi":"10.1109/MSR.2019.00047","DOIUrl":"https://doi.org/10.1109/MSR.2019.00047","url":null,"abstract":"Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers re-use code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine SOTorrent dataset that provides connectivity between code snippets on the SO posts with software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and what popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within the JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least ones) raising code maintainability and reliability concerns.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"39 1","pages":"235-239"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78726654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
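Outside of SOTorrent, the same question can be asked of any local checkout by scanning source files for Stack Overflow URLs. A rough sketch follows; the repository path is a placeholder, and the .js filter merely reflects the paper's finding that JavaScript code carried the most references.

```python
import re
from pathlib import Path

SO_LINK = re.compile(r"https?://stackoverflow\.com/(?:questions|q|a|answers)/\d+")

def so_references(repo_root: str) -> dict:
    """Count Stack Overflow URLs referenced in a checked-out repository."""
    counts = {}
    for path in Path(repo_root).rglob("*.js"):  # JS carried the most references
        text = path.read_text(errors="ignore")
        for url in SO_LINK.findall(text):
            counts[url] = counts.get(url, 0) + 1
    return counts

print(so_references("./my-project"))  # placeholder path to a local checkout
```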
Can Duplicate Questions on Stack Overflow Benefit the Software Development Community?
Durham Abric, Oliver E. Clark, M. Caminiti, Keheliya Gallaba, Shane McIntosh
Duplicate questions on Stack Overflow are questions that are flagged as being conceptually equivalent to a previously posted question. Stack Overflow suggests that duplicate questions should not be discussed by users, but rather that attention should be redirected to their previously posted counterparts. Roughly 53% of closed Stack Overflow posts are closed due to duplication. Despite their supposed overlapping content, user activity suggests duplicates may generate additional or superior answers. Approximately 9% of duplicates receive more views than their original counterparts despite being closed. In this paper, we analyze duplicate questions from two perspectives. First, we analyze the experience of those who post duplicates using activity and reputation-based heuristics. Second, we compare the content of duplicates both in terms of their questions and answers to determine the degree of similarity between each duplicate pair. Through analysis of the MSR challenge dataset, we find that although duplicate questions are more likely to be created by inexperienced users, they often receive dissimilar answers to their original counterparts. Indeed, supplementary textual analysis using Natural Language Processing (NLP) techniques suggests duplicate questions provide additional information about the underlying concepts being discussed. We recommend that the Stack Overflow's duplication policy be revised to account for the benefits that leaving duplicate questions open may have for the developer community.
{"title":"Can Duplicate Questions on Stack Overflow Benefit the Software Development Community?","authors":"Durham Abric, Oliver E. Clark, M. Caminiti, Keheliya Gallaba, Shane McIntosh","doi":"10.1109/MSR.2019.00046","DOIUrl":"https://doi.org/10.1109/MSR.2019.00046","url":null,"abstract":"Duplicate questions on Stack Overflow are questions that are flagged as being conceptually equivalent to a previously posted question. Stack Overflow suggests that duplicate questions should not be discussed by users, but rather that attention should be redirected to their previously posted counterparts. Roughly 53% of closed Stack Overflow posts are closed due to duplication. Despite their supposed overlapping content, user activity suggests duplicates may generate additional or superior answers. Approximately 9% of duplicates receive more views than their original counterparts despite being closed. In this paper, we analyze duplicate questions from two perspectives. First, we analyze the experience of those who post duplicates using activity and reputation-based heuristics. Second, we compare the content of duplicates both in terms of their questions and answers to determine the degree of similarity between each duplicate pair. Through analysis of the MSR challenge dataset, we find that although duplicate questions are more likely to be created by inexperienced users, they often receive dissimilar answers to their original counterparts. Indeed, supplementary textual analysis using Natural Language Processing (NLP) techniques suggests duplicate questions provide additional information about the underlying concepts being discussed. We recommend that the Stack Overflow's duplication policy be revised to account for the benefits that leaving duplicate questions open may have for the developer community.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"230-234"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79115236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
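Question-pair similarity of the kind compared here is commonly measured with TF-IDF cosine similarity. The sketch below shows that baseline on an invented pair; it is not the specific NLP pipeline used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# An invented duplicate/original question pair.
duplicate = "How do I parse a JSON string in Java?"
original = "What is the best way to read JSON into Java objects?"

tfidf = TfidfVectorizer(stop_words="english").fit_transform([duplicate, original])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"textual similarity: {score:.2f}")
```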
An Empirical History of Permission Requests and Mistakes in Open Source Android Apps
Gian Luca Scoccia, Anthony S Peruma, Virginia Pujols, Ben Christians, Daniel E. Krutz
Android applications (apps) rely upon proper permission usage to ensure that the user's privacy and security are adequately protected. Unfortunately, developers frequently misuse app permissions in a variety of ways, ranging from using too many permissions to not correctly adhering to Android's defined permission guidelines. The implications of these permission issues (possible permission problems) can range from harming the user's perception of the app to significantly impacting their privacy and security. An imperative component of creating more secure apps that better protect a user's privacy is an improved understanding of how and when these issues are being introduced and repaired. While there are existing permissions-analysis tools and Android datasets, there are no available datasets that contain a large-scale empirical history of permission changes and mistakes. This limitation inhibits both developers and researchers from empirically studying and constructing a holistic understanding of permission-related issues. To address this shortfall with existing resources, we created a dataset of permission-based changes and permission issues in open source Android apps. Our unique dataset contains information from 2,002 apps with commits from 10,601 unique committers, totaling 789,577 commits. We accomplished this by mining app repositories from F-Droid, extracting their version and commit histories, and analyzing this information using two permission analysis tools. Our work creates the foundation for future research in permission decisions and mistakes. Complete project details and data are available on our project website: https://mobilepermissions.github.io
{"title":"An Empirical History of Permission Requests and Mistakes in Open Source Android Apps","authors":"Gian Luca Scoccia, Anthony S Peruma, Virginia Pujols, Ben Christians, Daniel E. Krutz","doi":"10.1109/MSR.2019.00090","DOIUrl":"https://doi.org/10.1109/MSR.2019.00090","url":null,"abstract":"Android applications (apps) rely upon proper permission usage to ensure that the user's privacy and security are adequately protected. Unfortunately, developers frequently misuse app permissions in a variety of ways ranging from using too many permissions to not correctly adhering to Android's defined permission guidelines. The implications of these permissionissues (possible permission problems) can range from harming the user's perception of the app to significantly impacting their privacy and security. An imperative component to creating more secure apps that better protect a user's privacy is an improved understanding of how and when these issues are being introduced and repaired. While there are existing permissions-analysis tools and Android datasets, there are no available datasets that contain a large-scale empirical history of permission changes and mistakes. This limitation inhibits both developers and researchers from empirically studying and constructing a holistic understanding of permission-related issues. To address this shortfall with existing resources, we created a dataset of permission-based changes and permission-issues in open source Android apps. Our unique dataset contains information from 2,002 apps with commits from 10,601 unique committers, totaling 789,577 commits. We accomplished this by mining app repositories from F-Droid, extracting their version and commit histories, and analyzing this information using two permission analysis tools. Our work creates the foundation for future research in permission decisions and mistakes. Complete project details and data is available on our project website: https://mobilepermissions.github.io","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"363 1","pages":"597-601"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75413307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
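Permission histories like these can be recovered from a cloned repository by diffing the manifest across commits. A minimal sketch follows; the manifest path and repository location are placeholders, and the authors' pipeline additionally ran dedicated permission-analysis tools.

```python
import re
import subprocess

USES_PERM = re.compile(r'<uses-permission[^>]*android:name="([^"]+)"')

def permissions_at(repo: str, commit: str,
                   manifest: str = "app/src/main/AndroidManifest.xml") -> set:
    """Permissions declared in the manifest at a commit (empty if file absent)."""
    try:
        blob = subprocess.check_output(
            ["git", "-C", repo, "show", f"{commit}:{manifest}"],
            text=True, stderr=subprocess.DEVNULL,
        )
    except subprocess.CalledProcessError:
        return set()
    return set(USES_PERM.findall(blob))

# Diff two commits to see which permissions were added or dropped.
before = permissions_at("./app-repo", "HEAD~10")
after = permissions_at("./app-repo", "HEAD")
print("added:", after - before, "removed:", before - after)
```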
SeSaMe: A Data Set of Semantically Similar Java Methods
Marius Kamp, Patrick Kreutzer, M. Philippsen
In the past, techniques for detecting similarly behaving code fragments were often only evaluated with small, artificial oracles or with code originating from programming competitions. Such code fragments differ largely from production codes. To enable more realistic evaluations, this paper presents SeSaMe, a data set of method pairs that are classified according to their semantic similarity. We applied text similarity measures on JavaDoc comments mined from 11 open source repositories and manually classified a selection of 857 pairs.
{"title":"SeSaMe: A Data Set of Semantically Similar Java Methods","authors":"Marius Kamp, Patrick Kreutzer, M. Philippsen","doi":"10.1109/MSR.2019.00079","DOIUrl":"https://doi.org/10.1109/MSR.2019.00079","url":null,"abstract":"In the past, techniques for detecting similarly behaving code fragments were often only evaluated with small, artificial oracles or with code originating from programming competitions. Such code fragments differ largely from production codes. To enable more realistic evaluations, this paper presents SeSaMe, a data set of method pairs that are classified according to their semantic similarity. We applied text similarity measures on JavaDoc comments mined from 11 open source repositories and manually classified a selection of 857 pairs.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"110 1","pages":"529-533"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72849201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
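One simple instance of a text similarity measure over JavaDoc comments is token-level Jaccard similarity. The sketch below illustrates the idea on invented comments; the paper applies several measures and does not necessarily use this one.

```python
import re

def tokens(javadoc: str) -> set:
    """Lowercased word tokens of a JavaDoc comment, tags and markup stripped."""
    text = re.sub(r"@\w+|[*/{}<>]", " ", javadoc)
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

doc1 = "/** Returns the element at the specified position in this list. */"
doc2 = "/** Gets the item stored at the given index of the list. */"
print(f"{jaccard(doc1, doc2):.2f}")
```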
A Data Set of Program Invariants and Error Paths
Dirk Beyer
The analysis of correctness proofs and counterexamples of program source code is an important way to gain insights into methods that could make it easier in the future to find invariants to prove a program correct or to find bugs. The availability of high-quality data is often a limiting factor for researchers who want to study real program invariants and real bugs. The described data set provides a large collection of concrete verification results, which can be used in research projects as data source or for evaluation purposes. Each result is made available as verification witness, which represents either program invariants that were used to prove the program correct (correctness witness) or an error path to replay the actual bug (violation witness). The verification results are taken from actual verification runs on 10522 verification problems, using the 31 verification tools that participated in the 8th edition of the International Competition on Software Verification (SV-COMP). The collection contains a total of 125720 verification witnesses together with various meta data and a map to relate a witness to the C program that it originates from. Data set is available at: https://doi.org/10.5281/zenodo.2559175
{"title":"A Data Set of Program Invariants and Error Paths","authors":"Dirk Beyer","doi":"10.1109/MSR.2019.00026","DOIUrl":"https://doi.org/10.1109/MSR.2019.00026","url":null,"abstract":"The analysis of correctness proofs and counterexamples of program source code is an important way to gain insights into methods that could make it easier in the future to find invariants to prove a program correct or to find bugs. The availability of high-quality data is often a limiting factor for researchers who want to study real program invariants and real bugs. The described data set provides a large collection of concrete verification results, which can be used in research projects as data source or for evaluation purposes. Each result is made available as verification witness, which represents either program invariants that were used to prove the program correct (correctness witness) or an error path to replay the actual bug (violation witness). The verification results are taken from actual verification runs on 10522 verification problems, using the 31 verification tools that participated in the 8th edition of the International Competition on Software Verification (SV-COMP). The collection contains a total of 125720 verification witnesses together with various meta data and a map to relate a witness to the C program that it originates from. Data set is available at: https://doi.org/10.5281/zenodo.2559175","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"12 1","pages":"111-115"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84982022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
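Verification witnesses of this kind are distributed as GraphML files, so stored invariants can be read with a stock XML parser. A minimal sketch follows, assuming a correctness witness whose nodes carry `invariant` data keys; the file name is a placeholder.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

def invariants(witness_path: str) -> list:
    """Extract (node id, invariant) pairs from a correctness-witness GraphML file."""
    root = ET.parse(witness_path).getroot()
    found = []
    for node in root.iter(f"{{{GRAPHML_NS['g']}}}node"):
        for data in node.findall("g:data", GRAPHML_NS):
            if data.get("key") == "invariant":
                found.append((node.get("id"), data.text))
    return found

print(invariants("witness.graphml"))  # placeholder file from the data set
```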