2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR): Latest Articles


A Deeper Look into Bug Fixes: Patterns, Replacements, Deletions, and Additions

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2903495

Mauricio Soto, Ferdian Thung, Chu-Pan Wong, Claire Le Goues, D. Lo

Many implementations of research techniques that automatically repair software bugs target programs written in C. Work that targets Java often begins from or compares to direct translations of such techniques to a Java context. However, Java and C are very different languages, and Java should be studied to inform the construction of repair approaches to target it. We conduct a large-scale study of bug-fixing commits in Java projects, focusing on assumptions underlying common search-based repair approaches. We make observations that can be leveraged to guide high quality automatic software repair to target Java specifically, including common and uncommon statement modifications in human patches and the applicability of previously-proposed patch construction operators in the Java context.


Citations: 56

A Dataset of Simplified Syntax Trees for C#

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2903507

Sebastian Proksch, Sven Amann, Sarah Nadi, M. Mezini

In this paper, we present a curated collection of 2833 C# solutions taken from GitHub. We encode the data in a new intermediate representation (IR) that facilitates further analysis by restricting the complexity of the syntax tree and by avoiding implicit information. The dataset is intended as a standardized input for research on recommendation systems for software engineering, but is also useful in many other areas that analyze source code.


Citations: 14

Does Your Configuration Code Smell?

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2901761

Tushar Sharma, Marios Fragkoulis, D. Spinellis

Infrastructure as Code (IaC) is the practice of specifying computing system configurations through code, and managing them through traditional software engineering methods. The wide adoption of configuration management and the increasing size and complexity of the associated code call for assessing, maintaining, and improving the configuration code's quality. In this context, traditional software engineering knowledge and best practices associated with code quality management can be leveraged to assess and manage configuration code quality. We propose a catalog of 13 implementation and 11 design configuration smells, where each smell violates recommended best practices for configuration code. We analyzed 4,621 Puppet repositories containing 8.9 million lines of code and detected the cataloged implementation and design configuration smells. Our analysis reveals that the design configuration smells show 9% higher average co-occurrence among themselves than the implementation configuration smells. We also observed that configuration smells belonging to a smell category tend to co-occur with configuration smells belonging to another smell category when correlation is computed by volume of identified smells. Finally, design configuration smell density shows negative correlation whereas implementation configuration smell density exhibits no correlation with the size of a configuration management system.


Citations: 129

Topic Modeling of NASA Space System Problem Reports: Research in Practice

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2901760

L. Layman, A. Nikora, Joshua Meek, T. Menzies

Problem reports at NASA are similar to bug reports: they capture defects found during test, post-launch operational anomalies, and document the investigation and corrective action of the issue. These artifacts are a rich source of lessons learned for NASA, but are expensive to analyze since problem reports are composed primarily of natural language text. We apply topic modeling to a corpus of NASA problem reports to extract trends in testing and operational failures. We collected 16,669 problem reports from six NASA space flight missions and applied Latent Dirichlet Allocation topic modeling to the document corpus. We analyze the most popular topics within and across missions, and how popular topics changed over the lifetime of a mission. We find that hardware material and flight software issues are common during the integration and testing phase, while ground station software and equipment issues are more common during the operations phase. We identify a number of challenges in topic modeling for trend analysis: 1) the process of selecting the topic modeling parameters lacks definitive guidance, 2) defining semantically meaningful topic labels requires non-trivial effort and domain expertise, 3) topic models derived from the combined corpus of the six missions were biased toward the larger missions, and 4) topics must be semantically distinct as well as cohesive to be useful. Nonetheless, topic modeling can identify problem themes within missions and across mission lifetimes, providing useful feedback to engineers and project managers.


Citations: 22

How the R Community Creates and Curates Knowledge: A Comparative Study of Stack Overflow and Mailing Lists

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2901772

A. Zagalsky, Carlos Gómez Teshima, D. Germán, M. Storey, Germán Poo-Caamaño

One of the many effects of social media in software development is the flourishing of very large communities of practice where members share a common interest, such as programming languages, frameworks, and tools. These communities of practice use many different communication channels but little is known about how these communities create, share, and curate knowledge using such channels. In this paper, we report a qualitative study of how one community of practice—the R software development community—creates and curates knowledge associated with questions and answers (Q&A) in two of its main communication channels: the R-tag in Stack Overflow and the R-users mailing list. The results reveal that knowledge is created and curated in two main forms: participatory, where multiple members explicitly collaborate to build knowledge, and crowdsourced, where individuals work independently of each other. The contribution of this paper is a characterization of knowledge types that are exchanged by these communities of practice, including a description of the reasons why members choose one channel over the other. Finally, this paper enumerates a set of recommendations to assist practitioners in the use of multiple channels for Q&A.


Citations: 36

Judging a Commit by Its Cover: Correlating Commit Message Entropy with Build Status on Travis-CI

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2903493

E. Santos, Abram Hindle

Developers summarize their changes to code in commit messages. When a message seems “unusual”, however, this casts doubt on the quality of the code contained in the commit. We trained n-gram language models and used cross-entropy as an indicator of commit message “unusualness” for over 120,000 commits from open source projects. Build statuses collected from Travis-CI were used as a proxy for code quality. We then compared the distributions of failed and successful commits with regard to the “unusualness” of their commit messages. Our analysis yielded significant results when correlating cross-entropy with build status.
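
As a rough illustration of the cross-entropy measure described in this abstract, the following minimal sketch scores a commit message against a language model trained on other messages. The abstract does not state the n-gram order, tokenization, or smoothing the authors used, so the bigram order, whitespace tokenization, add-one smoothing, and the function names `tokenize`, `train_bigram`, and `cross_entropy` are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(message):
    # Assumption: lowercase whitespace tokens with sentence boundary markers.
    return ["<s>"] + message.lower().split() + ["</s>"]


def train_bigram(messages):
    """Count bigrams and their contexts over a corpus of commit messages."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for msg in messages:
        tokens = tokenize(msg)
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    # Reserve one extra vocabulary slot for unseen tokens.
    return contexts, bigrams, len(vocab) + 1


def cross_entropy(message, model):
    """Average negative log2 probability per token (bits per token)."""
    contexts, bigrams, vocab_size = model
    tokens = tokenize(message)
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, cur in pairs:
        # Add-one (Laplace) smoothing, an assumption of this sketch.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total -= math.log2(p)
    return total / len(pairs)


corpus = [
    "fix null pointer in parser",
    "fix typo in readme",
    "add tests for parser",
]
model = train_bigram(corpus)
usual = cross_entropy("fix typo in parser", model)
unusual = cross_entropy("asdf qwerty zzz", model)
```

Under these assumptions, a message built from word sequences seen in the training corpus receives a lower cross-entropy than one made of unseen tokens, which is the sense in which higher cross-entropy marks a message as more “unusual”.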


Citations: 32

Analysis of Exception Handling Patterns in Java Projects: An Empirical Study

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2903499

Suman Nakshatri, Maithri Hegde, Sahithi Thandra

Exception handling is a powerful tool provided by many programming languages to help developers deal with unforeseen conditions. Java is one of the few programming languages to enforce an additional compilation check on certain subclasses of the Exception class through checked exceptions. As part of this study, empirical data was extracted from software projects developed in Java. The intent is to explore how developers respond to checked exceptions and identify common patterns used by them to deal with exceptions, checked or otherwise. Bloch’s book - “Effective Java” [1] was used as reference for best practices in exception handling - these recommendations were compared against results from the empirical data. Results of this study indicate that most programmers ignore checked exceptions and leave them unnoticed. Additionally, it is observed that classes higher in the exception class hierarchy are more frequently used as compared to specific exception subclasses.


Citations: 33

MUBench: A Benchmark for API-Misuse Detectors

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2903506

Sven Amann, Sarah Nadi, H. Nguyen, T. Nguyen, M. Mezini

Over the last few years, researchers proposed a multitude of automated bug-detection approaches that mine a class of bugs that we call API misuses. Evaluations on a variety of software products show both the omnipresence of such misuses and the ability of the approaches to detect them. This work presents MuBench, a dataset of 89 API misuses that we collected from 33 real-world projects and a survey. With the dataset we empirically analyze the prevalence of API misuses compared to other types of bugs, finding that they are rare, but almost always cause crashes. Furthermore, we discuss how to use it to benchmark and compare API-misuse detectors.


Citations: 78

Addressing Problems with External Validity of Repository Mining Studies Through a Smart Data Platform

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2901753

Fabian Trautsch, S. Herbold, Philip Makedonski, J. Grabowski

Research in software repository mining has grown considerably over the last decade. Due to the data-driven nature of this avenue of investigation, we identified several problems within the current state of the art that pose a threat to the external validity of results. The heavy re-use of data sets in many studies may invalidate the results if problems with the data itself are identified. Moreover, for many studies the data and/or the implementations are not available, which hinders replication of the results and thereby decreases the comparability between studies. Even if all information about the studies is available, the diversity of the tooling used can still make replication very hard. Within this paper, we discuss a potential solution to these problems through a cloud-based platform that integrates data collection and analytics. We created the prototype SmartSHARK that implements our approach. Using SmartSHARK, we collected data from several projects and created different analytic examples. Within this article, we present SmartSHARK and discuss our experiences regarding the use of SmartSHARK and the mentioned problems.


Citations: 24

Multi-extract and Multi-level Dataset of Mozilla Issue Tracking History

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)

Pub Date : 2016-05-14 DOI: 10.1145/2901739.2903502

Jiaxin Zhu, Minghui Zhou, Hong Mei

Many studies analyze issue tracking repositories to understand and support software development. To facilitate the analyses, we share a Mozilla issue tracking dataset covering a 15-year history. The dataset includes three extracts and multiple levels for each extract. The three extracts were retrieved through two channels, a front-end (web user interface (UI)), and a back-end (official database dump) of Mozilla Bugzilla at three different times. The variations (dynamics) among extracts provide space for researchers to reproduce and validate their studies, while revealing potential opportunities for studies that otherwise could not be conducted. We provide different data levels for each extract ranging from raw data to standardized data as well as to the calculated data level for targeting specific research questions. Data retrieving and processing scripts related to each data level are offered too. By employing the multi-level structure, analysts can more efficiently start an inquiry from the standardized level and easily trace the data chain when necessary (e.g., to verify if a phenomenon reflected by the data is an actual event). We applied this dataset to several published studies and intend to expand the multi-level and multi-extract feature to other software engineering datasets.


Citations: 8
