2009 6th IEEE International Working Conference on Mining Software Repositories: Latest Publications

MapReduce as a general framework to support research in Mining Software Repositories (MSR)

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069477

Weiyi Shang, Z. Jiang, Bram Adams, A. Hassan

Researchers continue to demonstrate the benefits of Mining Software Repositories (MSR) for supporting software development and research activities. However, because the mining process is time- and resource-intensive, they often create their own distributed platforms and use various optimizations to speed up and scale up their analysis. These platforms are project-specific, hard to reuse, and offer minimal debugging and deployment support. In this paper, we propose the use of MapReduce, a distributed computing platform, to support research in MSR. As a proof of concept, we migrate J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open-source implementation of MapReduce. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we demonstrate that the migration effort to MapReduce is minimal and that the benefits are significant, as the running time of the migrated J-REX is only 30% to 50% of that of the original J-REX. This paper documents our experience with the migration, and highlights the benefits and challenges of the MapReduce framework in the MSR community.
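
To make the MapReduce model named in this abstract concrete, the following is a minimal in-process sketch of the map, shuffle and reduce steps applied to a toy repository-mining task (counting how often each file changes). It is not J-REX and does not use the Hadoop API; the change log, the function names and the choice of task are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input: one (revision_id, changed_file) pair per change-log entry.
CHANGE_LOG = [
    ("r1", "core/Parser.java"),
    ("r2", "core/Parser.java"),
    ("r2", "ui/View.java"),
    ("r3", "core/Parser.java"),
]


def map_phase(record):
    """Emit one (file, 1) pair for each change-log entry."""
    _revision_id, path = record
    yield path, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the per-file counts produced by the mappers."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce sequentially in a single process."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(CHANGE_LOG))
```

In a real deployment the framework would run many mapper and reducer instances in parallel across machines; the sketch only shows how a mining task decomposes into the two user-supplied functions.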

Citations: 53

Assigning bug reports using a vocabulary-based expertise model of developers

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069491

Dominique Matter, Adrian Kuhn, Oscar Nierstrasz

For popular software systems, the number of bug reports submitted daily is high. Triaging these incoming reports is a time-consuming task. Part of bug triage is the assignment of a report to a developer with the appropriate expertise. In this paper, we present an approach to automatically suggest developers who have the appropriate expertise for handling a bug report. We model developer expertise using the vocabulary found in their source code contributions and compare this vocabulary to the vocabulary of bug reports. We evaluate our approach by comparing the suggested experts to the persons who eventually worked on the bug. Using eight years of Eclipse development as a case study, we achieve 33.6% top-1 precision and 71.0% top-10 recall.
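
The idea of matching a bug report's vocabulary against each developer's contribution vocabulary can be sketched as follows. This is an illustration only: the abstract does not name the similarity measure, so cosine similarity over raw term counts is an assumption, and the developer names, texts and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical vocabulary extracted from each developer's source code contributions.
CONTRIBUTIONS = {
    "alice": "parser token grammar parser syntax tree",
    "bob": "button widget layout render widget color",
}


def suggest(bug_report, contributions, top_k=1):
    """Rank developers by vocabulary similarity to the bug report text."""
    report_vec = Counter(tokenize(bug_report))
    scores = {
        dev: cosine(report_vec, Counter(tokenize(text)))
        for dev, text in contributions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    print(suggest("Parser crashes on malformed grammar token", CONTRIBUTIONS))
```

Evaluating such a ranking against the developers who eventually fixed each bug gives the top-k precision and recall figures the abstract reports.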

Citations: 229

Mining search topics from a code search engine usage log

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069489

S. Bajracharya, C. Lopes

We present a topic modeling analysis of a year-long usage log of Koders, one of the major commercial code search engines. This analysis contributes to the understanding of what users of code search engines are looking for. Observations on the prevalence of these topics among the users, and on how search and download activities vary across topics, lead to the conclusion that users who find code search engines usable are those who already know to a high level of specificity what to look for. This paper presents a general categorization of these topics that provides insights on the different ways code search engine users express their queries. The findings support the conclusion that existing code search engines address only a subset of the various information needs of the users when compared to the categories of queries they look at.

Citations: 54

On what basis to recommend: Changesets or interactions?

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069494

Sarah Rastkar, G. Murphy

Different flavours of recommendation systems have been proposed to help software developers perform software evolution tasks. A number of these recommendation systems are based on changesets. When changeset information is used, recommendations are based on only the end result of the activity undertaken to complete a task. In this paper, we report on an investigation of how recommendations based on changesets compare to recommendations based on interactions collected as a programmer performed the task that resulted in a changeset. To provide a common basis for the comparison, our investigation considered how bug reports considered similar based on changeset information compare to bug reports considered similar based on interaction information. We found that there is no direct relationship between the bug reports found similar with the different methods, suggesting that each comparison method captures a different aspect of the problem.

Citations: 8

Learning from defect removals

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069500

N. Ayewah, W. Pugh

Recent research has tried to identify changes in source code repositories that fix bugs by linking these changes to reports in issue tracking systems. These changes have been traced back to the point in time when they were previously modified as a way of identifying bug-introducing changes. But we observe that not all changes linked to bug tracking systems are fixing bugs; some are enhancing the code. Furthermore, not all fixes are applied at the point in the code where the bug was originally introduced. We flesh out these observations with a manual review of several software projects, and use this opportunity to see how many defects are in the scope of static analysis tools.

Citations: 7

Mining the Jazz repository: Challenges and opportunities

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069495

Kim Herzig, A. Zeller

By integrating various development and collaboration tools into one single platform, the Jazz environment offers several opportunities for software repository miners. In particular, Jazz offers full traceability from the initial requirements via work packages and work assignments to the final changes and tests; all these features can be easily accessed and leveraged for better prediction and recommendation systems. In this paper, we share our initial experiences from mining the Jazz repository. We also give a short overview of the retrieved data sets and discuss possible problems of the Jazz repository and the platform itself.

Citations: 18

Does calling structure information improve the accuracy of fault prediction?

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069481

Yonghee Shin, Robert M. Bell, T. Ostrand, E. Weyuker

Previous studies have shown that software code attributes, such as lines of source code, and history information, such as the number of code changes and the number of faults in prior releases of software, are useful for predicting where faults will occur. In this study of an industrial software system, we investigate the effectiveness of adding information about calling structure to fault prediction models. The addition of calling structure information to a model based solely on non-calling structure code attributes provided noticeable improvement in prediction accuracy, but only marginally improved the best model based on history and non-calling structure code attributes. The best model based on history and non-calling structure code attributes outperformed the best model based on calling and non-calling structure code attributes.

Citations: 44

Using Latent Dirichlet Allocation for automatic categorization of software

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069496

Kai Tian, Meghan Revelle, D. Poshyvanyk

In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation, an information retrieval method which is used to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages such as C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, which is a promising improvement as compared to manual categorization and existing techniques.

Citations: 170

Using association rules to study the co-evolution of production & test code

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069493

Z. Lubsen, A. Zaidman, M. Pinzger

Unit tests are generally acknowledged as an important aid to produce high-quality code, as they provide quick feedback to developers on the correctness of their code. In order to achieve high quality, well-maintained tests are needed. Ideally, tests co-evolve with the production code to test changes as soon as possible. In this paper, we explore an approach based on association rule mining to determine whether production and test code co-evolve synchronously. Through two case studies, one with an open source and another one with an industrial software system, we show that our association rule mining approach allows one to assess the co-evolution of production and test code in a software project and, moreover, to uncover the distribution of programmer effort over pure coding, pure testing, or a more test-driven-like practice.
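
As a rough illustration of how association rule metrics can express co-evolution, the sketch below treats each commit as a transaction and computes the support and confidence of the rule "production code changed, therefore test code changed". The commit data, the `test/` path convention and the function names are hypothetical; the abstract does not describe the authors' actual rule set or tooling.

```python
# Hypothetical commits, each represented as the set of paths it touched.
COMMITS = [
    {"src/Order.java", "test/OrderTest.java"},
    {"src/Order.java"},
    {"src/Cart.java", "test/CartTest.java"},
    {"test/CartTest.java"},
    {"src/Cart.java", "src/Order.java"},
]


def kinds(commit):
    """Label a commit with the kinds of code it touched: 'prod' and/or 'test'."""
    return {"test" if path.startswith("test/") else "prod" for path in commit}


def rule_metrics(commits, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Assumes a non-empty list of commits.
    """
    labelled = [kinds(c) for c in commits]
    n_ante = sum(1 for k in labelled if antecedent in k)
    n_both = sum(1 for k in labelled if antecedent in k and consequent in k)
    support = n_both / len(labelled)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


if __name__ == "__main__":
    support, confidence = rule_metrics(COMMITS, "prod", "test")
    print(f"prod -> test: support={support:.2f}, confidence={confidence:.2f}")
```

A high confidence for this rule would suggest that tests tend to change together with production code, while a low value points toward separate coding and testing phases.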

Citations: 37

On mining data across software repositories

2009 6th IEEE International Working Conference on Mining Software Repositories

Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069498

P. Anbalagan, M. Vouk

Software repositories provide an abundance of valuable information about open source projects. With the increase in the size of the data maintained by the repositories, automated extraction of such data from individual repositories, as well as of linked information across repositories, has become a necessity. In this paper we describe a framework that uses web scraping to automatically mine repositories and link information across repositories. We discuss two implementations of the framework. In the first implementation, we automatically identify and collect security problem reports from project repositories that deploy the Bugzilla bug tracker using related vulnerability information from the National Vulnerability Database. In the second, we collect security problem reports for projects that deploy the Launchpad bug tracker along with related vulnerability information from the National Vulnerability Database. We have evaluated our tool on various releases of Fedora, Ubuntu, Suse, RedHat, and Firefox projects. The percentage of security bugs identified using our tool is consistent with that reported by other researchers.
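
One plausible way to link bug reports to vulnerability records is to match shared identifiers. The sketch below assumes that reports mention CVE identifiers in their text and joins them against a small in-memory table; the abstract does not state that the framework links records this way, and the bug texts, the made-up CVE identifiers, the severity labels and the function name are all hypothetical. The scraping step itself is omitted.

```python
import re

# CVE identifiers have the form CVE-<year>-<sequence number of 4 or more digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

# Hypothetical bug report texts keyed by bug id (identifiers below are made up).
BUG_REPORTS = {
    101: "Buffer overflow in image loader, see CVE-1900-0001",
    102: "Typo in preferences dialog",
    103: "Fix for CVE-1900-0001 and CVE-1900-0002 in URL parser",
}

# Hypothetical vulnerability records keyed by CVE identifier.
NVD_ENTRIES = {"CVE-1900-0001": "high", "CVE-1900-0002": "medium"}


def link_reports(reports, nvd):
    """Map each bug id to the sorted list of known CVE ids mentioned in its text."""
    links = {}
    for bug_id, text in reports.items():
        mentioned = sorted(set(CVE_PATTERN.findall(text)))
        known = [cve for cve in mentioned if cve in nvd]
        if known:
            links[bug_id] = known
    return links


if __name__ == "__main__":
    print(link_reports(BUG_REPORTS, NVD_ENTRIES))
```

Reports with at least one linked identifier can then be counted as security problem reports, which is the kind of percentage the abstract compares against other studies.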

Citations: 18
