2012 9th IEEE Working Conference on Mining Software Repositories (MSR): latest articles


Developing an h-index for OSS developers

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224288

A. Capiluppi, Alexander Serebrenik, A. Youssef

The public data available in Open Source Software (OSS) repositories has been used for many practical reasons: detecting community structures; identifying key roles among developers; understanding software quality; predicting the arousal of bugs in large OSS systems, and so on; but also to formulate and validate new metrics and proof-of-concepts on general, non-OSS specific, software engineering aspects. One of the results that has not emerged yet from the analysis of OSS repositories is how to help the “career advancement” of developers: given the available data on products and processes used in OSS development, it should be possible to produce measurements to identify and describe a developer, that could be used externally as a measure of recognition and experience. This paper builds on top of the h-index, used in academic contexts, and which is used to determine the recognition of a researcher among her peers. By creating similar indices for OSS (or any) developers, this work could help defining a baseline for measuring and comparing the contributions of OSS developers in an objective, open and reproducible way.
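As a point of reference, the classic academic h-index that the abstract builds on can be sketched in a few lines. The abstract does not say how the index is adapted to developers, so the input below (`reuse_counts`, a count attached to each of a developer's contributions) is a purely hypothetical stand-in and not the paper's definition.

```python
def h_index(counts):
    """Return the largest h such that at least h items have a count >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Hypothetical data: one count per contribution of a single developer.
reuse_counts = [10, 8, 5, 4, 3, 0]
print(h_index(reuse_counts))  # prints 4
```

Four contributions have a count of at least 4, but there are not five with a count of at least 5, so the index is 4.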


Citations: 16

A Linked Data platform for mining software repositories

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224296

I. Keivanloo, C. Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, J. Rilling

The mining of software repositories involves the extraction of both basic and value-added information from existing software repositories. The repositories will be mined to extract facts by different stakeholders (e.g. researchers, managers) and for various purposes. To avoid unnecessary pre-processing and analysis steps, sharing and integration of both basic and value-added facts are needed. In this research, we introduce SeCold, an open and collaborative platform for sharing software datasets. SeCold provides the first online software ecosystem Linked Data platform that supports data extraction and on-the-fly inter-dataset integration from major version control, issue tracking, and quality evaluation systems. In its first release, the dataset contains about two billion facts, such as source code statements, software licenses, and code clones from 18 000 software projects. In its second release the SeCold project will contain additional facts mined from issue trackers and versioning systems. Our approach is based on the same fundamental principle as Wikipedia: researchers and tool developers share analysis results obtained from their tools by publishing them as part of the SeCold portal and therefore make them an integrated part of the global knowledge domain. The SeCold project is an official member of the Linked Data dataset cloud and is currently the eighth largest online dataset available on the Web.


Citations: 59

What does software engineering community microblog about?

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224287

Yuan Tian, Palakorn Achananuparp, Nelman Lubis Ibrahim, D. Lo, Ee-Peng Lim

Microblogging is a new trend to communicate and to disseminate information. One microblog post could potentially reach millions of users. Millions of microblogs are generated on a daily basis on popular sites such as Twitter. The popularity of microblogging among programmers, software engineers, and software users has also led to their use of microblogs to communicate software engineering issues apart from using emails and other traditional communication channels. Understanding how millions of users use microblogs in software engineering related activities would shed light on ways we could leverage the fast evolving microblogging content to aid software development efforts. In this work, we perform a preliminary study on what the software engineering community microblogs about. We analyze the content of microblogs from Twitter and categorize the types of microblogs that are posted. We investigate the relative popularity of each category of microblogs. We also investigate what kinds of microblogs are diffused more widely in the Twitter network via the “retweet” feature. Our experiments show that microblogs commonly contain job openings, news, questions and answers, or links to download new tools and code. We find that microblogs concerning real-world events are more widely diffused in the Twitter network.


Citations: 51

Explaining software defects using topic models

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224280

T. Chen, Stephen W. Thomas, M. Nagappan, A. Hassan

Researchers have proposed various metrics based on measurable aspects of the source code entities (e.g., methods, classes, files, or modules) and the social structure of a software project in an effort to explain the relationships between software development and software defects. However, these metrics largely ignore the actual functionality, i.e., the conceptual concerns, of a software system, which are the main technical concepts that reflect the business logic or domain of the system. For instance, while lines of code may be a good general measure for defects, a large entity responsible for simple I/O tasks is likely to have fewer defects than a small entity responsible for complicated compiler implementation details. In this paper, we study the effect of conceptual concerns on code quality. We use a statistical topic modeling technique to approximate software concerns as topics; we then propose various metrics on these topics to help explain the defect-proneness (i.e., quality) of the entities. Paramount to our proposed metrics is that they take into account the defect history of each topic. Case studies on multiple versions of Mozilla Firefox, Eclipse, and Mylyn show that (i) some topics are much more defect-prone than others, (ii) defect-prone topics tend to remain so over time, and (iii) defect-prone topics provide additional explanatory power for code quality over existing structural and historical metrics.


Citations: 80

Trendy bugs: Topic trends in the Android bug reports

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224268

Lee Martie, Vijay Krishna Palepu, Hitesh Sajnani, C. Lopes

Studying vast volumes of bug and issue discussions can give an understanding of what the community has been most concerned about, however the magnitude of documents can overload the analyst. We present an approach to analyze the development of the Android open source project by observing trends in the bug discussions in the Android open source project public issue tracker. This informs us of the features or parts of the project that are more problematic at any given point of time. In turn, this can be used to aid resource allocation (such as time and man power) to parts or features. We support these ideas by presenting the results of issue topic distributions over time using statistical analysis of the bug descriptions and comments for the Android open source project. Furthermore, we show relationships between those time distributions and major development releases of the Android OS.


Citations: 33

Think locally, act globally: Improving defect and effort prediction models

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224300

Nicolas Bettenburg, M. Nagappan, A. Hassan

Much research energy in software engineering is focused on the creation of effort and defect prediction models. Such models are important means for practitioners to judge their current project situation, optimize the allocation of their resources, and make informed future decisions. However, software engineering data contains a large amount of variability. Recent research demonstrates that such variability leads to poor fits of machine learning models to the underlying data, and suggests splitting datasets into more fine-grained subsets with similar properties. In this paper, we present a comparison of three different approaches for creating statistical regression models to model and predict software defects and development effort. Global models are trained on the whole dataset. In contrast, local models are trained on subsets of the dataset. Last, we build a global model that takes into account local characteristics of the data. We evaluate the performance of these three approaches in a case study on two defect and two effort datasets. We find that for both types of data, local models show a significantly increased fit to the data compared to global models. The substantial improvements in both relative and absolute prediction errors demonstrate that this increased goodness of fit is valuable in practice. Finally, our experiments suggest that trends obtained from global models are too general for practical recommendations. At the same time, local models provide a multitude of trends which are only valid for specific subsets of the data. Instead, we advocate the use of trends obtained from global models that take into account local characteristics, as they combine the best of both worlds.
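The contrast between a global model and local models can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses plain one-variable least squares rather than whatever regression technique the paper applies, the two subsets are given by hand rather than found by clustering, the data points are invented, and the third approach (a global model that accounts for local characteristics) is not shown.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def squared_error(points, model):
    """Sum of squared residuals of a fitted (a, b) model on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points)


# Hypothetical (predictor, outcome) observations from two subsets that
# follow opposite trends.
subset_a = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0)]
subset_b = [(1, 8.0), (2, 6.0), (3, 4.0), (4, 2.0)]
everything = subset_a + subset_b

# Global model: one fit on the whole dataset.
global_error = squared_error(everything, fit_line(everything))

# Local models: one fit per subset, errors added up.
local_error = sum(squared_error(s, fit_line(s)) for s in (subset_a, subset_b))

print(global_error > local_error)
```

Because the two subsets trend in opposite directions, the single global line fits neither well, while each local line fits its own subset closely. This mirrors the abstract's observation that local models fit better but yield trends valid only for their subset.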


Citations: 132

GHTorrent: Github's data from a firehose

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224294

Georgios Gousios, D. Spinellis

A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects' repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable off line mirror of GitHub's event streams and persistent data, and offer it to the research community as a service. In this paper, we present the project's design and initial implementation and demonstrate how the provided datasets can be queried and processed.


Citations: 261

A qualitative study on performance bugs

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224281

Shahed Zaman, Bram Adams, A. Hassan

Software performance is one of the important qualities that makes software stand out in a competitive market. However, in earlier work we found that performance bugs take more time to fix, need to be fixed by more experienced developers and require changes to more code than non-performance bugs. In order to be able to improve the resolution of performance bugs, a better understanding is needed of the current practice and shortcomings of reporting, reproducing, tracking and fixing performance bugs. This paper qualitatively studies a random sample of 400 performance and non-performance bug reports of Mozilla Firefox and Google Chrome across four dimensions (Impact, Context, Fix and Fix validation). We found that developers and users face problems in reproducing performance bugs and have to spend more time discussing performance bugs than other kinds of bugs. Sometimes performance regressions are tolerated as a tradeoff to improve something else.


Citations: 128

App store mining and analysis: MSR for app stores

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224306

M. Harman, Yue Jia, Yuanyuan Zhang

This paper introduces app store mining and analysis as a form of software repository mining. Unlike other software repositories traditionally used in MSR work, app stores usually do not provide source code. However, they do provide a wealth of other information in the form of pricing and customer reviews. Therefore, we use data mining to extract feature information, which we then combine with more readily available information to analyse apps' technical, customer and business aspects. We applied our approach to the 32,108 non-zero priced apps available in the Blackberry app store in September 2011. Our results show that there is a strong correlation between customer rating and the rank of app downloads, though perhaps surprisingly, there is no correlation between price and downloads, nor between price and rating. More importantly, we show that these correlation findings carry over to (and are even occasionally enhanced within) the space of data mined app features, providing evidence that our `App store MSR' approach can be valuable to app developers.
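A correlation between a rating and a download rank can be measured with a rank correlation coefficient. The abstract does not name the statistic used, so the sketch below assumes Spearman's rank correlation (Pearson correlation computed on ranks, with ties sharing their average rank), and the sample values are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical sample: one rating and one download figure per app.
ratings = [4.5, 3.0, 5.0, 2.0, 4.0]
downloads = [900, 300, 1200, 250, 100]
print(round(spearman(ratings, downloads), 3))
```

A value near 1 or -1 indicates a strong monotonic relationship and a value near 0 indicates none. Working on ranks makes the measure insensitive to the scale of prices or download counts.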


Citations: 348

Mining usage data and development artifacts

2012 9th IEEE Working Conference on Mining Software Repositories (MSR)

Pub Date : 2012-06-02 DOI: 10.1109/MSR.2012.6224305

Olga Baysal, Reid Holmes, Michael W. Godfrey

Software repository mining techniques generally focus on analyzing, unifying, and querying different kinds of development artifacts, such as source code, version control meta-data, defect tracking data, and electronic communication. In this work, we demonstrate how adding real-world usage data enables addressing broader questions of how software systems are actually used in practice, and by inference how development characteristics ultimately affect deployment, adoption, and usage. In particular, we explore how usage data that has been extracted from web server logs can be unified with product release history to study questions that concern both users' detailed dynamic behaviour as well as broad adoption trends across different deployment environments. To validate our approach, we performed a study of two open source web browsers: Firefox and Chrome. We found that while Chrome is being adopted at a consistent rate across platforms, Linux users have an order of magnitude higher rate of Firefox adoption. Also, Firefox adoption has been concentrated mainly in North America, while Chrome users appear to be more evenly distributed across the globe. Finally, we detected no evidence in age-specific differences in navigation behaviour among Chrome and Firefox users; however, we hypothesize that younger users are more likely to have more up-to-date versions than more mature users.


Citations: 19
