Developing an h-index for OSS developers
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224288
A. Capiluppi, Alexander Serebrenik, A. Youssef
The public data available in Open Source Software (OSS) repositories has been used for many practical purposes: detecting community structures, identifying key roles among developers, understanding software quality, predicting the emergence of bugs in large OSS systems, and so on, but also to formulate and validate new metrics and proofs of concept for general, non-OSS-specific software engineering questions. One result that has not yet emerged from the analysis of OSS repositories is how to support the “career advancement” of developers: given the available data on the products and processes of OSS development, it should be possible to produce measurements that identify and describe a developer and that could be used externally as a measure of recognition and experience. This paper builds on the h-index, which is used in academic contexts to gauge the recognition of a researcher among her peers. By creating similar indices for OSS (or any) developers, this work could help define a baseline for measuring and comparing the contributions of OSS developers in an objective, open and reproducible way.
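As a rough illustration of the kind of index the paper proposes, the sketch below computes an h-index-style score for a developer from per-project contribution counts: the largest h such that the developer has at least h commits in each of at least h projects. The commit-count data and the choice of "commits per project" as the unit of contribution are assumptions for illustration, not the paper's actual metric.

```python
def developer_h_index(commits_per_project):
    """Largest h such that the developer has >= h commits in each of >= h projects.

    `commits_per_project` is a hypothetical mapping from project name to the
    number of commits the developer authored there; the real index proposed in
    the paper may weight contributions differently.
    """
    counts = sorted(commits_per_project.values(), reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Example with made-up data: three projects with at least three commits each -> h = 3.
print(developer_h_index({"gedit": 120, "gnome-shell": 15, "evince": 3, "nautilus": 1}))
```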
{"title":"Developing an h-index for OSS developers","authors":"A. Capiluppi, Alexander Serebrenik, A. Youssef","doi":"10.1109/MSR.2012.6224288","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224288","url":null,"abstract":"The public data available in Open Source Software (OSS) repositories has been used for many practical reasons: detecting community structures; identifying key roles among developers; understanding software quality; predicting the arousal of bugs in large OSS systems, and so on; but also to formulate and validate new metrics and proof-of-concepts on general, non-OSS specific, software engineering aspects. One of the results that has not emerged yet from the analysis of OSS repositories is how to help the “career advancement” of developers: given the available data on products and processes used in OSS development, it should be possible to produce measurements to identify and describe a developer, that could be used externally as a measure of recognition and experience. This paper builds on top of the h-index, used in academic contexts, and which is used to determine the recognition of a researcher among her peers. By creating similar indices for OSS (or any) developers, this work could help defining a baseline for measuring and comparing the contributions of OSS developers in an objective, open and reproducible way.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133482484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Linked Data platform for mining software repositories
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224296
I. Keivanloo, C. Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, J. Rilling
The mining of software repositories involves the extraction of both basic and value-added information from existing software repositories. Repositories are mined for facts by different stakeholders (e.g., researchers, managers) and for various purposes. To avoid unnecessary pre-processing and analysis steps, both basic and value-added facts need to be shared and integrated. In this research, we introduce SeCold, an open and collaborative platform for sharing software datasets. SeCold provides the first online software ecosystem Linked Data platform that supports data extraction and on-the-fly inter-dataset integration from major version control, issue tracking, and quality evaluation systems. In its first release, the dataset contains about two billion facts, such as source code statements, software licenses, and code clones from 18,000 software projects. In its second release, the SeCold project will contain additional facts mined from issue trackers and versioning systems. Our approach is based on the same fundamental principle as Wikipedia: researchers and tool developers share the analysis results obtained from their tools by publishing them on the SeCold portal, thereby making them an integrated part of the global knowledge domain. The SeCold project is an official member of the Linked Data dataset cloud and is currently the eighth largest online dataset available on the Web.
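A Linked Data platform of this kind is typically queried over SPARQL. The sketch below posts a query to a SPARQL endpoint with plain HTTP; the endpoint URL and the predicate names are placeholders, since the actual SeCold vocabulary and service address are not given here.

```python
import requests

# Hypothetical endpoint and vocabulary; substitute the real SeCold service and predicates.
ENDPOINT = "http://example.org/secold/sparql"
QUERY = """
SELECT ?project ?license WHERE {
  ?project <http://example.org/secold/hasLicense> ?license .
} LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# Standard SPARQL JSON results: one binding per row, one entry per selected variable.
for binding in response.json()["results"]["bindings"]:
    print(binding["project"]["value"], binding["license"]["value"])
```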
{"title":"A Linked Data platform for mining software repositories","authors":"I. Keivanloo, C. Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, J. Rilling","doi":"10.1109/MSR.2012.6224296","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224296","url":null,"abstract":"The mining of software repositories involves the extraction of both basic and value-added information from existing software repositories. The repositories will be mined to extract facts by different stakeholders (e.g. researchers, managers) and for various purposes. To avoid unnecessary pre-processing and analysis steps, sharing and integration of both basic and value-added facts are needed. In this research, we introduce SeCold, an open and collaborative platform for sharing software datasets. SeCold provides the first online software ecosystem Linked Data platform that supports data extraction and on-the-fly inter-dataset integration from major version control, issue tracking, and quality evaluation systems. In its first release, the dataset contains about two billion facts, such as source code statements, software licenses, and code clones from 18 000 software projects. In its second release the SeCold project will contain additional facts mined from issue trackers and versioning systems. Our approach is based on the same fundamental principle as Wikipedia: researchers and tool developers share analysis results obtained from their tools by publishing them as part of the SeCold portal and therefore make them an integrated part of the global knowledge domain. The SeCold project is an official member of the Linked Data dataset cloud and is currently the eighth largest online dataset available on the Web.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126589453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What does software engineering community microblog about?
Yuan Tian, Palakorn Achananuparp, Nelman Lubis Ibrahim, D. Lo, Ee-Peng Lim
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224287
Microblogging is a recent trend in communicating and disseminating information. A single microblog post can potentially reach millions of users, and millions of microblogs are generated daily on popular sites such as Twitter. The popularity of microblogging among programmers, software engineers, and software users has also led them to use microblogs, alongside emails and other traditional communication channels, to discuss software engineering issues. Understanding how millions of users employ microblogs in software-engineering-related activities would shed light on ways we could leverage the fast-evolving microblogging content to aid software development efforts. In this work, we perform a preliminary study of what the software engineering community microblogs about. We analyze the content of microblogs from Twitter and categorize the types of microblogs that are posted. We investigate the relative popularity of each category of microblogs, and we also investigate which kinds of microblogs are diffused more widely in the Twitter network via the “retweet” feature. Our experiments show that microblogs commonly contain job openings, news, questions and answers, or links to download new tools and code. We find that microblogs concerning real-world events are more widely diffused in the Twitter network.
Explaining software defects using topic models
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224280
T. Chen, Stephen W. Thomas, M. Nagappan, A. Hassan
Researchers have proposed various metrics based on measurable aspects of the source code entities (e.g., methods, classes, files, or modules) and the social structure of a software project in an effort to explain the relationships between software development and software defects. However, these metrics largely ignore the actual functionality, i.e., the conceptual concerns, of a software system, which are the main technical concepts that reflect the business logic or domain of the system. For instance, while lines of code may be a good general measure for defects, a large entity responsible for simple I/O tasks is likely to have fewer defects than a small entity responsible for complicated compiler implementation details. In this paper, we study the effect of conceptual concerns on code quality. We use a statistical topic modeling technique to approximate software concerns as topics; we then propose various metrics on these topics to help explain the defect-proneness (i.e., quality) of the entities. Paramount to our proposed metrics is that they take into account the defect history of each topic. Case studies on multiple versions of Mozilla Firefox, Eclipse, and Mylyn show that (i) some topics are much more defect-prone than others, (ii) defect-prone topics tend to remain so over time, and (iii) defect-prone topics provide additional explanatory power for code quality over existing structural and historical metrics.
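For readers who want a feel for the mechanics, the sketch below fits an LDA topic model over the text of source entities and computes a simple "defect mass per topic" score by weighting each entity's historical defect count by its topic memberships. The corpus, defect counts, and the particular score are illustrative assumptions, not the metrics defined in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: identifiers/comments extracted from source entities (made-up data).
documents = [
    "parser token grammar syntax tree compile",
    "socket connect timeout retry network buffer",
    "render widget layout paint font color",
    "parser compile optimize bytecode register",
]
defects_per_entity = np.array([5, 1, 0, 7])  # hypothetical defect counts per entity

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)  # entity-by-topic membership matrix

# Simple topic metric: defect mass attributed to each topic, weighted by membership.
topic_defect_score = theta.T @ defects_per_entity
for k, score in enumerate(topic_defect_score):
    print(f"topic {k}: defect score {score:.2f}")
```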
{"title":"Explaining software defects using topic models","authors":"T. Chen, Stephen W. Thomas, M. Nagappan, A. Hassan","doi":"10.1109/MSR.2012.6224280","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224280","url":null,"abstract":"Researchers have proposed various metrics based on measurable aspects of the source code entities (e.g., methods, classes, files, or modules) and the social structure of a software project in an effort to explain the relationships between software development and software defects. However, these metrics largely ignore the actual functionality, i.e., the conceptual concerns, of a software system, which are the main technical concepts that reflect the business logic or domain of the system. For instance, while lines of code may be a good general measure for defects, a large entity responsible for simple I/O tasks is likely to have fewer defects than a small entity responsible for complicated compiler implementation details. In this paper, we study the effect of conceptual concerns on code quality. We use a statistical topic modeling technique to approximate software concerns as topics; we then propose various metrics on these topics to help explain the defect-proneness (i.e., quality) of the entities. Paramount to our proposed metrics is that they take into account the defect history of each topic. Case studies on multiple versions of Mozilla Firefox, Eclipse, and Mylyn show that (i) some topics are much more defect-prone than others, (ii) defect-prone topics tend to remain so over time, and (iii) defect-prone topics provide additional explanatory power for code quality over existing structural and historical metrics.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"312 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133271127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trendy bugs: Topic trends in the Android bug reports
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224268
Lee Martie, Vijay Krishna Palepu, Hitesh Sajnani, C. Lopes
Studying vast volumes of bug and issue discussions can reveal what the community has been most concerned about; however, the sheer volume of documents can overwhelm the analyst. We present an approach to analyzing the development of the Android open source project by observing trends in the bug discussions in its public issue tracker. This tells us which features or parts of the project are more problematic at any given point in time, which in turn can be used to aid the allocation of resources (such as time and manpower) to those parts or features. We support these ideas by presenting issue topic distributions over time, derived through statistical analysis of the bug descriptions and comments of the Android open source project. Furthermore, we show relationships between those time distributions and major development releases of the Android OS.
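As an illustration of how such trends can be computed, the sketch below buckets bug reports by month and tracks the share of reports mentioning a given topic keyword; the keyword list and the report data are assumptions for illustration, since the paper derives topics statistically rather than from fixed keywords.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (date, text) pairs; real data would come from the Android issue tracker.
reports = [
    (date(2010, 11, 3), "bluetooth pairing fails after update"),
    (date(2010, 11, 20), "camera preview freezes"),
    (date(2010, 12, 5), "bluetooth headset audio stutters"),
    (date(2010, 12, 9), "wifi drops when screen turns off"),
]
topics = ["bluetooth", "camera", "wifi"]

monthly_totals = Counter()
monthly_topic_counts = defaultdict(Counter)
for day, text in reports:
    month = (day.year, day.month)
    monthly_totals[month] += 1
    for topic in topics:
        if topic in text.lower():
            monthly_topic_counts[month][topic] += 1

# Share of that month's reports touching each topic.
for month in sorted(monthly_totals):
    shares = {t: monthly_topic_counts[month][t] / monthly_totals[month] for t in topics}
    print(month, shares)
```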
{"title":"Trendy bugs: Topic trends in the Android bug reports","authors":"Lee Martie, Vijay Krishna Palepu, Hitesh Sajnani, C. Lopes","doi":"10.1109/MSR.2012.6224268","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224268","url":null,"abstract":"Studying vast volumes of bug and issue discussions can give an understanding of what the community has been most concerned about, however the magnitude of documents can overload the analyst. We present an approach to analyze the development of the Android open source project by observing trends in the bug discussions in the Android open source project public issue tracker. This informs us of the features or parts of the project that are more problematic at any given point of time. In turn, this can be used to aid resource allocation (such as time and man power) to parts or features. We support these ideas by presenting the results of issue topic distributions over time using statistical analysis of the bug descriptions and comments for the Android open source project. Furthermore, we show relationships between those time distributions and major development releases of the Android OS.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124910156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Think locally, act globally: Improving defect and effort prediction models
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224300
Nicolas Bettenburg, M. Nagappan, A. Hassan
Much research energy in software engineering is focused on the creation of effort and defect prediction models. Such models are important means for practitioners to judge their current project situation, optimize the allocation of their resources, and make informed future decisions. However, software engineering data contains a large amount of variability. Recent research demonstrates that such variability leads to poor fits of machine learning models to the underlying data, and suggests splitting datasets into more fine-grained subsets with similar properties. In this paper, we present a comparison of three different approaches for creating statistical regression models to model and predict software defects and development effort. Global models are trained on the whole dataset. In contrast, local models are trained on subsets of the dataset. Third, we build a global model that takes into account local characteristics of the data. We evaluate the performance of these three approaches in a case study on two defect and two effort datasets. We find that for both types of data, local models show a significantly increased fit to the data compared to global models. The substantial improvements in both relative and absolute prediction errors demonstrate that this increased goodness of fit is valuable in practice. Finally, our experiments suggest that trends obtained from global models are too general for practical recommendations, while local models provide a multitude of trends that are valid only for specific subsets of the data. We therefore advocate the use of trends obtained from global models that take into account local characteristics, as they combine the best of both worlds.
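The sketch below illustrates the contrast between global and local models on synthetic data: one linear regression is fit on the whole dataset, and one per cluster found by k-means, with mean absolute error compared on held-out rows. The data, clustering choice, and model family are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic "defect data" with two regimes, so a single global fit is a compromise.
size = rng.uniform(0, 10, 400)   # e.g. module size
churn = rng.uniform(0, 1, 400)   # e.g. code churn
X = np.column_stack([size, churn])
y = np.where(size < 5, 5 * churn, -5 * churn + 20) + rng.normal(0, 0.5, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Global model: a single fit over all training data.
global_model = LinearRegression().fit(X_tr, y_tr)
print("global MAE:", mean_absolute_error(y_te, global_model.predict(X_te)))

# Local models: cluster the training data, then fit one model per cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)
local_models = [LinearRegression().fit(X_tr[kmeans.labels_ == c], y_tr[kmeans.labels_ == c])
                for c in range(2)]
clusters_te = kmeans.predict(X_te)
local_pred = np.array([local_models[c].predict(row.reshape(1, -1))[0]
                       for c, row in zip(clusters_te, X_te)])
print("local MAE:", mean_absolute_error(y_te, local_pred))
```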
{"title":"Think locally, act globally: Improving defect and effort prediction models","authors":"Nicolas Bettenburg, M. Nagappan, A. Hassan","doi":"10.1109/MSR.2012.6224300","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224300","url":null,"abstract":"Much research energy in software engineering is focused on the creation of effort and defect prediction models. Such models are important means for practitioners to judge their current project situation, optimize the allocation of their resources, and make informed future decisions. However, software engineering data contains a large amount of variability. Recent research demonstrates that such variability leads to poor fits of machine learning models to the underlying data, and suggests splitting datasets into more fine-grained subsets with similar properties. In this paper, we present a comparison of three different approaches for creating statistical regression models to model and predict software defects and development effort. Global models are trained on the whole dataset. In contrast, local models are trained on subsets of the dataset. Last, we build a global model that takes into account local characteristics of the data. We evaluate the performance of these three approaches in a case study on two defect and two effort datasets. We find that for both types of data, local models show a significantly increased fit to the data compared to global models. The substantial improvements in both relative and absolute prediction errors demonstrate that this increased goodness of fit is valuable in practice. Finally, our experiments suggest that trends obtained from global models are too general for practical recommendations. At the same time, local models provide a multitude of trends which are only valid for specific subsets of the data. Instead, we advocate the use of trends obtained from global models that take into account local characteristics, as they combine the best of both worlds.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131544271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GHTorrent: Github's data from a firehose
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224294
Georgios Gousios, D. Spinellis
A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects' repositories and the events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and to offer it to the research community as a service. In this paper, we present the project's design and initial implementation and demonstrate how the provided datasets can be queried and processed.
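As a small illustration of the kind of data the GitHub REST API exposes (not of GHTorrent's own pipeline), the sketch below fetches a repository's public events and recent commits with plain HTTP; a personal access token is assumed to be available to avoid the low anonymous rate limit, and the repository name is an arbitrary example.

```python
import os
import requests

# Assumes a personal access token in the environment; anonymous requests are rate-limited.
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
repo = "rails/rails"  # example repository

events = requests.get(f"https://api.github.com/repos/{repo}/events",
                      headers=headers, timeout=30).json()
for event in events[:5]:
    print(event["type"], event["created_at"])

commits = requests.get(f"https://api.github.com/repos/{repo}/commits",
                       params={"per_page": 5}, headers=headers, timeout=30).json()
for commit in commits:
    print(commit["sha"][:7], commit["commit"]["message"].splitlines()[0])
```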
{"title":"GHTorrent: Github's data from a firehose","authors":"Georgios Gousios, D. Spinellis","doi":"10.1109/MSR.2012.6224294","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224294","url":null,"abstract":"A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects' repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable off line mirror of GitHub's event streams and persistent data, and offer it to the research community as a service. In this paper, we present the project's design and initial implementation and demonstrate how the provided datasets can be queried and processed.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121275316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A qualitative study on performance bugs
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224281
Shahed Zaman, Bram Adams, A. Hassan
Software performance is one of the important qualities that make software stand out in a competitive market. However, in earlier work we found that performance bugs take more time to fix, need to be fixed by more experienced developers, and require changes to more code than non-performance bugs. To improve the resolution of performance bugs, a better understanding is needed of current practices and shortcomings in reporting, reproducing, tracking and fixing them. This paper qualitatively studies a random sample of 400 performance and non-performance bug reports of Mozilla Firefox and Google Chrome across four dimensions (Impact, Context, Fix and Fix validation). We found that developers and users face problems in reproducing performance bugs and have to spend more time discussing performance bugs than other kinds of bugs. Sometimes performance regressions are tolerated as a tradeoff to improve something else.
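One mechanical step in such a study, drawing the sample for manual coding, can be made repeatable; the sketch below draws a fixed-size random sample from each stratum (performance vs. non-performance reports per project) with a seeded generator. The bug ID lists and stratum sizes are placeholders, not the paper's actual sampling procedure.

```python
import random

def stratified_sample(strata, per_stratum, seed=42):
    """Draw `per_stratum` bug IDs from each stratum, reproducibly."""
    rng = random.Random(seed)
    return {name: rng.sample(ids, per_stratum) for name, ids in strata.items()}

# Placeholder ID lists; real IDs would come from the Firefox and Chrome trackers.
strata = {
    "firefox-perf": list(range(1000, 2000)),
    "firefox-other": list(range(2000, 3000)),
    "chrome-perf": list(range(3000, 4000)),
    "chrome-other": list(range(4000, 5000)),
}
sample = stratified_sample(strata, per_stratum=100)  # 4 x 100 = 400 reports in total
print({name: ids[:3] for name, ids in sample.items()})  # preview the first few IDs
```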
{"title":"A qualitative study on performance bugs","authors":"Shahed Zaman, Bram Adams, A. Hassan","doi":"10.1109/MSR.2012.6224281","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224281","url":null,"abstract":"Software performance is one of the important qualities that makes software stand out in a competitive market. However, in earlier work we found that performance bugs take more time to fix, need to be fixed by more experienced developers and require changes to more code than non-performance bugs. In order to be able to improve the resolution of performance bugs, a better understanding is needed of the current practice and shortcomings of reporting, reproducing, tracking and fixing performance bugs. This paper qualitatively studies a random sample of 400 performance and non-performance bug reports of Mozilla Firefox and Google Chrome across four dimensions (Impact, Context, Fix and Fix validation). We found that developers and users face problems in reproducing performance bugs and have to spend more time discussing performance bugs than other kinds of bugs. Sometimes performance regressions are tolerated as a tradeoff to improve something else.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122721920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
App store mining and analysis: MSR for app stores
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224306
M. Harman, Yue Jia, Yuanyuan Zhang
This paper introduces app store mining and analysis as a form of software repository mining. Unlike other software repositories traditionally used in MSR work, app stores usually do not provide source code. However, they do provide a wealth of other information in the form of pricing and customer reviews. Therefore, we use data mining to extract feature information, which we then combine with more readily available information to analyse apps' technical, customer and business aspects. We applied our approach to the 32,108 non-zero-priced apps available in the Blackberry app store in September 2011. Our results show that there is a strong correlation between customer rating and the rank of app downloads, though perhaps surprisingly, there is no correlation between price and downloads, nor between price and rating. More importantly, we show that these correlation findings carry over to (and are even occasionally enhanced within) the space of data-mined app features, providing evidence that our “App store MSR” approach can be valuable to app developers.
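Correlation analyses of this kind are typically rank-based, since download counts are often only available as ranks; the sketch below computes Spearman correlations between price, rating, and download rank on made-up records, purely to illustrate the procedure rather than to reproduce the paper's results.

```python
from scipy.stats import spearmanr

# Made-up app records: (price in $, average customer rating, download rank).
apps = [(0.99, 4.5, 3), (2.99, 3.0, 57), (0.99, 4.0, 12),
        (4.99, 4.8, 8), (1.99, 2.5, 90), (6.99, 3.5, 40)]
prices, ratings, ranks = zip(*apps)

for name, series in [("price vs. downloads", (prices, ranks)),
                     ("price vs. rating", (prices, ratings)),
                     ("rating vs. downloads", (ratings, ranks))]:
    rho, p = spearmanr(*series)
    print(f"{name}: rho={rho:.2f}, p={p:.2f}")
```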
{"title":"App store mining and analysis: MSR for app stores","authors":"M. Harman, Yue Jia, Yuanyuan Zhang","doi":"10.1109/MSR.2012.6224306","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224306","url":null,"abstract":"This paper introduces app store mining and analysis as a form of software repository mining. Unlike other software repositories traditionally used in MSR work, app stores usually do not provide source code. However, they do provide a wealth of other information in the form of pricing and customer reviews. Therefore, we use data mining to extract feature information, which we then combine with more readily available information to analyse apps' technical, customer and business aspects. We applied our approach to the 32,108 non-zero priced apps available in the Blackberry app store in September 2011. Our results show that there is a strong correlation between customer rating and the rank of app downloads, though perhaps surprisingly, there is no correlation between price and downloads, nor between price and rating. More importantly, we show that these correlation findings carry over to (and are even occasionally enhanced within) the space of data mined app features, providing evidence that our `App store MSR' approach can be valuable to app developers.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126850623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mining usage data and development artifacts
Pub Date: 2012-06-02 | DOI: 10.1109/MSR.2012.6224305
Olga Baysal, Reid Holmes, Michael W. Godfrey
Software repository mining techniques generally focus on analyzing, unifying, and querying different kinds of development artifacts, such as source code, version control meta-data, defect tracking data, and electronic communication. In this work, we demonstrate how adding real-world usage data enables addressing broader questions of how software systems are actually used in practice and, by inference, how development characteristics ultimately affect deployment, adoption, and usage. In particular, we explore how usage data extracted from web server logs can be unified with product release history to study questions that concern both users' detailed dynamic behaviour and broad adoption trends across different deployment environments. To validate our approach, we performed a study of two open source web browsers: Firefox and Chrome. We found that while Chrome is being adopted at a consistent rate across platforms, Linux users adopt Firefox at an order of magnitude higher rate. Also, Firefox adoption has been concentrated mainly in North America, while Chrome users appear to be more evenly distributed across the globe. Finally, we detected no evidence of age-specific differences in navigation behaviour between Chrome and Firefox users; however, we hypothesize that younger users are more likely to run up-to-date versions than more mature users.
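Adoption data of this kind is usually recovered from the User-Agent field of web server access logs; the sketch below pulls out the browser family and major version with a regular expression and tallies them. The log line format and the regex are simplified assumptions, not the paper's extraction pipeline.

```python
import re
from collections import Counter

# Simplified Apache-style log lines; real logs carry many more fields.
log_lines = [
    '1.2.3.4 - - [10/May/2012] "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64) Firefox/12.0"',
    '5.6.7.8 - - [10/May/2012] "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 6.1) Chrome/18.0.1025.168 Safari/535.19"',
    '9.9.9.9 - - [11/May/2012] "GET / HTTP/1.1" 200 "Mozilla/5.0 (Macintosh) Firefox/10.0"',
]

browser_re = re.compile(r"(Firefox|Chrome)/(\d+)")
versions = Counter()
for line in log_lines:
    match = browser_re.search(line)
    if match:
        versions[f"{match.group(1)} {match.group(2)}"] += 1

print(versions)  # e.g. Counter({'Firefox 12': 1, 'Chrome 18': 1, 'Firefox 10': 1})
```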
{"title":"Mining usage data and development artifacts","authors":"Olga Baysal, Reid Holmes, Michael W. Godfrey","doi":"10.1109/MSR.2012.6224305","DOIUrl":"https://doi.org/10.1109/MSR.2012.6224305","url":null,"abstract":"Software repository mining techniques generally focus on analyzing, unifying, and querying different kinds of development artifacts, such as source code, version control meta-data, defect tracking data, and electronic communication. In this work, we demonstrate how adding real-world usage data enables addressing broader questions of how software systems are actually used in practice, and by inference how development characteristics ultimately affect deployment, adoption, and usage. In particular, we explore how usage data that has been extracted from web server logs can be unified with product release history to study questions that concern both users' detailed dynamic behaviour as well as broad adoption trends across different deployment environments. To validate our approach, we performed a study of two open source web browsers: Firefox and Chrome. We found that while Chrome is being adopted at a consistent rate across platforms, Linux users have an order of magnitude higher rate of Firefox adoption. Also, Firefox adoption has been concentrated mainly in North America, while Chrome users appear to be more evenly distributed across the globe. Finally, we detected no evidence in age-specific differences in navigation behaviour among Chrome and Firefox users; however, we hypothesize that younger users are more likely to have more up-to-date versions than more mature users.","PeriodicalId":383774,"journal":{"name":"2012 9th IEEE Working Conference on Mining Software Repositories (MSR)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125155500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}