Nikolaos Bafatakis, Niels Boecker, Wenjie Boon, Martin Cabello Salazar, J. Krinke, Gazi Oznacar, Robert White
Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between -10 and 20, we found a strong correlation (r = -0.87, p < 10^-7) between the vote score a post received and the average number of violations per statement for snippets in such posts.
{"title":"Python Coding Style Compliance on Stack Overflow","authors":"Nikolaos Bafatakis, Niels Boecker, Wenjie Boon, Martin Cabello Salazar, J. Krinke, Gazi Oznacar, Robert White","doi":"10.1109/MSR.2019.00042","DOIUrl":"https://doi.org/10.1109/MSR.2019.00042","url":null,"abstract":"Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. 
Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between -10 and 20, we found a strong correlation (r = -0.87, p < 10^-7) between the vote score a post received and the average number of violations per statement for snippets in such posts.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"52 1","pages":"210-214"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78098230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
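The correlation the paper reports between vote score and violations per statement is a standard Pearson r. A minimal sketch of that computation, on hypothetical (vote score, violations-per-statement) pairs rather than the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical posts: vote scores in [-10, 20] and the average number of
# style violations per statement for the snippets in each post.
scores = [-10, -5, 0, 5, 10, 15, 20]
violations = [1.9, 1.5, 1.2, 0.9, 0.7, 0.5, 0.3]

r = pearson_r(scores, violations)  # strongly negative on this made-up data
```

The study's r = -0.87 says that, within that score range, higher-voted posts tend to carry cleaner snippets; the sketch only illustrates the statistic itself.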
Many techniques have been proposed for mining software repositories, predicting code quality and evaluating code changes. Prior work has established links between code ownership and churn metrics, and software quality at file and directory level based on changes that fix bugs. Other metrics have been used to evaluate individual code changes based on preceding changes that induce fixes. This paper combines the two approaches in an empirical study of assessing risk of code changes using established code ownership and churn metrics with fix inducing changes on a large proprietary code repository. We establish a machine learning model for change risk classification which achieves average precision of 0.76 using metrics from prior works and 0.90 using a wider array of metrics. Our results suggest that code ownership metrics can be applied in change risk classification models based on fix inducing changes.
{"title":"Empirical Study in using Version Histories for Change Risk Classification","authors":"Max Kiehn, Xiangyi Pan, F. Camci","doi":"10.1109/MSR.2019.00018","DOIUrl":"https://doi.org/10.1109/MSR.2019.00018","url":null,"abstract":"Many techniques have been proposed for mining software repositories, predicting code quality and evaluating code changes. Prior work has established links between code ownership and churn metrics, and software quality at file and directory level based on changes that fix bugs. Other metrics have been used to evaluate individual code changes based on preceding changes that induce fixes. This paper combines the two approaches in an empirical study of assessing risk of code changes using established code ownership and churn metrics with fix inducing changes on a large proprietary code repository. We establish a machine learning model for change risk classification which achieves average precision of 0.76 using metrics from prior works and 0.90 using a wider array of metrics. Our results suggest that code ownership metrics can be applied in change risk classification models based on fix inducing changes.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"184 1","pages":"58-62"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86086960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
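The ownership and churn metrics this study builds on can be computed from a per-file change log. A hedged sketch (the record layout and numbers are hypothetical, not the paper's proprietary data):

```python
from collections import Counter

# Hypothetical change log: (file, author, lines_added, lines_deleted) per commit.
changes = [
    ("core.py", "alice", 120, 10),
    ("core.py", "alice", 30, 5),
    ("core.py", "bob", 4, 2),
    ("util.py", "bob", 50, 0),
]

def ownership(path, log):
    """Fraction of changes to `path` made by its top contributor."""
    authors = Counter(a for f, a, *_ in log if f == path)
    return max(authors.values()) / sum(authors.values())

def churn(path, log):
    """Total lines added plus deleted for `path` across all changes."""
    return sum(add + rm for f, _, add, rm in log if f == path)
```

In a change-risk model like the one described, such per-file metrics become features, with fix-inducing changes supplying the risk labels.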
Adithya Raghuraman, Truong Ho-Quang, M. Chaudron, A. Serebrenik, Bogdan Vasilescu
The benefits of modeling the design to improve the quality and maintainability of software systems have long been advocated and recognized. Yet, the empirical evidence on this remains scarce. In this paper, we fill this gap by reporting on an empirical study of the relationship between UML modeling and software defect proneness in a large sample of open-source GitHub projects. Using statistical modeling, and controlling for confounding variables, we show that projects containing traces of UML models in their repositories experience, on average, a small but statistically significant difference in the number of software defects (as mined from their issue trackers) compared to projects without traces of UML models.
{"title":"Does UML Modeling Associate with Lower Defect Proneness?: A Preliminary Empirical Investigation","authors":"Adithya Raghuraman, Truong Ho-Quang, M. Chaudron, A. Serebrenik, Bogdan Vasilescu","doi":"10.1109/MSR.2019.00024","DOIUrl":"https://doi.org/10.1109/MSR.2019.00024","url":null,"abstract":"The benefits of modeling the design to improve the quality and maintainability of software systems have long been advocated and recognized. Yet, the empirical evidence on this remains scarce. In this paper, we fill this gap by reporting on an empirical study of the relationship between UML modeling and software defect proneness in a large sample of open-source GitHub projects. Using statistical modeling, and controlling for confounding variables, we show that projects containing traces of UML models in their repositories experience, on average, a small but statistically significant difference in the number of software defects (as mined from their issue trackers) compared to projects without traces of UML models.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"58 1","pages":"101-104"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80461443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stack Overflow is the most widely used online question-and-answer platform for software developers to solve problems and share experience. Stack Overflow believes in the power of community editing, which means that one is able to edit questions without the changes going through peer review. Stack Overflow users may make edits to questions for a variety of reasons, among others to improve the question and try to obtain more answers. However, to date, the relationship between edit actions on questions and the number of answers they collect is unknown. In this paper, we perform an empirical study on Stack Overflow to understand the relationship between edit actions and the number of answers obtained, across different attributes of the edited questions. We find that questions are most commonly edited by their owners, with relatively large changes to the question body, before obtaining an accepted answer. However, edited questions that obtained more answers in a shorter time were edited by other users rather than the question owners, and their edits tended to be small, focused on titles, and added addenda.
{"title":"What Edits are Done on the Highly Answered Questions in Stack Overflow? An Empirical Study","authors":"Xianhao Jin, Francisco Servant","doi":"10.1109/MSR.2019.00045","DOIUrl":"https://doi.org/10.1109/MSR.2019.00045","url":null,"abstract":"Stack Overflow is the most-widely-used online question-and-answer platform for software developers to solve problems and communicate experience. Stack Overflow believes in the power of community editing, which means that one is able to edit questions without the changes going through peer review. Stack Overflow users may make edits to questions for a variety of reasons, among others, to improve the question and try to obtain more answers. However, to date the relationship between edit actions on questions and the number of answers that they collect is unknown. In this paper, we perform an empirical study on Stack Overflow to understand the relationship between edit actions and number of answers obtained in different dimensions from different attributes of the edited questions. We find that questions are more commonly edited by question owners, on bodies with relatively big changes before obtaining an accepted answer. 
However, edited questions that obtained more answers in a shorter time were edited by other users rather than question owners, and their edits tended to be small, focused on titles, and added addenda.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"31 4 1","pages":"225-229"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77253625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction: The establishment of the Mining Software Repositories (MSR) Data Showcase conference track has encouraged researchers to provide more data sets as a basis for further empirical studies. Objectives: Examine the usage of the data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Methods: Data track papers were collected from the MSR Data Showcase and through the manual inspection of older MSR proceedings. The use of data papers was established through citation searching followed by reading the studies that have cited them. Data papers were then clustered based on their content, whereas their citations were classified according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. Results: We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of citations. MSR data papers are cited less than other MSR papers. A considerable number of the citations stem from the teams that authored the data papers. Publications providing repository data and metadata are the most frequent data papers and the most often cited ones. Mobile application data papers are the least common ones, but the second most frequently cited. Conclusion: Data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. This can be done by setting a higher bar for their publication, by encouraging their use, and by providing incentives for the enrichment of existing data collections.
{"title":"Standing on Shoulders or Feet? The Usage of the MSR Data Papers","authors":"Zoe Kotti, D. Spinellis","doi":"10.1109/MSR.2019.00085","DOIUrl":"https://doi.org/10.1109/MSR.2019.00085","url":null,"abstract":"Introduction: The establishment of the Mining Software Repositories (MSR) Data Showcase conference track has encouraged researchers to provide more data sets as a basis for further empirical studies. Objectives: Examine the usage of the data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Methods: Data track papers were collected from the MSR Data Showcase and through the manual inspection of older MSR proceedings. The use of data papers was established through citation searching followed by reading the studies that have cited them. Data papers were then clustered based on their content, whereas their citations were classified according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. Results: We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of citations. MSR data papers are cited less than other MSR papers. A considerable number of the citations stem from the teams that authored the data papers. Publications providing repository data and metadata are the most frequent data papers and the most often cited ones. Mobile application data papers are the least common ones, but the second most frequently cited. Conclusion: Data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. 
This can be done by setting a higher bar for their publication, by encouraging their use, and by providing incentives for the enrichment of existing data collections.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"39 1","pages":"565-576"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88441585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software merging researchers constantly need empirical data of real-world merge scenarios to analyze. Such data is currently extracted through individual and isolated efforts, often with non-systematically designed scripts that may not easily scale to large studies. This hinders replication and proper comparison of results. In this paper, we introduce MERGANSER, a scalable and easy-to-use tool for extracting and analyzing merge scenarios in Git repositories. In addition to extracting basic information about merge scenarios from Git history, our tool also replays each merge to detect conflicts and stores the corresponding information of conflicting files and regions. We design a normalized and extensible SQL data schema to store the information of the analyzed repositories, merge scenarios and involved commits, and merge replays and conflicts. By running only one command, our proposed tool clones the target repositories, detects their merge scenarios, and stores their information in a SQL database. MERGANSER is written in Python and released under the MIT license. In this tool paper, we describe MERGANSER's architecture and provide guidance for its usage in practice.
{"title":"Scalable Software Merging Studies with MERGANSER","authors":"Moein Owhadi-Kareshk, Sarah Nadi","doi":"10.1109/MSR.2019.00084","DOIUrl":"https://doi.org/10.1109/MSR.2019.00084","url":null,"abstract":"Software merging researchers constantly need empirical data of real-world merge scenarios to analyze. Such data is currently extracted through individual and isolated efforts, often with non-systematically designed scripts that may not easily scale to large studies. This hinders replication and proper comparison of results. In this paper, we introduce MERGANSER, a scalable and easy-to-use tool for extracting and analyzing merge scenarios in Git repositories. In addition to extracting basic information about merge scenarios from Git history, our tool also replays each merge to detect conflicts and stores the corresponding information of conflicting files and regions. We design a normalized and extensible SQL data schema to store the information of the analyzed repositories, merge scenarios and involved commits, and merge replays and conflicts. By running only one command, our proposed tool clones the target repositories, detects their merge scenarios, and stores their information in a SQL database. MERGANSER is written in Python and released under the MIT license. 
In this tool paper, we describe MERGANSER's architecture and provide guidance for its usage in practice.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"81 1","pages":"560-564"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80973613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
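The replay step the abstract describes — re-running each merge to detect conflicts — can be approximated with plain Git: attempt the merge, check the index for unmerged entries, then abort. A self-contained sketch under that assumption (not MERGANSER's actual code), which builds a throwaway repository whose two branches edit the same line:

```python
import pathlib
import subprocess
import tempfile

# Inline identity config so commits work in a bare environment.
GIT = ["git", "-c", "user.name=msr", "-c", "user.email=msr@example.com"]

def git(cwd, *args):
    return subprocess.run(GIT + list(args), cwd=cwd, capture_output=True, text=True)

def replay_merge(repo, branch):
    """Re-run the merge of `branch` into the current branch; report conflicts."""
    git(repo, "merge", "--no-commit", "--no-ff", branch)
    conflicted = git(repo, "ls-files", "-u").stdout != ""  # unmerged index entries
    git(repo, "merge", "--abort")  # leave the repository clean again
    return conflicted

with tempfile.TemporaryDirectory() as d:
    f = pathlib.Path(d, "a.txt")
    git(d, "init", "-q")
    f.write_text("base\n"); git(d, "add", "a.txt"); git(d, "commit", "-qm", "base")
    main = git(d, "symbolic-ref", "--short", "HEAD").stdout.strip()
    git(d, "checkout", "-qb", "feature")
    f.write_text("feature\n"); git(d, "commit", "-qam", "feature")
    git(d, "checkout", "-q", main)
    f.write_text("mainline\n"); git(d, "commit", "-qam", "mainline")
    has_conflict = replay_merge(d, "feature")  # both branches changed the same line
```

A tool doing this at scale would additionally record which files and regions conflicted, as MERGANSER's schema does.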
A. Wickert, Michael Reif, Michael Eichberg, Anam Dodhy, M. Mezini
Cryptographic APIs (Crypto APIs) provide the foundations for the development of secure applications. Unfortunately, most applications do not use Crypto APIs securely and end up being insecure, e.g., by the usage of an outdated algorithm, a constant initialization vector, or an inappropriate hashing algorithm. Two different studies [1], [2] have recently shown that 88% to 95% of those applications using Crypto APIs are insecure due to misuses. To facilitate further research on these kinds of misuses, we created a collection of 201 misuses found in real-world applications along with a classification of those misuses. In the provided dataset, each misuse consists of the corresponding open-source project, the project's build information, a description of the misuse, and the misuse's location. Further, we integrated our dataset into MUBench [3], a benchmark for API misuse detection. Our dataset provides a foundation for research on Crypto API misuses. For example, it can be used to evaluate the precision and recall of detection tools, as a foundation for studies related to Crypto API misuses, or as a training set.
{"title":"A Dataset of Parametric Cryptographic Misuses","authors":"A. Wickert, Michael Reif, Michael Eichberg, Anam Dodhy, M. Mezini","doi":"10.1109/MSR.2019.00023","DOIUrl":"https://doi.org/10.1109/MSR.2019.00023","url":null,"abstract":"Cryptographic APIs (Crypto APIs) provide the foundations for the development of secure applications. Unfortunately, most applications do not use Crypto APIs securely and end up being insecure, e.g., by the usage of an outdated algorithm, a constant initialization vector, or an inappropriate hashing algorithm. Two different studies [1], [2] have recently shown that 88% to 95% of those applications using Crypto APIs are insecure due to misuses. To facilitate further research on these kinds of misuses, we created a collection of 201 misuses found in real-world applications along with a classification of those misuses. In the provided dataset, each misuse consists of the corresponding open-source project, the project's build information, a description of the misuse, and the misuse's location. Further, we integrated our dataset into MUBench [3], a benchmark for API misuse detection. Our dataset provides a foundation for research on Crypto API misuses. 
For example, it can be used to evaluate the precision and recall of detection tools, as a foundation for studies related to Crypto API misuses, or as a training set.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"9 1","pages":"96-100"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85300880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
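The misuse categories named in the abstract (outdated algorithm, constant initialization vector) follow a common shape: a deterministic, weak primitive where a randomized, strong one is needed. A hypothetical illustration in that spirit using only the standard library — an analogy to the dataset's categories, not an entry from it:

```python
import hashlib
import os

# Misuse pattern: broken digest (MD5) plus a constant salt, so equal
# inputs always produce equal digests, analogous to a constant IV.
CONSTANT_SALT = b"\x00" * 8

def weak_hash(password: bytes) -> bytes:
    return hashlib.md5(CONSTANT_SALT + password).digest()  # outdated algorithm

def strong_hash(password: bytes) -> tuple[bytes, bytes]:
    """Key-stretching hash with a fresh random salt per call."""
    salt = os.urandom(16)
    return salt, hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)
```

Detectors evaluated against such a dataset flag exactly this kind of call pattern: the algorithm name and the provenance of the salt/IV argument.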
Jens Dietrich, David J. Pearce, Jacob Stringer, Amjed Tahir, Kelly Blincoe
Many modern software systems are built on top of existing packages (modules, components, libraries). The increasing number and complexity of dependencies has given rise to automated dependency management, where package managers resolve symbolic dependencies against a central repository. When declaring dependencies, developers face various choices, such as whether to declare a fixed version or a range of versions. The former results in runtime behaviour that is easier to predict, whilst the latter enables flexibility in resolution that can, for example, prevent different versions of the same package being included and facilitates the automated deployment of bug fixes. We study the choices developers make across 17 different package managers, investigating over 70 million dependencies. This is complemented by a survey of 170 developers. We find that many package managers support — and the respective community adopts — flexible versioning practices. This does not always work: developers struggle to find the sweet spot between the predictability of fixed-version dependencies and the agility of flexible ones, and adjust their practices as they gain experience. We see some uptake of semantic versioning in some package managers, supported by tools. However, there is no evidence that projects switch to semantic versioning on a large scale. The results of this study can guide further research into better practices for automated dependency management, and aid the adoption of semantic versioning.
{"title":"Dependency Versioning in the Wild","authors":"Jens Dietrich, David J. Pearce, Jacob Stringer, Amjed Tahir, Kelly Blincoe","doi":"10.1109/MSR.2019.00061","DOIUrl":"https://doi.org/10.1109/MSR.2019.00061","url":null,"abstract":"Many modern software systems are built on top of existing packages (modules, components, libraries). The increasing number and complexity of dependencies has given rise to automated dependency management where package managers resolve symbolic dependencies against a central repository. When declaring dependencies, developers face various choices, such as whether or not to declare a fixed version or a range of versions. The former results in runtime behaviour that is easier to predict, whilst the latter enables flexibility in resolution that can, for example, prevent different versions of the same package being included and facilitates the automated deployment of bug fixes. We study the choices developers make across 17 different package managers, investigating over 70 million dependencies. This is complemented by a survey of 170 developers. We find that many package managers support — and the respective community adopts — flexible versioning practices. This does not always work: developers struggle to find the sweet spot between the predictability of fixed version dependencies, and the agility of flexible ones, and depending on their experience, adjust practices. We see some uptake of semantic versioning in some package managers, supported by tools. However, there is no evidence that projects switch to semantic versioning on a large scale. 
The results of this study can guide further research into better practices for automated dependency management, and aid the adoption of semantic versioning.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"44 1","pages":"349-359"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76464029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
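The fixed-versus-flexible trade-off the study examines can be made concrete with a range check. A simplified sketch of an npm-style caret range — a toy model, not any package manager's actual resolver (real resolvers also treat 0.x majors specially):

```python
def parse(version: str) -> tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    major, minor, patch = (int(p) for p in version.split("."))
    return major, minor, patch

def satisfies_caret(version: str, base: str) -> bool:
    """'^base': at least `base`, and the same major version.

    A fixed dependency accepts exactly one version; this range accepts
    every backward-compatible release, which is the flexibility (and the
    unpredictability) the abstract describes.
    """
    v, b = parse(version), parse(base)
    return v[0] == b[0] and v >= b
```

Under "^1.2.3", a resolver may pick up 1.4.0 automatically (e.g. a bug fix) but never 2.0.0, which would signal a breaking change under semantic versioning.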
Dimitris Mitropoulos, P. Louridas, Vitalis Salis, D. Spinellis
JavaScript is one of the web's key building blocks. It is used by the majority of web sites and it is supported by all modern browsers. We present the first large-scale study of client-side JavaScript code over time. Specifically, we have collected and analyzed a dataset containing daily snapshots of JavaScript code coming from Alexa's Top 10000 web sites (~7.5 GB per day) for nine consecutive months, to study different temporal aspects of web client code. We found that scripts change often, typically every few days, indicating a rapid pace of web application development. We also found that the lifetime of web sites themselves, measured as the time between JavaScript changes, is also short, on the same time scale. We then performed a qualitative analysis to investigate the nature of the changes that take place. We found that apart from standard changes such as the introduction of new functions, many changes are related to online configuration management. In addition, we examined JavaScript code reuse over time and especially the widespread reliance on third-party libraries. Furthermore, we observed how quality issues evolve by employing established static analysis tools to identify potential software bugs, whose evolution we tracked over time. Our results show that quality issues seem to persist over time, while vulnerable libraries tend to decrease.
{"title":"Time Present and Time Past: Analyzing the Evolution of JavaScript Code in the Wild","authors":"Dimitris Mitropoulos, P. Louridas, Vitalis Salis, D. Spinellis","doi":"10.1109/MSR.2019.00029","DOIUrl":"https://doi.org/10.1109/MSR.2019.00029","url":null,"abstract":"JavaScript is one of the web's key building blocks. It is used by the majority of web sites and it is supported by all modern browsers. We present the first large-scale study of client-side JavaScript code over time. Specifically, we have collected and analyzed a dataset containing daily snapshots of JavaScript code coming from Alexa's Top 10000 web sites (~7.5 GB per day) for nine consecutive months, to study different temporal aspects of web client code. We found that scripts change often; typically every few days, indicating a rapid pace in web applications development. We also found that the lifetime of web sites themselves, measured as the time between JavaScript changes, is also short, in the same time scale. We then performed a qualitative analysis to investigate the nature of the changes that take place. We found that apart from standard changes such as the introduction of new functions, many changes are related to online configuration management. In addition, we examined JavaScript code reuse over time and especially the widespread reliance on third-party libraries. Furthermore, we observed how quality issues evolve by employing established static analysis tools to identify potential software bugs, whose evolution we tracked over time. 
Our results show that quality issues seem to persist over time, while vulnerable libraries tend to decrease.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"74 1","pages":"126-137"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88040738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus
Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.
{"title":"World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data","authors":"Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus","doi":"10.1109/MSR.2019.00031","DOIUrl":"https://doi.org/10.1109/MSR.2019.00031","url":null,"abstract":"Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. 
Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"143-154"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87763930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
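The per-object view WoC provides — 12B+ git objects across all of FLOSS — boils down to enumerating every object in every repository. A minimal single-repository sketch of such a traversal using Git plumbing (WoC itself shards and indexes this at a vastly larger scale); the demo repository is built on the fly:

```python
import pathlib
import subprocess
import tempfile

def object_counts(repo: str) -> dict:
    """Count a repository's Git objects by type (commit/tree/blob/tag)."""
    out = subprocess.run(
        ["git", "cat-file", "--batch-check", "--batch-all-objects"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    counts: dict = {}
    for line in out.splitlines():  # each line: "<sha> <type> <size>"
        _sha, otype, _size = line.split()
        counts[otype] = counts.get(otype, 0) + 1
    return counts

with tempfile.TemporaryDirectory() as d:
    subprocess.run(["git", "init", "-q"], cwd=d, check=True)
    pathlib.Path(d, "f").write_text("hello\n")
    subprocess.run(["git", "add", "f"], cwd=d, check=True)
    subprocess.run(["git", "-c", "user.name=woc", "-c", "user.email=woc@example.com",
                    "commit", "-qm", "init"], cwd=d, check=True)
    counts = object_counts(d)  # one commit, its root tree, and one blob
```

Cross-project analyses like WoC's then relate identical blob hashes across repositories to trace code sharing and dependencies.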