Nikolaos Bafatakis, Niels Boecker, Wenjie Boon, Martin Cabello Salazar, J. Krinke, Gazi Oznacar, Robert White
Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between -10 and 20, we found a strong correlation (r = -0.87, p < 10^-7) between the vote score a post received and the average number of violations per statement for snippets in such posts.
{"title":"Python Coding Style Compliance on Stack Overflow","authors":"Nikolaos Bafatakis, Niels Boecker, Wenjie Boon, Martin Cabello Salazar, J. Krinke, Gazi Oznacar, Robert White","doi":"10.1109/MSR.2019.00042","DOIUrl":"https://doi.org/10.1109/MSR.2019.00042","url":null,"abstract":"Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. 
Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between -10 and 20, we found a strong correlation (r = -0.87, p < 10^-7) between the vote score a post received and the average number of violations per statement for snippets in such posts.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"52 1","pages":"210-214"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78098230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
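The correlation the paper reports between vote score and violations per statement is a standard Pearson r. A minimal sketch of that computation, on hypothetical (vote score, violations-per-statement) pairs rather than the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical posts: vote scores in [-10, 20] and the average number of
# style violations per statement for the snippets in each post.
scores = [-10, -5, 0, 5, 10, 15, 20]
violations = [1.9, 1.5, 1.2, 0.9, 0.7, 0.5, 0.3]

r = pearson_r(scores, violations)  # strongly negative on this made-up data
```

The study's r = -0.87 says that, within that score range, higher-voted posts tend to carry cleaner snippets; the sketch only illustrates the statistic itself.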
Many techniques have been proposed for mining software repositories, predicting code quality and evaluating code changes. Prior work has established links between code ownership and churn metrics, and software quality at file and directory level based on changes that fix bugs. Other metrics have been used to evaluate individual code changes based on preceding changes that induce fixes. This paper combines the two approaches in an empirical study of assessing risk of code changes using established code ownership and churn metrics with fix inducing changes on a large proprietary code repository. We establish a machine learning model for change risk classification which achieves average precision of 0.76 using metrics from prior works and 0.90 using a wider array of metrics. Our results suggest that code ownership metrics can be applied in change risk classification models based on fix inducing changes.
{"title":"Empirical Study in using Version Histories for Change Risk Classification","authors":"Max Kiehn, Xiangyi Pan, F. Camci","doi":"10.1109/MSR.2019.00018","DOIUrl":"https://doi.org/10.1109/MSR.2019.00018","url":null,"abstract":"Many techniques have been proposed for mining software repositories, predicting code quality and evaluating code changes. Prior work has established links between code ownership and churn metrics, and software quality at file and directory level based on changes that fix bugs. Other metrics have been used to evaluate individual code changes based on preceding changes that induce fixes. This paper combines the two approaches in an empirical study of assessing risk of code changes using established code ownership and churn metrics with fix inducing changes on a large proprietary code repository. We establish a machine learning model for change risk classification which achieves average precision of 0.76 using metrics from prior works and 0.90 using a wider array of metrics. Our results suggest that code ownership metrics can be applied in change risk classification models based on fix inducing changes.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"184 1","pages":"58-62"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86086960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
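The ownership and churn metrics this study builds on can be computed from a per-file change log. A hedged sketch (the record layout and numbers are hypothetical, not the paper's proprietary data):

```python
from collections import Counter

# Hypothetical change log: (file, author, lines_added, lines_deleted) per commit.
changes = [
    ("core.py", "alice", 120, 10),
    ("core.py", "alice", 30, 5),
    ("core.py", "bob", 4, 2),
    ("util.py", "bob", 50, 0),
]

def ownership(path, log):
    """Fraction of changes to `path` made by its top contributor."""
    authors = Counter(a for f, a, *_ in log if f == path)
    return max(authors.values()) / sum(authors.values())

def churn(path, log):
    """Total lines added plus deleted for `path` across all changes."""
    return sum(add + rm for f, _, add, rm in log if f == path)
```

In a change-risk model like the one described, such per-file metrics become features, with fix-inducing changes supplying the risk labels.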
Adithya Raghuraman, Truong Ho-Quang, M. Chaudron, A. Serebrenik, Bogdan Vasilescu
The benefits of modeling the design to improve the quality and maintainability of software systems have long been advocated and recognized. Yet, the empirical evidence on this remains scarce. In this paper, we fill this gap by reporting on an empirical study of the relationship between UML modeling and software defect proneness in a large sample of open-source GitHub projects. Using statistical modeling, and controlling for confounding variables, we show that projects containing traces of UML models in their repositories experience, on average, a small but statistically significant difference in the number of software defects (as mined from their issue trackers) compared to projects without traces of UML models.
{"title":"Does UML Modeling Associate with Lower Defect Proneness?: A Preliminary Empirical Investigation","authors":"Adithya Raghuraman, Truong Ho-Quang, M. Chaudron, A. Serebrenik, Bogdan Vasilescu","doi":"10.1109/MSR.2019.00024","DOIUrl":"https://doi.org/10.1109/MSR.2019.00024","url":null,"abstract":"The benefits of modeling the design to improve the quality and maintainability of software systems have long been advocated and recognized. Yet, the empirical evidence on this remains scarce. In this paper, we fill this gap by reporting on an empirical study of the relationship between UML modeling and software defect proneness in a large sample of open-source GitHub projects. Using statistical modeling, and controlling for confounding variables, we show that projects containing traces of UML models in their repositories experience, on average, a small but statistically significant difference in the number of software defects (as mined from their issue trackers) compared to projects without traces of UML models.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"58 1","pages":"101-104"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80461443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stack Overflow is the most widely used online question-and-answer platform for software developers to solve problems and share experience. Stack Overflow believes in the power of community editing, which means that one is able to edit questions without the changes going through peer review. Stack Overflow users may make edits to questions for a variety of reasons, among others to improve the question and try to obtain more answers. However, to date, the relationship between edit actions on questions and the number of answers they collect is unknown. In this paper, we perform an empirical study on Stack Overflow to understand the relationship between edit actions and the number of answers obtained, across different attributes of the edited questions. We find that questions are most commonly edited by their owners, with relatively large changes to the question body, before obtaining an accepted answer. However, edited questions that obtained more answers in a shorter time were edited by other users rather than the question owners, and their edits tended to be small, focused on titles, and added addenda.
{"title":"What Edits are Done on the Highly Answered Questions in Stack Overflow? An Empirical Study","authors":"Xianhao Jin, Francisco Servant","doi":"10.1109/MSR.2019.00045","DOIUrl":"https://doi.org/10.1109/MSR.2019.00045","url":null,"abstract":"Stack Overflow is the most-widely-used online question-and-answer platform for software developers to solve problems and communicate experience. Stack Overflow believes in the power of community editing, which means that one is able to edit questions without the changes going through peer review. Stack Overflow users may make edits to questions for a variety of reasons, among others, to improve the question and try to obtain more answers. However, to date the relationship between edit actions on questions and the number of answers that they collect is unknown. In this paper, we perform an empirical study on Stack Overflow to understand the relationship between edit actions and number of answers obtained in different dimensions from different attributes of the edited questions. We find that questions are more commonly edited by question owners, on bodies with relatively big changes before obtaining an accepted answer. 
However, edited questions that obtained more answers in a shorter time were edited by other users rather than question owners, and their edits tended to be small, focused on titles, and added addenda.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"31 4 1","pages":"225-229"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77253625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction: The establishment of the Mining Software Repositories (MSR) Data Showcase conference track has encouraged researchers to provide more data sets as a basis for further empirical studies. Objectives: Examine the usage of the data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Methods: Data track papers were collected from the MSR Data Showcase and through the manual inspection of older MSR proceedings. The use of data papers was established through citation searching followed by reading the studies that have cited them. Data papers were then clustered based on their content, whereas their citations were classified according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. Results: We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of citations. MSR data papers are cited less than other MSR papers. A considerable number of the citations stem from the teams that authored the data papers. Publications providing repository data and metadata are the most frequent data papers and the most often cited ones. Mobile application data papers are the least common ones, but the second most frequently cited. Conclusion: Data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. This can be done by setting a higher bar for their publication, by encouraging their use, and by providing incentives for the enrichment of existing data collections.
{"title":"Standing on Shoulders or Feet? The Usage of the MSR Data Papers","authors":"Zoe Kotti, D. Spinellis","doi":"10.1109/MSR.2019.00085","DOIUrl":"https://doi.org/10.1109/MSR.2019.00085","url":null,"abstract":"Introduction: The establishment of the Mining Software Repositories (MSR) Data Showcase conference track has encouraged researchers to provide more data sets as a basis for further empirical studies. Objectives: Examine the usage of the data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Methods: Data track papers were collected from the MSR Data Showcase and through the manual inspection of older MSR proceedings. The use of data papers was established through citation searching followed by reading the studies that have cited them. Data papers were then clustered based on their content, whereas their citations were classified according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. Results: We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of citations. MSR data papers are cited less than other MSR papers. A considerable number of the citations stem from the teams that authored the data papers. Publications providing repository data and metadata are the most frequent data papers and the most often cited ones. Mobile application data papers are the least common ones, but the second most frequently cited. Conclusion: Data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. 
This can be done by setting a higher bar for their publication, by encouraging their use, and by providing incentives for the enrichment of existing data collections.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"39 1","pages":"565-576"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88441585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software merging researchers constantly need empirical data of real-world merge scenarios to analyze. Such data is currently extracted through individual and isolated efforts, often with non-systematically designed scripts that may not easily scale to large studies. This hinders replication and proper comparison of results. In this paper, we introduce MERGANSER, a scalable and easy-to-use tool for extracting and analyzing merge scenarios in Git repositories. In addition to extracting basic information about merge scenarios from Git history, our tool also replays each merge to detect conflicts and stores the corresponding information of conflicting files and regions. We design a normalized and extensible SQL data schema to store the information of the analyzed repositories, merge scenarios and involved commits, and merge replays and conflicts. By running only one command, our proposed tool clones the target repositories, detects their merge scenarios, and stores their information in a SQL database. MERGANSER is written in Python and released under the MIT license. In this tool paper, we describe MERGANSER's architecture and provide guidance for its usage in practice.
{"title":"Scalable Software Merging Studies with MERGANSER","authors":"Moein Owhadi-Kareshk, Sarah Nadi","doi":"10.1109/MSR.2019.00084","DOIUrl":"https://doi.org/10.1109/MSR.2019.00084","url":null,"abstract":"Software merging researchers constantly need empirical data of real-world merge scenarios to analyze. Such data is currently extracted through individual and isolated efforts, often with non-systematically designed scripts that may not easily scale to large studies. This hinders replication and proper comparison of results. In this paper, we introduce MERGANSER, a scalable and easy-to-use tool for extracting and analyzing merge scenarios in Git repositories. In addition to extracting basic information about merge scenarios from Git history, our tool also replays each merge to detect conflicts and stores the corresponding information of conflicting files and regions. We design a normalized and extensible SQL data schema to store the information of the analyzed repositories, merge scenarios and involved commits, and merge replays and conflicts. By running only one command, our proposed tool clones the target repositories, detects their merge scenarios, and stores their information in a SQL database. MERGANSER is written in Python and released under the MIT license. 
In this tool paper, we describe MERGANSER's architecture and provide guidance for its usage in practice.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"81 1","pages":"560-564"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80973613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
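The replay step the abstract describes — re-running each merge to detect conflicts — can be approximated with plain Git: attempt the merge, check the index for unmerged entries, then abort. A self-contained sketch under that assumption (not MERGANSER's actual code), which builds a throwaway repository whose two branches edit the same line:

```python
import pathlib
import subprocess
import tempfile

# Inline identity config so commits work in a bare environment.
GIT = ["git", "-c", "user.name=msr", "-c", "user.email=msr@example.com"]

def git(cwd, *args):
    return subprocess.run(GIT + list(args), cwd=cwd, capture_output=True, text=True)

def replay_merge(repo, branch):
    """Re-run the merge of `branch` into the current branch; report conflicts."""
    git(repo, "merge", "--no-commit", "--no-ff", branch)
    conflicted = git(repo, "ls-files", "-u").stdout != ""  # unmerged index entries
    git(repo, "merge", "--abort")  # leave the repository clean again
    return conflicted

with tempfile.TemporaryDirectory() as d:
    f = pathlib.Path(d, "a.txt")
    git(d, "init", "-q")
    f.write_text("base\n"); git(d, "add", "a.txt"); git(d, "commit", "-qm", "base")
    main = git(d, "symbolic-ref", "--short", "HEAD").stdout.strip()
    git(d, "checkout", "-qb", "feature")
    f.write_text("feature\n"); git(d, "commit", "-qam", "feature")
    git(d, "checkout", "-q", main)
    f.write_text("mainline\n"); git(d, "commit", "-qam", "mainline")
    has_conflict = replay_merge(d, "feature")  # both branches changed the same line
```

A tool doing this at scale would additionally record which files and regions conflicted, as MERGANSER's schema does.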
A. Wickert, Michael Reif, Michael Eichberg, Anam Dodhy, M. Mezini
Cryptographic APIs (Crypto APIs) provide the foundations for the development of secure applications. Unfortunately, most applications do not use Crypto APIs securely and end up being insecure, e.g., by the usage of an outdated algorithm, a constant initialization vector, or an inappropriate hashing algorithm. Two different studies [1], [2] have recently shown that 88% to 95% of those applications using Crypto APIs are insecure due to misuses. To facilitate further research on these kinds of misuses, we created a collection of 201 misuses found in real-world applications along with a classification of those misuses. In the provided dataset, each misuse consists of the corresponding open-source project, the project's build information, a description of the misuse, and the misuse's location. Further, we integrated our dataset into MUBench [3], a benchmark for API misuse detection. Our dataset provides a foundation for research on Crypto API misuses. For example, it can be used to evaluate the precision and recall of detection tools, as a foundation for studies related to Crypto API misuses, or as a training set.
{"title":"A Dataset of Parametric Cryptographic Misuses","authors":"A. Wickert, Michael Reif, Michael Eichberg, Anam Dodhy, M. Mezini","doi":"10.1109/MSR.2019.00023","DOIUrl":"https://doi.org/10.1109/MSR.2019.00023","url":null,"abstract":"Cryptographic APIs (Crypto APIs) provide the foundations for the development of secure applications. Unfortunately, most applications do not use Crypto APIs securely and end up being insecure, e.g., by the usage of an outdated algorithm, a constant initialization vector, or an inappropriate hashing algorithm. Two different studies [1], [2] have recently shown that 88% to 95% of those applications using Crypto APIs are insecure due to misuses. To facilitate further research on these kinds of misuses, we created a collection of 201 misuses found in real-world applications along with a classification of those misuses. In the provided dataset, each misuse consists of the corresponding open-source project, the project's build information, a description of the misuse, and the misuse's location. Further, we integrated our dataset into MUBench [3], a benchmark for API misuse detection. Our dataset provides a foundation for research on Crypto API misuses. 
For example, it can be used to evaluate the precision and recall of detection tools, as a foundation for studies related to Crypto API misuses, or as a training set.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"9 1","pages":"96-100"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85300880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
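The misuse categories named in the abstract (outdated algorithm, constant initialization vector) follow a common shape: a deterministic, weak primitive where a randomized, strong one is needed. A hypothetical illustration in that spirit using only the standard library — an analogy to the dataset's categories, not an entry from it:

```python
import hashlib
import os

# Misuse pattern: broken digest (MD5) plus a constant salt, so equal
# inputs always produce equal digests, analogous to a constant IV.
CONSTANT_SALT = b"\x00" * 8

def weak_hash(password: bytes) -> bytes:
    return hashlib.md5(CONSTANT_SALT + password).digest()  # outdated algorithm

def strong_hash(password: bytes) -> tuple[bytes, bytes]:
    """Key-stretching hash with a fresh random salt per call."""
    salt = os.urandom(16)
    return salt, hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)
```

Detectors evaluated against such a dataset flag exactly this kind of call pattern: the algorithm name and the provenance of the salt/IV argument.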
Jens Dietrich, David J. Pearce, Jacob Stringer, Amjed Tahir, Kelly Blincoe
Many modern software systems are built on top of existing packages (modules, components, libraries). The increasing number and complexity of dependencies has given rise to automated dependency management, where package managers resolve symbolic dependencies against a central repository. When declaring dependencies, developers face various choices, such as whether to declare a fixed version or a range of versions. The former results in runtime behaviour that is easier to predict, whilst the latter enables flexibility in resolution that can, for example, prevent different versions of the same package being included and facilitates the automated deployment of bug fixes. We study the choices developers make across 17 different package managers, investigating over 70 million dependencies. This is complemented by a survey of 170 developers. We find that many package managers support — and the respective community adopts — flexible versioning practices. This does not always work: developers struggle to find the sweet spot between the predictability of fixed-version dependencies and the agility of flexible ones, and adjust their practices as they gain experience. We see some uptake of semantic versioning in some package managers, supported by tools. However, there is no evidence that projects switch to semantic versioning on a large scale. The results of this study can guide further research into better practices for automated dependency management, and aid the adoption of semantic versioning.
{"title":"Dependency Versioning in the Wild","authors":"Jens Dietrich, David J. Pearce, Jacob Stringer, Amjed Tahir, Kelly Blincoe","doi":"10.1109/MSR.2019.00061","DOIUrl":"https://doi.org/10.1109/MSR.2019.00061","url":null,"abstract":"Many modern software systems are built on top of existing packages (modules, components, libraries). The increasing number and complexity of dependencies has given rise to automated dependency management where package managers resolve symbolic dependencies against a central repository. When declaring dependencies, developers face various choices, such as whether or not to declare a fixed version or a range of versions. The former results in runtime behaviour that is easier to predict, whilst the latter enables flexibility in resolution that can, for example, prevent different versions of the same package being included and facilitates the automated deployment of bug fixes. We study the choices developers make across 17 different package managers, investigating over 70 million dependencies. This is complemented by a survey of 170 developers. We find that many package managers support — and the respective community adopts — flexible versioning practices. This does not always work: developers struggle to find the sweet spot between the predictability of fixed version dependencies, and the agility of flexible ones, and depending on their experience, adjust practices. We see some uptake of semantic versioning in some package managers, supported by tools. However, there is no evidence that projects switch to semantic versioning on a large scale. 
The results of this study can guide further research into better practices for automated dependency management, and aid the adoption of semantic versioning.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"44 1","pages":"349-359"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76464029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
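The fixed-versus-flexible trade-off the study examines can be made concrete with a range check. A simplified sketch of an npm-style caret range — a toy model, not any package manager's actual resolver (real resolvers also treat 0.x majors specially):

```python
def parse(version: str) -> tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    major, minor, patch = (int(p) for p in version.split("."))
    return major, minor, patch

def satisfies_caret(version: str, base: str) -> bool:
    """'^base': at least `base`, and the same major version.

    A fixed dependency accepts exactly one version; this range accepts
    every backward-compatible release, which is the flexibility (and the
    unpredictability) the abstract describes.
    """
    v, b = parse(version), parse(base)
    return v[0] == b[0] and v >= b
```

Under "^1.2.3", a resolver may pick up 1.4.0 automatically (e.g. a bug fix) but never 2.0.0, which would signal a breaking change under semantic versioning.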
Dimitris Mitropoulos, P. Louridas, Vitalis Salis, D. Spinellis
JavaScript is one of the web's key building blocks. It is used by the majority of web sites and it is supported by all modern browsers. We present the first large-scale study of client-side JavaScript code over time. Specifically, we have collected and analyzed a dataset containing daily snapshots of JavaScript code coming from Alexa's Top 10000 web sites (~7.5 GB per day) for nine consecutive months, to study different temporal aspects of web client code. We found that scripts change often, typically every few days, indicating a rapid pace of web application development. We also found that the lifetime of web sites themselves, measured as the time between JavaScript changes, is also short, on the same time scale. We then performed a qualitative analysis to investigate the nature of the changes that take place. We found that apart from standard changes such as the introduction of new functions, many changes are related to online configuration management. In addition, we examined JavaScript code reuse over time and especially the widespread reliance on third-party libraries. Furthermore, we observed how quality issues evolve by employing established static analysis tools to identify potential software bugs, whose evolution we tracked over time. Our results show that quality issues seem to persist over time, while vulnerable libraries tend to decrease.
{"title":"Time Present and Time Past: Analyzing the Evolution of JavaScript Code in the Wild","authors":"Dimitris Mitropoulos, P. Louridas, Vitalis Salis, D. Spinellis","doi":"10.1109/MSR.2019.00029","DOIUrl":"https://doi.org/10.1109/MSR.2019.00029","url":null,"abstract":"JavaScript is one of the web's key building blocks. It is used by the majority of web sites and it is supported by all modern browsers. We present the first large-scale study of client-side JavaScript code over time. Specifically, we have collected and analyzed a dataset containing daily snapshots of JavaScript code coming from Alexa's Top 10000 web sites (~7.5 GB per day) for nine consecutive months, to study different temporal aspects of web client code. We found that scripts change often; typically every few days, indicating a rapid pace in web applications development. We also found that the lifetime of web sites themselves, measured as the time between JavaScript changes, is also short, in the same time scale. We then performed a qualitative analysis to investigate the nature of the changes that take place. We found that apart from standard changes such as the introduction of new functions, many changes are related to online configuration management. In addition, we examined JavaScript code reuse over time and especially the widespread reliance on third-party libraries. Furthermore, we observed how quality issues evolve by employing established static analysis tools to identify potential software bugs, whose evolution we tracked over time. 
Our results show that quality issues seem to persist over time, while vulnerable libraries tend to decrease.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"74 1","pages":"126-137"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88040738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus
Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.
{"title":"World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data","authors":"Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus","doi":"10.1109/MSR.2019.00031","DOIUrl":"https://doi.org/10.1109/MSR.2019.00031","url":null,"abstract":"Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. 
Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"143-154"},"PeriodicalIF":0.0,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87763930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
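The per-object view WoC provides — 12B+ git objects across all of FLOSS — boils down to enumerating every object in every repository. A minimal single-repository sketch of such a traversal using Git plumbing (WoC itself shards and indexes this at a vastly larger scale); the demo repository is built on the fly:

```python
import pathlib
import subprocess
import tempfile

def object_counts(repo: str) -> dict:
    """Count a repository's Git objects by type (commit/tree/blob/tag)."""
    out = subprocess.run(
        ["git", "cat-file", "--batch-check", "--batch-all-objects"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    counts: dict = {}
    for line in out.splitlines():  # each line: "<sha> <type> <size>"
        _sha, otype, _size = line.split()
        counts[otype] = counts.get(otype, 0) + 1
    return counts

with tempfile.TemporaryDirectory() as d:
    subprocess.run(["git", "init", "-q"], cwd=d, check=True)
    pathlib.Path(d, "f").write_text("hello\n")
    subprocess.run(["git", "add", "f"], cwd=d, check=True)
    subprocess.run(["git", "-c", "user.name=woc", "-c", "user.email=woc@example.com",
                    "commit", "-qm", "init"], cwd=d, check=True)
    counts = object_counts(d)  # one commit, its root tree, and one blob
```

Cross-project analyses like WoC's then relate identical blob hashes across repositories to trace code sharing and dependencies.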