
2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR): Latest Publications

Characterizing the Roles of Contributors in Open-Source Scientific Software Projects
Reed Milewicz, G. Pinto, Paige Rodeghero
The development of scientific software is, more than ever, critical to the practice of science, and this is accompanied by a trend towards more open and collaborative efforts. Unfortunately, there has been little investigation into who is driving the evolution of such scientific software or how the collaboration happens. In this paper, we address this problem. We present an extensive analysis of seven open-source scientific software projects in order to develop an empirically-informed model of the development process. This analysis was complemented by a survey of 72 scientific software developers. In the majority of the projects, we found senior research staff (e.g. professors) to be responsible for half or more of commits (an average commit share of 72%) and heavily involved in architectural concerns (seniors were more likely to interact with files related to the build system, project meta-data, and developer documentation). Juniors (e.g. graduate students) also contribute substantially; in one studied project, juniors made almost 100% of its commits. Still, graduate students had the longest contribution periods among juniors (with 1.72 years of commit activity compared to 0.98 years for postdocs and 4 months for undergraduates). Moreover, we also found that third-party contributors are scarce, contributing to a project for just one day. The results from this study aim to help scientists better understand their own projects, communities, and the contributors' behavior, while paving the way for future software engineering research.
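As a rough sketch of the kind of repository mining behind these commit-share figures, the snippet below tallies commits per contributor role from a local Git history. The email-to-role mapping is a hypothetical stand-in for however roles are actually identified; this is not the authors' tooling.

```python
# Minimal sketch: estimate per-role commit share for a local Git repository.
# The ROLE_BY_EMAIL mapping is hypothetical example data.
import subprocess
from collections import Counter

ROLE_BY_EMAIL = {
    "prof@university.edu": "senior",
    "grad.student@university.edu": "junior",
    "random.user@gmail.com": "third-party",
}

def commit_share_by_role(repo_path: str) -> dict:
    # one author email per commit, oldest to newest
    emails = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    commits_per_role = Counter(ROLE_BY_EMAIL.get(e.strip(), "unknown") for e in emails)
    total = sum(commits_per_role.values()) or 1
    return {role: n / total for role, n in commits_per_role.items()}

if __name__ == "__main__":
    print(commit_share_by_role("."))
```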
{"title":"Characterizing the Roles of Contributors in Open-Source Scientific Software Projects","authors":"Reed Milewicz, G. Pinto, Paige Rodeghero","doi":"10.1109/MSR.2019.00069","DOIUrl":"https://doi.org/10.1109/MSR.2019.00069","url":null,"abstract":"The development of scientific software is, more than ever, critical to the practice of science, and this is accompanied by a trend towards more open and collaborative efforts. Unfortunately, there has been little investigation into who is driving the evolution of such scientific software or how the collaboration happens. In this paper, we address this problem. We present an extensive analysis of seven open-source scientific software projects in order to develop an empirically-informed model of the development process. This analysis was complemented by a survey of 72 scientific software developers. In the majority of the projects, we found senior research staff (e.g. professors) to be responsible for half or more of commits (an average commit share of 72%) and heavily involved in architectural concerns (seniors were more likely to interact with files related to the build system, project meta-data, and developer documentation). Juniors (e.g. graduate students) also contribute substantially — in one studied project, juniors made almost 100% of its commits. Still, graduate students had the longest contribution periods among juniors (with 1.72 years of commit activity compared to 0.98 years for postdocs and 4 months for undergraduates). Moreover, we also found that third-party contributors are scarce, contributing for just one day for the project. The results from this study aim to help scientists to better understand their own projects, communities, and the contributors’ behavior, while paving the road for future software engineering research.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"29 1","pages":"421-432"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73415259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
DeepJIT: An End-to-End Deep Learning Framework for Just-in-Time Defect Prediction
Thong Hoang, K. Dam, Yasutaka Kamei, D. Lo, Naoyasu Ubayashi
Software quality assurance efforts often focus on identifying defective code. To find likely defective code early, change-level defect prediction, also known as Just-In-Time (JIT) defect prediction, has been proposed. JIT defect prediction models identify likely defective changes and are trained using machine learning techniques under the assumption that historical changes are similar to future ones. Most existing JIT defect prediction approaches make use of manually engineered features. Unlike those approaches, in this paper, we propose an end-to-end deep learning framework, named DeepJIT, that automatically extracts features from commit messages and code changes and uses them to identify defects. Experiments on two popular software projects (i.e., QT and OPENSTACK) on three evaluation settings (i.e., cross-validation, short-period, and long-period) show that the best variant of DeepJIT (DeepJIT-Combined), compared with the best performing state-of-the-art approach, achieves improvements of 10.36-11.02% for the project QT and 9.51-13.69% for the project OPENSTACK in terms of the Area Under the Curve (AUC).
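As a rough illustration of the end-to-end idea (learning a defect score directly from commit-message and code-change tokens rather than hand-crafted features), here is a minimal PyTorch sketch. It is not the published DeepJIT architecture; the encoders, dimensions, and toy usage are illustrative assumptions.

```python
# Simplified two-branch "commit message + code change" classifier in PyTorch.
# Illustrates the end-to-end idea only; not the DeepJIT architecture itself.
import torch
import torch.nn as nn

class JITDefectModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.msg_enc = nn.GRU(embed_dim, hidden, batch_first=True)
        self.code_enc = nn.GRU(embed_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, msg_ids, code_ids):
        _, h_msg = self.msg_enc(self.embed(msg_ids))     # encode commit-message tokens
        _, h_code = self.code_enc(self.embed(code_ids))  # encode code-change tokens
        features = torch.cat([h_msg[-1], h_code[-1]], dim=-1)
        return torch.sigmoid(self.classifier(features)).squeeze(-1)  # defect probability

# toy usage: a batch of 2 commits, 10 message tokens and 20 code tokens each
model = JITDefectModel(vocab_size=5000)
p = model(torch.randint(1, 5000, (2, 10)), torch.randint(1, 5000, (2, 20)))
print(p)
```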
{"title":"DeepJIT: An End-to-End Deep Learning Framework for Just-in-Time Defect Prediction","authors":"Thong Hoang, K. Dam, Yasutaka Kamei, D. Lo, Naoyasu Ubayashi","doi":"10.1109/MSR.2019.00016","DOIUrl":"https://doi.org/10.1109/MSR.2019.00016","url":null,"abstract":"Software quality assurance efforts often focus on identifying defective code. To find likely defective code early, change-level defect prediction – aka. Just-In-Time (JIT) defect prediction – has been proposed. JIT defect prediction models identify likely defective changes and they are trained using machine learning techniques with the assumption that historical changes are similar to future ones. Most existing JIT defect prediction approaches make use of manually engineered features. Unlike those approaches, in this paper, we propose an end-to-end deep learning framework, named DeepJIT, that automatically extracts features from commit messages and code changes and use them to identify defects. Experiments on two popular software projects (i.e., QT and OPENSTACK) on three evaluation settings (i.e., cross-validation, short-period, and long-period) show that the best variant of DeepJIT (DeepJIT-Combined), compared with the best performing state-of-the-art approach, achieves improvements of 10.36-11.02% for the project QT and 9.51-13.69% for the project OPENSTACK in terms of the Area Under the Curve (AUC).","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"34-45"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79930073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 134
Message from the Chairs of MSR 2019
{"title":"Message from the Chairs of MSR 2019","authors":"","doi":"10.1109/msr.2019.00006","DOIUrl":"https://doi.org/10.1109/msr.2019.00006","url":null,"abstract":"","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90548913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Tracing Back Log Data to its Log Statement: From Research to Practice
Daan Schipper, M. Aniche, A. Deursen
Logs are widely used as a source of information to understand the activity of computer systems and to monitor their health and stability. However, most log analysis techniques require a link between the log messages in the raw log file and the log statements in the source code that produce them. Several solutions have been proposed to solve this non-trivial challenge, of which the approach based on static analysis reaches the highest accuracy. We, at Adyen, implemented the state-of-the-art research on log parsing in our logging environment and evaluated its accuracy and performance. Our results show that, with some adaptation, the current static analysis techniques are highly efficient and performant. In other words, they are ready for use.
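A minimal sketch of the matching step this line of work needs: log statement templates recovered from the source (hypothetical ones below) are compiled to regexes, and each raw log line is traced back to the statement that could have produced it. The static-analysis-based template extraction the paper relies on is not shown.

```python
# Map raw log lines back to the (hypothetical) logging statements that produced them.
import re

TEMPLATES = {  # statement id -> format string as it would appear in the source
    "PaymentService.java:42": "payment {} authorised for account {}",
    "PaymentService.java:57": "refund {} rejected: {}",
}

def template_to_regex(template: str) -> re.Pattern:
    # escape literal parts, replace each "{}" placeholder with a capture group
    parts = [re.escape(p) for p in template.split("{}")]
    return re.compile("^" + "(.+?)".join(parts) + "$")

MATCHERS = {sid: template_to_regex(t) for sid, t in TEMPLATES.items()}

def trace(log_line: str):
    for sid, pattern in MATCHERS.items():
        m = pattern.match(log_line)
        if m:
            return sid, m.groups()   # originating statement + recovered arguments
    return None, ()

print(trace("payment 12345 authorised for account A-77"))
```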
{"title":"Tracing Back Log Data to its Log Statement: From Research to Practice","authors":"Daan Schipper, M. Aniche, A. Deursen","doi":"10.1109/MSR.2019.00081","DOIUrl":"https://doi.org/10.1109/MSR.2019.00081","url":null,"abstract":"Logs are widely used as a source of information to understand the activity of computer systems and to monitor their health and stability. However, most log analysis techniques require the link between the log messages in the raw log file and the log statements in the source code that produce them. Several solutions have been proposed to solve this non-trivial challenge, of which the approach based on static analysis reaches the highest accuracy. We, at Adyen, implemented the state-of-the-art research on log parsing in our logging environment and evaluated their accuracy and performance. Our results show that, with some adaptation, the current static analysis techniques are highly efficient and performant. In other words, ready for use.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"233 1","pages":"545-549"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73488058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Negative Results on Mining Crypto-API Usage Rules in Android Apps
Jun Gao, Pingfan Kong, Li Li, Tegawendé F. Bissyandé, Jacques Klein
Android app developers recurrently use crypto-APIs to provide data security to app users. Unfortunately, misuse of APIs only creates an illusion of security and even exposes apps to systematic attacks. It is thus necessary to provide developers with a statically-enforceable list of specifications of crypto-API usage rules. On the one hand, such rules cannot be written manually, as the process does not scale to all available APIs. On the other hand, a classical mining approach based on common usage patterns is not relevant in Android, given that a large share of usages include mistakes. In this work, building on the assumption that "developers update API usage instances to fix misuses", we propose to mine a large dataset of updates within about 40 000 real-world app lineages to infer API usage rules. Eventually, our investigations yield negative results on our assumption that API usage updates tend to correct misuses. Actually, it appears that updates that fix misuses may be unintentional: the same misuse patterns are quickly re-introduced by subsequent updates.
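The mining idea can be pictured as a set difference over crypto-API call sites along an app's release lineage: usages that disappear in an update are candidate misuse fixes. The sketch below uses hypothetical usage signatures and plain Python sets; extracting call sites from APKs is out of scope here. The toy lineage also shows the re-introduction effect reported in the paper.

```python
# Diff crypto-API usage sets across consecutive releases of one app lineage.
from typing import List, Set, Tuple

def candidate_fixes(lineage: List[Set[str]]) -> List[Tuple[int, Set[str]]]:
    """lineage[i] = set of crypto-API usage signatures observed in release i."""
    fixes = []
    for i in range(1, len(lineage)):
        removed = lineage[i - 1] - lineage[i]   # usages dropped by this update
        if removed:
            fixes.append((i, removed))
    return fixes

# toy lineage: the ECB-mode usage is dropped in release 1 but re-appears in release 2,
# the kind of churn that undermines the "updates fix misuses" assumption
lineage = [
    {'Cipher.getInstance("AES/ECB/PKCS5Padding")', 'MessageDigest.getInstance("SHA-256")'},
    {'Cipher.getInstance("AES/GCM/NoPadding")', 'MessageDigest.getInstance("SHA-256")'},
    {'Cipher.getInstance("AES/ECB/PKCS5Padding")', 'MessageDigest.getInstance("SHA-256")'},
]
print(candidate_fixes(lineage))
```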
{"title":"Negative Results on Mining Crypto-API Usage Rules in Android Apps","authors":"Jun Gao, Pingfan Kong, Li Li, Tegawendé F. Bissyandé, Jacques Klein","doi":"10.1109/MSR.2019.00065","DOIUrl":"https://doi.org/10.1109/MSR.2019.00065","url":null,"abstract":"Android app developers recurrently use crypto-APIs to provide data security to app users. Unfortunately, misuse of APIs only creates an illusion of security and even exposes apps to systematic attacks. It is thus necessary to provide developers with a statically-enforceable list of specifications of crypto-API usage rules. On the one hand, such rules cannot be manually written as the process does not scale to all available APIs. On the other hand, a classical mining approach based on common usage patterns is not relevant in Android, given that a large share of usages include mistakes. In this work, building on the assumption that \"developers update API usage instances to fix misuses\", we propose to mine a large dataset of updates within about 40 000 real-world app lineages to infer API usage rules. Eventually, our investigations yield negative results on our assumption that API usage updates tend to correct misuses. Actually, it appears that updates that fix misuses may be unintentional: the same misuses patterns are quickly re-introduced by subsequent updates.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"34 1","pages":"388-398"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81289009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
Cross-Language Clone Detection by Learning Over Abstract Syntax Trees
Daniel Perez, S. Chiba
Clone detection across programs written in the same programming language has been studied extensively in the literature. On the contrary, the task of detecting clones across multiple programming languages has not been studied as much, and approaches based on comparison cannot be directly applied. In this paper, we present a clone detection method based on semi-supervised machine learning designed to detect clones across programming languages with similar syntax. Our method uses an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones. To train our network, we present a cross-language code clone dataset - which is to the best of our knowledge the first of its kind - containing around 45,000 code fragments written in Java and Python. We evaluate our approach on the dataset we created and show that our method gives promising results when detecting similarities between code fragments written in Java and Python.
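A minimal sketch, under assumed dimensions and a shared cross-language token vocabulary, of an LSTM-based siamese scorer of the kind the abstract describes: each fragment is encoded to a vector and the pair is scored by cosine similarity. It is not the paper's exact network.

```python
# Siamese LSTM over a shared token vocabulary; similarity of the two encodings
# serves as the clone score. Dimensions and scoring rule are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCloneDetector(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True)

    def encode(self, token_ids):
        _, (h, _) = self.encoder(self.embed(token_ids))
        return h[-1]                                # last hidden state as the fragment vector

    def forward(self, frag_a_ids, frag_b_ids):
        return F.cosine_similarity(self.encode(frag_a_ids), self.encode(frag_b_ids))

# toy usage: 4 Java/Python fragment pairs of different token lengths
model = SiameseCloneDetector(vocab_size=20000)
similarity = model(torch.randint(1, 20000, (4, 50)), torch.randint(1, 20000, (4, 60)))
print(similarity)   # values near 1.0 would indicate likely clones
```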
{"title":"Cross-Language Clone Detection by Learning Over Abstract Syntax Trees","authors":"Daniel Perez, S. Chiba","doi":"10.1109/MSR.2019.00078","DOIUrl":"https://doi.org/10.1109/MSR.2019.00078","url":null,"abstract":"Clone detection across programs written in the same programming language has been studied extensively in the literature. On the contrary, the task of detecting clones across multiple programming languages has not been studied as much, and approaches based on comparison cannot be directly applied. In this paper, we present a clone detection method based on semi-supervised machine learning designed to detect clones across programming languages with similar syntax. Our method uses an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones. To train our network, we present a cross-language code clone dataset - which is to the best of our knowledge the first of its kind - containing around 45,000 code fragments written in Java and Python. We evaluate our approach on the dataset we created and show that our method gives promising results when detecting similarities between code fragments written in Java and Python.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"21 1","pages":"518-528"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79328748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
Generating Commit Messages from Diffs using Pointer-Generator Network
Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, Yu Qian
The commit messages in source code repositories are valuable but not easy to generate manually in time for tracking issues, reporting bugs, and understanding code. Recently published works indicate that deep neural machine translation approaches have drawn considerable attention for the automatic generation of commit messages. However, they cannot deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach based on an improved sequence-to-sequence model with a pointer-generator network that translates code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation and is the first to enable the prediction of OOV words. Experimental results based on a corpus of diffs and manual commit messages from the top 2,000 Java projects on GitHub show that PtrGNCMsg outperforms the state-of-the-art approach, improving BLEU by 1.02, ROUGE-1 by 4.00, and ROUGE-L by 3.78, respectively.
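For reference, the copy mechanism is what lets context-specific identifiers in the diff reach the output even when they are out of vocabulary. The standard pointer-generator mixture (following See et al., 2017) is shown below; PtrGNCMsg's exact parameterization may differ.

```latex
% Standard pointer-generator output distribution at decoding step t:
% generation from the vocabulary is mixed with copying source tokens x_i
% via the attention distribution a^t. Here h_t^* is the attention context
% vector, s_t the decoder state, x_t the decoder input, and sigma the sigmoid.
P(w) \;=\; p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)
      \;+\; \bigl(1 - p_{\mathrm{gen}}\bigr) \sum_{i\,:\,x_i = w} a_i^{t},
\qquad
p_{\mathrm{gen}} \;=\; \sigma\!\left(w_{h}^{\top} h_t^{*} + w_{s}^{\top} s_t + w_{x}^{\top} x_t + b_{\mathrm{ptr}}\right)
```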
{"title":"Generating Commit Messages from Diffs using Pointer-Generator Network","authors":"Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, Yu Qian","doi":"10.1109/MSR.2019.00056","DOIUrl":"https://doi.org/10.1109/MSR.2019.00056","url":null,"abstract":"The commit messages in source code repositories are valuable but not easy to be generated manually in time for tracking issues, reporting bugs, and understanding codes. Recently published works indicated that the deep neural machine translation approaches have drawn considerable attentions on automatic generation of commit messages. However, they could not deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words. The experimental results based on the corpus of diffs and manual commit messages from the top 2,000 Java projects in GitHub show that PtrGNCMsg outperforms the state-of-the-art approach with improved BLEU by 1.02, ROUGE-1 by 4.00 and ROUGE-L by 3.78, respectively.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"41 1","pages":"299-309"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83652347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
Semantic Source Code Models Using Identifier Embeddings
V. Efstathiou, D. Spinellis
The emergence of online open source repositories in recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As a result, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver, in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13,000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss their limitations.
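A small end-to-end example of the pipeline shape (tokenized identifiers in, subword-aware vectors out), using gensim's FastText implementation on a toy corpus. The released models were trained with the fastText library on far larger data; the corpus and parameters below are illustrative only.

```python
# Train tiny identifier embeddings with gensim's FastText (skip-gram).
from gensim.models import FastText

corpus = [  # each "sentence" is one tokenized source file / code unit
    ["read", "file", "open", "path", "buffer", "close"],
    ["http", "request", "get", "response", "status", "json"],
    ["parse", "json", "response", "decode", "error"],
]

model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# subword n-grams let fastText produce vectors even for unseen identifiers
print(model.wv.most_similar("json", topn=3))
print(model.wv["jsonify"][:5])   # out-of-vocabulary token still gets a vector
```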
{"title":"Semantic Source Code Models Using Identifier Embeddings","authors":"V. Efstathiou, D. Spinellis","doi":"10.1109/MSR.2019.00015","DOIUrl":"https://doi.org/10.1109/MSR.2019.00015","url":null,"abstract":"The emergence of online open source repositories in the recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data–driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state–of–the–art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13.000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions in between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss limitations of the models.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"95 1","pages":"29-33"},"PeriodicalIF":0.0,"publicationDate":"2019-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81807736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
The Impact of Systematic Edits in History Slicing
Ryosuke Funaki, Shinpei Hayashi, M. Saeki
While extracting a subset of a commit history, specifying the necessary portion is a time-consuming task for developers. Several commit-based history slicing techniques have been proposed to identify dependencies between commits and to extract a related set of commits using a specific commit as a slicing criterion. However, the resulting subset of commits becomes large if there are commits for systematic edits whose changes do not depend on each other. We empirically investigated the impact of systematic edits on history slicing. In this study, commits in which systematic edits were detected are split per file so that unnecessary dependencies between commits are eliminated. In several histories of open source systems, the size of history slices was reduced by 13.3-57.2% on average after splitting the commits for systematic edits.
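A minimal sketch of commit-based history slicing as a transitive closure over commit dependencies, and of why splitting a systematic-edit commit per file shrinks the slice. The dependency graphs are toy data, not the paper's extraction method.

```python
# History slice = criterion commit plus all commits it transitively depends on.
from collections import deque
from typing import Dict, Set

def history_slice(deps: Dict[str, Set[str]], criterion: str) -> Set[str]:
    slice_, queue = {criterion}, deque([criterion])
    while queue:
        for parent in deps.get(queue.popleft(), set()):
            if parent not in slice_:
                slice_.add(parent)
                queue.append(parent)
    return slice_

# toy history: c3 touches files A and B only because of a systematic edit,
# so slicing from c4 (which only needs A) drags in B's history via c3
deps = {"c4": {"c3"}, "c3": {"c1", "c2"}, "c2": set(), "c1": set()}
print(sorted(history_slice(deps, "c4")))          # ['c1', 'c2', 'c3', 'c4']

# after splitting c3 into per-file commits c3a (file A) and c3b (file B)
deps_split = {"c4": {"c3a"}, "c3a": {"c1"}, "c3b": {"c2"}, "c1": set(), "c2": set()}
print(sorted(history_slice(deps_split, "c4")))    # ['c1', 'c3a', 'c4']
```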
{"title":"The Impact of Systematic Edits in History Slicing","authors":"Ryosuke Funaki, Shinpei Hayashi, M. Saeki","doi":"10.1109/MSR.2019.00083","DOIUrl":"https://doi.org/10.1109/MSR.2019.00083","url":null,"abstract":"While extracting a subset of a commit history, specifying the necessary portion is a time-consuming task for developers. Several commit-based history slicing techniques have been proposed to identify dependencies between commits and to extract a related set of commits using a specific commit as a slicing criterion. However, the resulting subset of commits become large if commits for systematic edits whose changes do not depend on each other exist. We empirically investigated the impact of systematic edits on history slicing. In this study, commits in which systematic edits were detected are split between each file so that unnecessary dependencies between commits are eliminated. In several histories of open source systems, the size of history slices was reduced by 13.3-57.2% on average after splitting the commits for systematic edits.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"210 1","pages":"555-559"},"PeriodicalIF":0.0,"publicationDate":"2019-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76107279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Style-Analyzer: Fixing Code Style Inconsistencies with Interpretable Unsupervised Algorithms
Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, Egor Bulychev
Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts, hence, their configuration is often tedious, and it is impractical for the given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces style-analyzer, a new open source tool to automatically fix code formatting violations using the decision tree forest model which adapts to each codebase and is fully unsupervised. style-analyzer is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the found format patterns with compact human-readable rules. style-analyzer can then suggest style inconsistency fixes in the form of code review comments. We evaluate the output quality and practical relevance of style-analyzer by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. style-analyzer includes a web application to visualize how the rules are triggered. We release style-analyzer as a reusable and extendable open source software package on GitHub for the benefit of the community.
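A toy illustration of the underlying idea (learning a repository's formatting convention from local token context and flagging disagreements), using a single scikit-learn decision tree. style-analyzer's real feature extraction, forest model, and rule compaction are far richer; the feature encoding below is hypothetical.

```python
# Learn "should a space follow this token?" from local token context,
# then flag positions whose observed formatting disagrees with the prediction.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: (previous token class, next token class) encoded as small integers
# label: 1 if a space should follow the previous token, 0 otherwise
X = [[0, 1], [0, 1], [0, 1], [2, 1], [2, 1], [1, 3], [1, 3]]
y = [1, 1, 1, 0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["prev_class", "next_class"]))  # human-readable rules

# a candidate position that violates the learned convention
observed_space, predicted_space = 1, int(tree.predict([[2, 1]])[0])
if observed_space != predicted_space:
    print("style inconsistency: suggest removing the space here")
```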
{"title":"Style-Analyzer: Fixing Code Style Inconsistencies with Interpretable Unsupervised Algorithms","authors":"Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, Egor Bulychev","doi":"10.1109/MSR.2019.00073","DOIUrl":"https://doi.org/10.1109/MSR.2019.00073","url":null,"abstract":"Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts, hence, their configuration is often tedious, and it is impractical for the given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces style-analyzer, a new open source tool to automatically fix code formatting violations using the decision tree forest model which adapts to each codebase and is fully unsupervised. style-analyzer is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the found format patterns with compact human-readable rules. style-analyzer can then suggest style inconsistency fixes in the form of code review comments. We evaluate the output quality and practical relevance of style-analyzer by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. style-analyzer includes a web application to visualize how the rules are triggered. We release style-analyzer as a reusable and extendable open source software package on GitHub for the benefit of the community.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"87 1","pages":"468-478"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82059032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9