[Research Paper] Untangling Composite Commits Using Program Slicing
Ward Muylaert, Coen De Roover. DOI: 10.1109/SCAM.2018.00030. In: 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM), September 2018.

Abstract: Composite commits are a common mistake in the use of version control software. A composite commit groups many unrelated tasks, rendering the commit difficult for developers to understand, revert, or integrate, and for empirical researchers to analyse. We propose an algorithmic foundation for tool support that identifies such composite commits. Our algorithm computes both a program dependence graph and the changes to the abstract syntax tree for the files changed in a commit. It then groups these fine-grained changes according to the slices through the dependence graph they belong to. To evaluate our technique, we analyse and refine an established dataset of Java commits, the results of which we also make available. We find that our algorithm can determine whether a commit is composite, and for the majority of commits this analysis takes only a few seconds. The parts of a commit that our algorithm identifies do not map directly to the commit's tasks: they tend to be smaller, but stay within their respective tasks.
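The grouping step the abstract describes can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the authors' implementation: the dependence graph is a plain dict, and two changed statements land in the same part of the commit when their backward slices overlap.

```python
def backward_slice(pdg, node):
    """All statements `node` transitively depends on (including itself)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(pdg.get(n, ()))
    return seen

def group_changes(pdg, changed):
    """Merge changed statements whose slices overlap into the same part."""
    slices = {n: backward_slice(pdg, n) for n in changed}
    parts = []  # each part: (set of changed nodes, union of their slices)
    for n in changed:
        overlapping = [p for p in parts if slices[n] & p[1]]
        nodes, union = {n}, set(slices[n])
        for p in overlapping:
            parts.remove(p)
            nodes |= p[0]
            union |= p[1]
        parts.append((nodes, union))
    return [nodes for nodes, _ in parts]

# Toy PDG: each statement maps to the statements it depends on.
pdg = {"b": {"a"}, "c": {"b"}, "e": {"d"}}
print(group_changes(pdg, {"c", "e"}))  # two disjoint parts: a composite commit
```

A commit whose changes fall into a single part would, under this sketch, be considered non-composite.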
[Research Paper] Automatic Detection of Sources and Sinks in Arbitrary Java Libraries
Darius Sas, Marco Bessi, F. Fontana. DOI: 10.1109/SCAM.2018.00019. In: SCAM 2018.

Abstract: In the last decade, data security has become a primary concern for an increasing number of companies around the world. Protecting customers' privacy is now at the core of many businesses operating in any kind of market, and the demand for new technologies to safeguard user data and prevent data breaches has increased accordingly. In this work, we investigate a machine learning-based approach to automatically extract sources and sinks from arbitrary Java libraries. Our method exploits several features based on semantic, syntactic, intra-procedural dataflow, and class-hierarchy traits embedded in the bytecode to distinguish sources and sinks. Our experiments show that, under certain conditions and after some preprocessing, sources and sinks across different libraries share common characteristics that allow a machine learning model to distinguish them from the other library methods. The prototype model achieved 86% accuracy and an 81% F-measure on our validation set of roughly 600 methods.
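As a rough feel for what signature-derived features look like: the paper extracts semantic, syntactic, dataflow, and class-hierarchy traits from bytecode, whereas the verb lists and feature names below are invented for this sketch.

```python
def signature_features(name, return_type, param_types):
    """Boolean features derived from a method signature (illustrative only)."""
    source_verbs = ("get", "read", "load", "receive", "query")
    sink_verbs = ("write", "set", "send", "store", "exec", "log")
    n = name.lower()
    return {
        "starts_source_verb": any(n.startswith(v) for v in source_verbs),
        "starts_sink_verb": any(n.startswith(v) for v in sink_verbs),
        "returns_data": return_type not in ("void", "boolean"),
        "takes_data": len(param_types) > 0,
    }

print(signature_features("readLine", "String", []))
# {'starts_source_verb': True, 'starts_sink_verb': False,
#  'returns_data': True, 'takes_data': False}
```

Vectors like these, one per method, are what a classifier would be trained on.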
[Research Paper] Fine-Grained Model Slicing for Rebel
R. Eilers, Jurriaan Hage, I. Prasetya, Joost Bosman. DOI: 10.1109/SCAM.2018.00035. In: SCAM 2018.

Abstract: In this paper, we apply fine-grained slicing techniques to the models generated from the Rebel modeling language before passing them on to an SMT solver. We show that our slicing techniques have a significant positive effect on performance, allowing us to verify larger problem instances with higher path bounds than with unsliced models. For small and shallow instances, however, the overhead of slicing dominates verification time, and slicing is best skipped.
[Research Paper] Obfuscating Java Programs by Translating Selected Portions of Bytecode to Native Libraries
Davide Pizzolotto, M. Ceccato. DOI: 10.1109/SCAM.2018.00012. In: SCAM 2018.

Abstract: Code obfuscation is a popular approach to make program comprehension and analysis harder, with the aim of mitigating threats related to malicious reverse engineering and code tampering. However, programming languages that compile to high-level bytecode (e.g., Java) can be obfuscated only to a limited extent, because the bytecode still retains high-level information that an attacker might exploit. To enable more resilient obfuscations, parts of these programs might be implemented in programming languages (e.g., C) that compile to low-level, machine-dependent code, which contains and leaks less high-level information. In this paper, we present an approach to automatically translate critical sections of high-level Java bytecode to C code, so that more effective obfuscations can be applied. Moreover, a developer can still work with a single programming language, i.e., Java.
[Research Paper] On the Use of Machine Learning Techniques Towards the Design of Cloud Based Automatic Code Clone Validation Tools
Golam Mostaeen, Jeffrey Svajlenko, B. Roy, C. Roy, Kevin A. Schneider. DOI: 10.1109/SCAM.2018.00025. In: SCAM 2018.

Abstract: A code clone is a pair of similar code fragments within or between software systems. Since code clones often negatively impact the maintainability of a software system, a great number of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, clone detection tools work at the syntax level (e.g., text, tokens, or ASTs) and lack user-specific preferences. This often means the reported clones must be manually validated prior to any analysis in order to filter out true positive clones according to task- or user-specific considerations. This manual validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning-based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with existing approaches to automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms, and open source software systems.
[Engineering Paper] Analyzing the Evolution of Preprocessor-Based Variability: A Tale of a Thousand and One Scripts
Sandro Schulze, W. Fenske. DOI: 10.1109/SCAM.2018.00013. In: SCAM 2018.

Abstract: Highly configurable software systems allow the efficient and reliable development of similar software variants based on a common code base. The C preprocessor (CPP), which uses source code annotations to enable conditional compilation, is a simple yet powerful text-based tool for implementing such systems. However, since annotations interfere with the actual source code, the CPP has often been accused of being a source of errors and increased maintenance effort. In our research, we have been curious about whether high-level patterns of CPP misuse (i.e., code smells) can be identified, how they evolve, and whether they really hinder maintenance. To support this research, we started with a simple tool which over the years evolved into a powerful toolchain. This evolution was possible because our toolchain is not monolithic, but is composed of many small tools connected by scripts and communicating via files. Moreover, we reused existing tools whenever possible and developed our own solutions only as a last resort. In this paper, we report our experiences of building this toolchain. In particular, we present the design decisions we made and the lessons we learned, both positive and negative. We hope that this stimulates discussion and (in the best case) attracts more researchers to use our tools. We also want to encourage others to put emphasis on building tools instead of considering them "yet another research prototype".
[Engineering Paper] Graal: The Quest for Source Code Knowledge
Valerio Cosentino, Santiago Dueñas, Ahmed Zerouali, G. Robles, Jesus M. Gonzalez-Barahona. DOI: 10.1109/SCAM.2018.00021. In: SCAM 2018.

Abstract: Source code analysis tools are designed to analyze code artifacts with different intents, ranging from improving the quality and security of the software to easing refactoring and reverse engineering activities. However, most tools do not come with features to schedule their analyses periodically or to run them on a battery of repositories, and they lack support for combining their results with those of other analysis tools. Thus, researchers and practitioners are often forced to develop ad-hoc scripts to meet their needs. This comes at the risk of obtaining wrong results (because of the lack of testing) and of hindering replication by other research teams. In addition, the resulting scripts are often not meant to be customized, nor are they designed for incrementality, scalability, and extensibility. In this paper we present Graal, which gives users a customizable, scalable, and incremental approach to conducting source code analysis and enables relating the obtained results with other software project data. Graal leverages and extends the functionality of GrimoireLab, a free software tool developed by Bitergia, a company devoted to commercial software development analytics, and part of the CHAOSS project of the Linux Foundation.
[Research Paper] The Case for Adaptive Change Recommendation
Sydney Pugh, D. Binkley, L. Moonen. DOI: 10.1109/SCAM.2018.00022. In: SCAM 2018.

Abstract: As the complexity of a software system grows, it becomes increasingly difficult for developers to be aware of all the dependencies that exist between artifacts (e.g., files or methods) of the system. Change impact analysis helps to overcome this problem, as it recommends to a developer relevant source-code artifacts related to her current changes. Association rule mining has shown promise in determining change impact by uncovering relevant patterns in the system's change history. State-of-the-art change impact mining algorithms typically make use of a change history of tens of thousands of transactions. For efficiency, targeted association rule mining focuses on only those transactions potentially relevant to answering a particular query; however, even targeted algorithms must consider the complete set of relevant transactions in the history. This paper presents ATARI, a new adaptive approach to association rule mining that considers a dynamic selection of the relevant transactions. It can be viewed as a further constrained version of targeted association rule mining, in which as few as a single transaction might be considered when determining change impact. Our investigation of adaptive change impact mining empirically studies seven algorithm variants. We show that adaptive algorithms are viable, can be just as applicable as the state-of-the-art complete-history algorithms, and even outperform them for certain queries. More important than the direct comparison, however, our investigation lays the necessary groundwork for the future study of adaptive techniques and their application to challenges such as the on-the-fly style of impact analysis needed at GitHub scale.
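The adaptive idea, scanning the history from the most recent commit backwards and stopping after as few as one relevant transaction, can be sketched as follows. This is a toy reconstruction, not the ATARI algorithm; `max_txns` and the confidence threshold are illustrative parameters.

```python
from collections import Counter

def impacted(history, query, max_txns=1, min_conf=0.5):
    """Recommend artifacts that co-changed with `query`, scanning the
    history from most recent to oldest and stopping after `max_txns`
    relevant transactions (the adaptive idea: as few as one)."""
    co, relevant = Counter(), 0
    for txn in reversed(history):
        if query in txn:
            relevant += 1
            co.update(a for a in txn if a != query)
            if relevant >= max_txns:
                break
    return [a for a, c in co.items() if c / relevant >= min_conf]

# Oldest first; each transaction is the set of files changed by one commit.
history = [
    {"a.c", "b.c"},
    {"a.c", "b.c", "c.c"},
    {"a.c", "d.c"},
]
print(impacted(history, "a.c", max_txns=1))  # ['d.c']
```

Raising `max_txns` shifts the result toward what a complete-history miner would recommend, which is the applicability trade-off the paper studies.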
[Engineering Paper] SCC: Automatic Classification of Code Snippets
Kamel Alreshedy, Dhanush Dharmaretnam, D. Germán, Venkatesh Srinivasan, T. Gulliver. DOI: 10.1109/SCAM.2018.00031. In: SCAM 2018.

Abstract: Determining the programming language of a source code file has been studied in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. A Multinomial Naive Bayes (MNB) classifier is employed, trained on Stack Overflow posts. It achieves an accuracy of 75%, which is higher than that of Programming Languages Identification (PLI), a proprietary online classifier of snippets, whose accuracy is only 55.5%. The average precision, recall, and F1 scores of the proposed tool are 0.76, 0.75, and 0.75, respectively. In addition, it can distinguish between code snippets from a family of programming languages such as C, C++, and C#, and can also identify the programming language version, such as C# 3.0, C# 4.0, and C# 5.0.
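A Multinomial Naive Bayes snippet classifier in miniature, to show the idea: this hand-rolled version with a crude tokenizer and three one-line training snippets stands in for SCC's Stack-Overflow-trained model.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(code):
    """Crude lexer: identifier-like runs, or single punctuation characters."""
    return re.findall(r"[A-Za-z_]+|[^\sA-Za-z_]", code)

class SnippetNB:
    """Multinomial Naive Bayes over code tokens, with Laplace smoothing."""
    def fit(self, snippets, labels):
        self.tok = defaultdict(Counter)   # language -> token frequencies
        self.docs = Counter(labels)       # language -> training snippet count
        for code, lang in zip(snippets, labels):
            self.tok[lang].update(tokens(code))
        self.vocab = len({t for c in self.tok.values() for t in c})
        return self

    def predict(self, code):
        def log_posterior(lang):
            counts, total = self.tok[lang], sum(self.tok[lang].values())
            prior = math.log(self.docs[lang] / sum(self.docs.values()))
            return prior + sum(
                math.log((counts[t] + 1) / (total + self.vocab))
                for t in tokens(code))
        return max(self.tok, key=log_posterior)

clf = SnippetNB().fit(
    ['System.out.println(x);', 'def f(x): return x', 'printf("%d", x);'],
    ["Java", "Python", "C"],
)
print(clf.predict("System.out.println(42);"))  # Java
```

In practice one would use a library implementation (e.g., scikit-learn's MultinomialNB over token counts) and far more training data; the paper does not publish its exact feature pipeline, so the tokenizer above is a guess.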
[Research Paper] Automatic Checking of Regular Expressions
E. Larson. DOI: 10.1109/SCAM.2018.00034. In: SCAM 2018.

Abstract: Regular expressions are extensively used to process strings. The regular expression language is concise, which makes it easy for developers to use but also easy for them to make mistakes. Since regular expressions are compiled at run time, the regular expression compiler does not give any feedback on potential errors. This paper describes ACRE (Automatic Checking of Regular Expressions). ACRE takes a regular expression as input and performs 11 different checks on it, based on common mistakes. Among the checks are checks for incorrect use of character sets (enclosed by []), wildcards (represented by .), and line anchors (^ and $). ACRE has found errors in 283 out of 826 regular expressions, and each of the 11 checks found at least seven errors. The number of false reports is moderate: 46 of the regular expressions contained a false report. ACRE is simple to use: the user enters a regular expression and presses the check button. Any violations are reported back to the user with the incorrect portion of the regular expression highlighted. For 9 of the 11 checks, an example accepted string is generated that further illustrates the error.
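Two checks in the spirit of ACRE, to make the kind of rule concrete; these are illustrative guesses, not the paper's actual 11 checks.

```python
import re

def check_regex(pattern):
    """Lint a regular expression for two common mistakes."""
    problems = []
    # Check 1: the range [A-z] also matches [, \, ], ^, _ and backquote,
    # because those characters sit between 'Z' and 'a' in ASCII.
    if re.search(r"\[[^\]]*A-z", pattern):
        problems.append("character set uses the range A-z, which also "
                        "matches punctuation; did you mean A-Za-z?")
    # Check 2: an unescaped ^ in the middle of a pattern can never match
    # (outside multiline mode) unless it follows (, [, | or a backslash.
    for m in re.finditer(r"\^", pattern):
        i = m.start()
        if i > 0 and pattern[i - 1] not in "([|\\":
            problems.append(f"anchor ^ at index {i} is mid-pattern")
    return problems

print(check_regex("[A-z]+@host"))   # flags the A-z range
print(check_regex("foo^bar"))       # flags the misplaced anchor
print(check_regex("^[A-Za-z]+$"))   # []
```

ACRE goes further than such pattern matching (it highlights the offending portion and, for most checks, generates an example accepted string), but each check boils down to a rule of this shape.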