Mining source code to automatically split identifiers for software analysis

2009 6th IEEE International Working Conference on Mining Software Repositories. Pub Date: 2009-05-16. DOI: 10.1109/MSR.2009.5069482

Eric Enslen, Emily Hill, L. Pollock, K. Vijay-Shanker


Citations: 193

Abstract

Automated software engineering tools (e.g., program search, concern location, code reuse, and quality assessment) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers is splitting the identifiers into their constituent words. Unlike natural-language text, where spaces and punctuation delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm that automatically splits identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8,000 identifiers from open source Java programs, our Samurai approach outperforms the existing state-of-the-art techniques.
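The general idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Samurai algorithm itself: the abstract does not give the scoring function, so the scoring rule here (a partition scores as the frequency of its rarest part, and a token is split only when that beats the frequency of the unsplit token) is an assumption made for illustration. All function names and the tiny corpus are hypothetical as well.

```python
import re
from collections import Counter
from functools import lru_cache


def split_on_conventions(identifier):
    """Split on non-letter characters and on lower-to-upper camel-case boundaries."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z]+", identifier):
        # Put a space wherever a lowercase letter is followed by an uppercase one.
        tokens.extend(re.sub(r"([a-z])([A-Z])", r"\1 \2", chunk).split())
    return tokens


def build_frequencies(identifiers):
    """Mine word frequencies from the convention-based splits of a corpus."""
    return Counter(
        tok.lower() for ident in identifiers for tok in split_on_conventions(ident)
    )


def split_same_case(token, freq):
    """Pick the partition of a token with the best score.

    Assumed scoring (illustrative only): a partition scores as the lowest
    frequency among its parts; the unsplit token scores as its own frequency.
    Ties keep the token unsplit.
    """

    @lru_cache(maxsize=None)
    def best(start):
        rest = token[start:]
        best_score, best_parts = freq.get(rest.lower(), 0), (rest,)
        for cut in range(start + 1, len(token)):
            left = token[start:cut]
            left_score = freq.get(left.lower(), 0)
            if left_score <= best_score:
                continue  # this prefix cannot improve on the current best
            right_score, right_parts = best(cut)
            score = min(left_score, right_score)
            if score > best_score:
                best_score, best_parts = score, (left,) + right_parts
        return best_score, best_parts

    return list(best(0)[1])


def split_identifier(identifier, freq):
    """Apply the convention split first, then the frequency-based split."""
    words = []
    for token in split_on_conventions(identifier):
        words.extend(split_same_case(token, freq))
    return words


# Hypothetical corpus standing in for identifiers mined from source code.
corpus = ["getString", "strLen", "maxLen", "str_copy", "getLength"]
freq = build_frequencies(corpus)
print(split_identifier("getMaxstrlen", freq))  # ['get', 'Max', 'str', 'len']
```

Camel case alone yields only `get` and `Maxstrlen` here; the mined frequencies are what separate `Max`, `str`, and `len`. A scoring rule this simple will over-split words that happen to be made of frequent shorter words, which is the kind of case a more careful scoring technique has to handle.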



