{"title":"Natural Software Revisited","authors":"Musfiqur Rahman, Dharani Palani, Peter C. Rigby","doi":"10.1109/ICSE.2019.00022","DOIUrl":null,"url":null,"abstract":"Recent works have concluded that software code is more repetitive and predictable, i.e. more natural, than English texts. On re-examination, we find that much of the apparent \"naturalness\" of source code is due to the presence of language specific syntax, especially separators, such as semi-colons and brackets. For example, separators account for 44% of all tokens in our Java corpus. When we follow the NLP practices of eliminating punctuation (e.g., separators) and stopwords (e.g., keywords), we find that code is still repetitive and predictable, but to a lesser degree than previously thought. We suggest that SyntaxTokens be filtered to reduce noise in code recommenders. Unlike the code written for a particular project, API code usage is similar across projects: a file is opened and closed in the same manner regardless of domain. When we restrict our n-grams to those contained in the Java API, we find that API usages are highly repetitive. Since API calls are common across programs, researchers have made reliable statistical models to recommend sophisticated API call sequences. Sequential n-gram models were developed for natural languages. Code is usually represented by an AST which contains control and data flow, making n-grams models a poor representation of code. Comparing n-grams to statistical graph representations of the same codebase, we find that graphs are more repetitive and contain higherlevel patterns than n-grams. We suggest that future work focus on statistical code graphs models that accurately capture complex coding patterns. Our replication package makes our scripts and data available to future researchers[1].","PeriodicalId":6736,"journal":{"name":"2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)","volume":"22 1","pages":"37-48"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE.2019.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 34
Abstract
Recent works have concluded that software code is more repetitive and predictable, i.e. more natural, than English texts. On re-examination, we find that much of the apparent "naturalness" of source code is due to the presence of language specific syntax, especially separators, such as semi-colons and brackets. For example, separators account for 44% of all tokens in our Java corpus. When we follow the NLP practices of eliminating punctuation (e.g., separators) and stopwords (e.g., keywords), we find that code is still repetitive and predictable, but to a lesser degree than previously thought. We suggest that SyntaxTokens be filtered to reduce noise in code recommenders. Unlike the code written for a particular project, API code usage is similar across projects: a file is opened and closed in the same manner regardless of domain. When we restrict our n-grams to those contained in the Java API, we find that API usages are highly repetitive. Since API calls are common across programs, researchers have made reliable statistical models to recommend sophisticated API call sequences. Sequential n-gram models were developed for natural languages. Code is usually represented by an AST which contains control and data flow, making n-grams models a poor representation of code. Comparing n-grams to statistical graph representations of the same codebase, we find that graphs are more repetitive and contain higherlevel patterns than n-grams. We suggest that future work focus on statistical code graphs models that accurately capture complex coding patterns. Our replication package makes our scripts and data available to future researchers[1].