A Bigram-based Inference Model for Retrieving Abbreviated Phrases in Source Code
Abdulrahman Alatawi, Weifeng Xu, Dianxiang Xu
Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, 2020-04-15
DOI: https://doi.org/10.1145/3383219.3383221
Abstract
Expanding abbreviations in source code to their full meanings helps software maintainers comprehend the code. Existing approaches, however, focus on expanding an abbreviation to a single word, i.e., a unigram, and do not perform well on abbreviations of phrases that consist of multiple unigrams. This paper proposes a bigram-based approach for retrieving abbreviated phrases automatically. Key to this approach is a bigram-based inference model that chooses the best phrase from all candidates. It uses the statistical properties of unigrams and bigrams as prior knowledge, together with a bigram language model that estimates the likelihood of each candidate phrase for a given abbreviation. We applied the bigram-based approach to 100 phrase abbreviations randomly selected from eight open source projects. The experimental results show that it correctly retrieved 78% of the abbreviations by using the unigram and bigram properties of a source code repository. This is 9% more accurate than the unigram-based approach and much better than other existing approaches. The bigram-based approach is also less biased towards specific phrase sizes than the unigram-based approach.
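To make the idea of scoring candidate phrases with a bigram language model concrete, the following is a minimal sketch and not the authors' implementation. It assumes acronym-style abbreviations (one letter per word), add-one smoothing, and a toy corpus of tokenized identifiers; the function names `build_counts`, `candidates`, `log_score`, and `expand` are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def build_counts(sentences):
    """Count unigrams and adjacent-word bigrams in tokenized word lists."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def candidates(abbr, unigrams):
    """Assumption: each letter of the abbreviation starts one word."""
    per_letter = [[w for w in unigrams if w.startswith(ch)] for ch in abbr]
    return [list(p) for p in product(*per_letter)]


def log_score(phrase, unigrams, bigrams):
    """Log-probability of a phrase under an add-one smoothed bigram model."""
    vocab = len(unigrams)
    total = sum(unigrams.values())
    # Unigram probability of the first word.
    score = math.log((unigrams[phrase[0]] + 1) / (total + vocab))
    # Conditional probability of each later word given its predecessor.
    for prev, cur in zip(phrase, phrase[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return score


def expand(abbr, unigrams, bigrams):
    """Return the highest-scoring candidate phrase, or None if there is none."""
    cands = candidates(abbr, unigrams)
    if not cands:
        return None
    return max(cands, key=lambda p: log_score(p, unigrams, bigrams))


# Toy corpus (made up for illustration), e.g. split identifiers.
corpus = [
    ["file", "name"],
    ["file", "name"],
    ["file", "path"],
    ["first", "node"],
    ["next", "file"],
]
unigrams, bigrams = build_counts(corpus)
print(expand("fn", unigrams, bigrams))  # ['file', 'name']
```

In this toy corpus, "file name" wins over "first node" because both the first word and the bigram are more frequent. A realistic system would also need to handle non-acronym abbreviations such as dropped vowels or prefixes, which this sketch does not attempt.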