Unsupervised Software-Specific Morphological Forms Inference from Informal Discussions

2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). Pub Date: 2017-05-20. DOI: 10.1109/ICSE.2017.48

Chunyang Chen, Zhenchang Xing, Ximing Wang

Pages: 450-461

Citations: 51

Abstract

Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques for dealing with such morphological forms are either designed for general English or rely predominantly on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but it is very difficult to build manually. In this work, we propose an automatic approach to building such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
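Two of the ingredients named in the abstract, contrasting a software-specific corpus with a general one and applying a lexical rule to candidate abbreviations, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the add-one smoothing, the frequency-ratio threshold, and the "same first letter plus in-order subsequence" rule are all assumptions chosen for illustration, and the sketch omits the distributed word semantics and graph analysis that the abstract also mentions.

```python
from collections import Counter


def domain_specific_terms(domain_tokens, general_tokens, min_count=2, min_ratio=3.0):
    """Hypothetical helper: keep terms that are relatively much more frequent
    in the domain corpus than in the general corpus.

    The general-corpus frequency is add-one smoothed so that terms absent
    from the general corpus do not cause a division by zero.
    """
    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    vocab_size = len(set(domain_counts) | set(general_counts))
    selected = {}
    for term, count in domain_counts.items():
        if count < min_count:
            continue  # too rare in the domain corpus to judge
        domain_freq = count / len(domain_tokens)
        general_freq = (general_counts[term] + 1) / (len(general_tokens) + vocab_size)
        ratio = domain_freq / general_freq
        if ratio >= min_ratio:
            selected[term] = ratio
    return selected


def is_abbreviation_candidate(short, full):
    """Hypothetical lexical rule: `short` is shorter than `full`, shares its
    first letter, and its letters appear in `full` in the same order."""
    if not short or len(short) >= len(full) or short[0] != full[0]:
        return False
    remaining = iter(full)
    # `ch in remaining` advances the iterator, which enforces letter order.
    return all(ch in remaining for ch in short)


# Toy corpora, invented for this example.
domain = "javascript js javascript the js array the".split()
general = "the cat sat on the mat the end".split()

terms = domain_specific_terms(domain, general)
pairs = [
    (a, b)
    for a in terms
    for b in terms
    if a != b and is_abbreviation_candidate(a, b)
]
```

On these toy corpora, `terms` keeps `javascript` and `js` but drops `the`, which is equally common in general text, and `pairs` links `js` to `javascript`. A purely lexical rule like this one would also accept many spurious pairs on real data, which is presumably why the abstract combines lexical rules with semantic similarity and graph analysis.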


