Effective Reformulation of Query for Code Search Using Crowdsourced Knowledge and Extra-Large Data Analytics

2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2018-07-23. DOI: 10.1109/ICSME.2018.00057

M. M. Rahman, C. Roy


Citations: 48

Abstract



Software developers frequently issue generic natural language queries for code search while using code search engines (e.g., GitHub native search, Krugle). Such queries often do not lead to any relevant results due to vocabulary mismatch problems. In this paper, we propose a novel technique that automatically identifies relevant and specific API classes from Stack Overflow Q & A site for a programming task written as a natural language query, and then reformulates the query for improved code search. We first collect candidate API classes from Stack Overflow using pseudo-relevance feedback and two term weighting algorithms, and then rank the candidates using Borda count and semantic proximity between query keywords and the API classes. The semantic proximity has been determined by an analysis of 1.3 million questions and answers of Stack Overflow. Experiments using 310 code search queries report that our technique suggests relevant API classes with 48% precision and 58% recall which are 32% and 48% higher respectively than those of the state-of-the-art. Comparisons with two state-of-the-art studies and three popular search engines (e.g., Google, Stack Overflow, and GitHub native search) report that our reformulated queries (1) outperform the queries of the state-of-the-art, and (2) significantly improve the code search results provided by these contemporary search engines.
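The ranking step described above combines several ranked candidate lists with Borda count. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes two ranked lists of candidate API classes (one per term weighting algorithm), gives each candidate points by its position in each list, and appends the top-scoring classes to the query. The function names, the example query, and the candidate class lists are hypothetical, and the semantic proximity score mentioned in the abstract is left out.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate ranked lists (best candidate first) into Borda scores.

    In a list of length n, the candidate at 0-based position i earns n - i points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - position
    return dict(scores)


def reformulate(query, rankings, top_k=2):
    """Append the top_k Borda-ranked candidates to the original query."""
    scores = borda_count(rankings)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return query + " " + " ".join(ranked[:top_k])


# Hypothetical candidate lists, one per term weighting algorithm.
list_a = ["MessageDigest", "String", "BigInteger"]
list_b = ["MessageDigest", "BigInteger", "StringBuilder"]

print(reformulate("generate MD5 hash of a string", [list_a, list_b]))
```

In this toy input, MessageDigest leads both lists and so receives the highest combined score, while BigInteger outranks String because it appears in both lists. A fuller version would presumably also weight each candidate by its semantic proximity to the query keywords, which this sketch does not attempt.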

