Mining Duplicate Questions of Stack Overflow

2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pp. 402-412. Pub Date: 2016-05-14. DOI: 10.1145/2901739.2901770

Md Ahasanuzzaman, M. Asaduzzaman, C. Roy, Kevin A. Schneider


Citations: 107

Abstract

Stack Overflow is a popular question-answering site focused on programming problems. Despite efforts to prevent users from asking questions that have already been answered, the site contains duplicate questions. As a result, developers may wait unnecessarily for an answer to a question that has already been asked and answered. The site currently depends on its moderators and high-reputation users to manually mark such questions as duplicates, which not only delays responses but also requires additional effort. In this paper, we first perform a manual investigation to understand why users submit duplicate questions on Stack Overflow. Based on that investigation, we propose a classification technique that uses a number of carefully chosen features to identify duplicate questions. Evaluation on a large number of questions shows that our technique can detect duplicate questions with reasonable accuracy. We also compare our technique with DupPredictor, a state-of-the-art technique for detecting duplicate questions, and find that our proposed technique has a better recall rate.
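The abstract does not list the features or the classifier the authors use, so the following is only a minimal sketch of the general idea: compute similarity features for a pair of questions, decide whether the pair is a duplicate, and measure recall. The feature names (title_sim, tag_sim), the equal weighting, the 0.5 threshold, and the dictionary layout of a question are all assumptions made for illustration, not the paper's method.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_features(q1, q2):
    """Hypothetical features for a question pair: title and tag overlap."""
    return {
        "title_sim": jaccard(tokens(q1["title"]), tokens(q2["title"])),
        "tag_sim": jaccard(set(q1["tags"]), set(q2["tags"])),
    }

def is_duplicate(features, threshold=0.5):
    """Toy decision rule standing in for a trained classifier (assumption)."""
    score = 0.5 * features["title_sim"] + 0.5 * features["tag_sim"]
    return score >= threshold

def recall(predicted, actual):
    """Fraction of true duplicate pairs that were predicted as duplicates."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    positives = sum(1 for a in actual if a)
    return true_pos / positives if positives else 0.0

# Example usage with made-up questions.
q1 = {"title": "How to reverse a list in Python", "tags": ["python", "list"]}
q2 = {"title": "Reverse a list in Python", "tags": ["python", "list"]}
print(is_duplicate(pair_features(q1, q2)))
```

In practice the hand-set threshold would be replaced by a classifier trained on question pairs that moderators have already marked as duplicates. Recall is the metric the abstract uses for its comparison with DupPredictor.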

