Filling the Gaps: Improving Wikipedia Stubs

Proceedings of the 2015 ACM Symposium on Document Engineering, Pub Date: 2015-09-08, DOI: 10.1145/2682571.2797073

Siddhartha Banerjee, P. Mitra


Citations: 8

Abstract

Wikipedia's limited pool of contributors cannot ensure consistent growth and improvement of the online encyclopedia. Because relevant information is scattered across the web, our goal is to automate content generation for Wikipedia. In this work, we propose a technique for improving Wikipedia stubs, which are articles that lack comprehensive information. A classifier learns features from existing comprehensive Wikipedia articles and recommends content that can be added to stubs to make them more complete. We conduct experiments with several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning architecture (a deep belief network), and a TF-IDF based classifier. Our experiments show that the LDA-based model outperforms the other models (by ~6% in F-score). Our generation approach shows that the technique is capable of producing comprehensive articles: articles generated by our system achieve higher ROUGE-2 scores than those generated by the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.
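To make the ROUGE-2 comparison concrete, the following is a minimal sketch of how a bigram-overlap score between a generated text and a reference text can be computed. It is an illustration only, not the authors' evaluation setup: the abstract does not state which ROUGE variant or tooling was used, so the whitespace tokenization, the function names `bigrams` and `rouge_2`, and the example sentences are all assumptions.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of adjacent lowercase word pairs in `text`."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(candidate, reference):
    """Compute ROUGE-2 recall, precision and F1 from clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical example texts, not taken from the paper.
reference = "the river flows through the old town"
generated = "the river flows past the old town"
scores = rouge_2(generated, reference)
print(scores)
```

In this toy case four of the six reference bigrams also appear in the generated sentence, so recall and precision are both 4/6. A real evaluation would typically use an established ROUGE implementation with its own tokenization and stemming options.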


