Identifying Duplicate and Contradictory Information in Wikipedia

Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. Pub Date: 2014-06-04. DOI: 10.1145/2756406.2756947

Sarah Weissman, S. Ayhan, Joshua Bradley, Jimmy J. Lin

Citations: 13

Abstract

In this paper, we identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify sentences with high Jaccard similarity, followed by a pass to generate sentence clusters. Based on manual examination, we discovered that these clusters can be categorized into six different types: templates, identical sentences, copyediting, factual drift, references, and other. Two of these categories are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
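The pipeline described in the abstract (minhash signatures, a Jaccard similarity threshold, then a clustering pass) can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation: the word-trigram shingling, the 64 hash functions, the 0.8 threshold, the all-pairs comparison, and the union-find clustering are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import hashlib
from itertools import combinations


def shingles(sentence, n=3):
    """Return the set of word n-grams in a sentence (assumed tokenization)."""
    words = sentence.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _hash(seed, token):
    # Deterministic 64-bit hash of a token; the seed selects the hash function.
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=64):
    """Signature = minimum hash value of the token set under each hash function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def estimate_similarity(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def cluster(sentences, threshold=0.8, num_hashes=64):
    """Group sentence indices whose estimated similarity meets the threshold."""
    sigs = [minhash_signature(shingles(s), num_hashes) for s in sentences]
    parent = list(range(len(sentences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison is quadratic; acceptable only for a small demo.
    for i, j in combinations(range(len(sentences)), 2):
        if estimate_similarity(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    # Keep only clusters with at least two sentences.
    return [members for members in groups.values() if len(members) > 1]
```

Calling `cluster(list_of_sentences)` returns lists of sentence indices that fall into the same near-duplicate cluster. At Wikipedia scale the all-pairs loop would be impractical; a distributed implementation would typically group sentences by shared signature values first and compare only within those groups.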
