Corpus Cleanup of Mistaken Agreement Using Word Sense Disambiguation

Int. J. Comput. Linguistics Chin. Lang. Process. Pub Date: 2008-12-01. DOI: 10.30019/IJCLCLP.200812.0002

Liang-Chih Yu, Chung-Hsien Wu, Jui-Feng Yeh, E. Hovy



Abstract

Word sense annotated corpora are useful resources for many text mining applications, but only if their annotations are consistent. Most large-scale annotation efforts take special measures to reconcile inter-annotator disagreement. To date, however, nobody has investigated how to automatically identify exemplars on which the annotators agree but are wrong. In this paper, we use OntoNotes, a large-scale corpus of semantic annotations that includes word senses, predicate-argument structure, ontology linking, and coreference. To find mistaken agreements in the word sense annotation, we employ word sense disambiguation (WSD) to select a set of suspicious candidates for human evaluation. We evaluate the performance of WSD on three measures: precision, cost-effectiveness ratio, and entropy. The experimental results show that WSD is most effective at identifying erroneous annotations for highly ambiguous words, while a baseline is better in other cases. The two methods can be combined to improve the cleanup process. This procedure allows us to find approximately 2% of the remaining erroneous agreements in the OntoNotes corpus. A similar procedure can easily be defined to check other annotated corpora.
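The candidate-selection idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' system: it assumes that a "suspicious candidate" is an instance whose agreed annotation differs from the WSD prediction, and that "entropy" refers to the Shannon entropy of a word's annotated sense distribution as a measure of ambiguity. The abstract does not spell out either definition. All names here (`suspicious_candidates`, `sense_entropy`, `toy_predict`, the sense labels, and the sample sentences) are hypothetical.

```python
import math
from collections import Counter


def sense_entropy(sense_labels):
    """Shannon entropy (in bits) of a word's annotated sense distribution.

    Assumption: higher entropy is used here as a proxy for a more ambiguous word.
    """
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def suspicious_candidates(instances, predict):
    """Return instances whose agreed annotation differs from the WSD prediction."""
    return [inst for inst in instances if predict(inst["context"]) != inst["sense"]]


def precision(candidates, is_error):
    """Fraction of flagged candidates that a human reviewer confirms as errors."""
    if not candidates:
        return 0.0
    return sum(1 for c in candidates if is_error(c)) / len(candidates)


# Hypothetical toy data: every instance carries the sense the annotators agreed on.
instances = [
    {"context": "deposit money at the bank", "sense": "bank.finance"},
    {"context": "fishing on the river bank", "sense": "bank.finance"},
    {"context": "the bank approved the loan", "sense": "bank.finance"},
]


def toy_predict(context):
    """Stand-in for a trained WSD classifier (a keyword rule, for illustration only)."""
    return "bank.river" if "river" in context else "bank.finance"


# Only instances where the classifier and the annotation disagree go to human review.
flagged = suspicious_candidates(instances, toy_predict)
```

In this sketch only the flagged instances would be passed to a human reviewer, and `precision` would then be computed from the reviewer's judgments. Computing `sense_entropy` per word could be used to decide whether to rely on WSD or on a baseline, in the spirit of the combination the abstract describes.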



