Search-based entity disambiguation with document-centric knowledge bases

Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business. Pub Date: 2015-10-21. DOI: 10.1145/2809563.2809618

Stefan Zwicklbauer, C. Seifert, M. Granitzer


Citations: 2

Abstract

Entity disambiguation is the task of mapping ambiguous terms in natural-language text to their entities in a knowledge base. One way to describe these entities within a knowledge base is via entity-annotated documents (a document-centric knowledge base). Search-based entity disambiguation algorithms that use document-centric knowledge bases have been shown to perform well in the biomedical domain. In this context, the question remains how the number of annotated entities within documents and the number of documents used for entity classification influence disambiguation results. Another open question is whether these results hold on more general knowledge data sets (e.g. Wikipedia). In our work, we implement a search-based, document-centric disambiguation system and explicitly evaluate these questions on the biomedical data set CALBC and the general knowledge data set Wikipedia. We show that the number of documents used for classification and the number of annotations within these documents must be well matched to attain the best result. Additionally, we find that disambiguation accuracy is poor on Wikipedia. We show that disambiguation results improve significantly when using more but shorter documents (e.g. Wikipedia paragraphs). Our results indicate that search-based, document-centric disambiguation systems must be carefully adapted to the underlying domain and the availability of user data.
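To make the general idea concrete, here is a minimal sketch of search-based disambiguation over a document-centric knowledge base. It is not the authors' system: the toy knowledge base `KB`, the token-overlap ranking in `retrieve`, the majority vote in `disambiguate`, and all entity identifiers are assumptions chosen for illustration. The abstract does not specify the retrieval model or the classification rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AnnotatedDoc:
    text: str
    annotations: list[tuple[str, str]]  # (surface form, entity id)


def tokenize(text: str) -> set[str]:
    # Lowercase whitespace tokens with surrounding punctuation stripped.
    return {tok.strip(".,;:()").lower() for tok in text.split()} - {""}


def retrieve(query: str, kb: list[AnnotatedDoc], k: int) -> list[AnnotatedDoc]:
    # Rank documents by token overlap with the query (a stand-in for a
    # real search engine) and return the k best non-zero matches.
    q = tokenize(query)
    scored = [(len(q & tokenize(d.text)), i) for i, d in enumerate(kb)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [kb[i] for _, i in scored[:k]]


def disambiguate(mention: str, context: str, kb: list[AnnotatedDoc], k: int):
    # Count, over the k retrieved documents, the entities annotated with
    # the same surface form as the mention; return the most frequent one.
    votes = Counter()
    for doc in retrieve(context, kb, k):
        for surface, entity in doc.annotations:
            if surface.lower() == mention.lower():
                votes[entity] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy knowledge base: three entity-annotated documents.
KB = [
    AnnotatedDoc("Jaguar cars are built in Coventry by the British manufacturer.",
                 [("Jaguar", "Jaguar_Cars")]),
    AnnotatedDoc("The jaguar is a large cat that hunts in the rainforest.",
                 [("jaguar", "Jaguar_(animal)")]),
    AnnotatedDoc("A jaguar hunts prey near rivers in the rainforest of Brazil.",
                 [("jaguar", "Jaguar_(animal)")]),
]

animal = disambiguate("jaguar", "The jaguar hunts in the rainforest at night.", KB, k=2)
car_k1 = disambiguate("Jaguar", "New Jaguar cars built by the manufacturer", KB, k=1)
car_k3 = disambiguate("Jaguar", "New Jaguar cars built by the manufacturer", KB, k=3)
```

In this toy setup the second query resolves to `Jaguar_Cars` with `k=1` but is outvoted with `k=3`, because the weakly matching documents still carry annotations for the mention. This is only an illustration of why the number of retrieved documents and the number of annotations they contain interact; it says nothing about the magnitudes reported in the paper.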


