{"title":"Search-based entity disambiguation with document-centric knowledge bases","authors":"Stefan Zwicklbauer, C. Seifert, M. Granitzer","doi":"10.1145/2809563.2809618","DOIUrl":null,"url":null,"abstract":"Entity disambiguation is the task of mapping ambiguous terms in natural-language text to its entities in a knowledge base. One possibility to describe these entities within a knowledge base is via entity-annotated documents (document-centric knowledge base). It has been shown that entity disambiguation with search-based algorithms that use document-centric knowledge bases perform well on the biomedical domain. In this context, the question remains how the quantity of annotated entities within documents and the document count used for entity classification influence disambiguation results. Another open question is whether disambiguation results hold true on more general knowledge data sets (e.g. Wikipedia). In our work we implement a search-based, document-centric disambiguation system and explicitly evaluate the mentioned issues on the biomedical data set CALBC and general knowledge data set Wikipedia, respectively. We show that the number of documents used for classification and the amount of annotations within these documents must be well-matched to attain the best result. Additionally, we reveal that disambiguation accuracy is poor on Wikipedia. We show that disambiguation results significantly improve when using shorter but more documents (e.g. Wikipedia paragraphs). Our results indicate that search-based, document-centric disambiguation systems must be carefully adapted with reference to the underlying domain and availability of user data.","PeriodicalId":20526,"journal":{"name":"Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business","volume":"49 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2015-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2809563.2809618","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Entity disambiguation is the task of mapping ambiguous terms in natural-language text to their corresponding entities in a knowledge base. One possibility for describing these entities within a knowledge base is via entity-annotated documents (a document-centric knowledge base). It has been shown that entity disambiguation with search-based algorithms that use document-centric knowledge bases performs well in the biomedical domain. In this context, the question remains how the number of annotated entities within documents and the number of documents used for entity classification influence disambiguation results. Another open question is whether these disambiguation results hold for more general knowledge data sets (e.g. Wikipedia). In our work, we implement a search-based, document-centric disambiguation system and explicitly evaluate the mentioned issues on the biomedical data set CALBC and the general knowledge data set Wikipedia, respectively. We show that the number of documents used for classification and the amount of annotation within these documents must be well matched to attain the best results. Additionally, we reveal that disambiguation accuracy is poor on Wikipedia. We show that disambiguation results improve significantly when using more but shorter documents (e.g. Wikipedia paragraphs). Our results indicate that search-based, document-centric disambiguation systems must be carefully adapted to the underlying domain and the availability of user data.
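To illustrate the general idea of search-based, document-centric disambiguation described in the abstract, the following is a minimal sketch, not the authors' actual system: it assumes a toy in-memory index and simple term-overlap scoring in place of a real search engine, and a hypothetical parameter k that controls how many retrieved documents vote on the final entity (corresponding to the document count studied in the paper).

```python
# Minimal sketch of search-based, document-centric entity disambiguation.
# A document-centric knowledge base: each document is plain text plus the
# entity identifiers annotated in it.
from collections import Counter, defaultdict

DOCS = [
    {"text": "the jaguar is a large cat native to the americas",
     "entities": ["Jaguar_(animal)"]},
    {"text": "jaguar cars builds luxury vehicles in england",
     "entities": ["Jaguar_Cars"]},
    {"text": "the big cat hunts near rivers in south america",
     "entities": ["Jaguar_(animal)"]},
]

def build_index(docs):
    """Inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc["text"].split():
            index[term].add(doc_id)
    return index

def disambiguate(mention_context, docs, index, k=2):
    """Score documents by term overlap with the mention context, then let
    the top-k retrieved documents vote with their entity annotations."""
    scores = Counter()
    for term in mention_context.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    votes = Counter()
    for doc_id, score in scores.most_common(k):
        for entity in docs[doc_id]["entities"]:
            votes[entity] += score
    return votes.most_common(1)[0][0] if votes else None

index = build_index(DOCS)
print(disambiguate("jaguar spotted near a river in the rainforest", DOCS, index))
# -> Jaguar_(animal)
```

In this framing, the trade-off the paper investigates corresponds to tuning k (how many documents vote) against how densely each document is annotated: many short documents (e.g. Wikipedia paragraphs) or fewer long ones change which entities accumulate votes.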