A Piggyback System for Joint Entity Mention Detection and Linking in Web Queries

Proceedings of the 25th International Conference on World Wide Web. Pub Date: 2016-04-11. DOI: 10.1145/2872427.2883061

M. Cornolti, P. Ferragina, Massimiliano Ciaramita, Stefan Rüd, Hinrich Schütze


Citations: 33

Abstract

In this paper we study the problem of linking open-domain web-search queries to entities drawn from the full entity inventory of Wikipedia articles. We introduce SMAPH-2, a second-order approach that, by piggybacking on a web search engine, alleviates the noise and irregularities that characterize the language of queries and places queries in a larger context in which they are easier to interpret. The key algorithmic idea underlying SMAPH-2 is to first discover a candidate set of entities and then link those entities back to their mentions in the input query. This confines the concepts considered pertinent to the query to those actually mentioned in it. The link-back is implemented via a collective disambiguation step based on a supervised ranking model that makes one joint prediction for the annotation of the complete query, directly optimizing the F1 measure. We evaluate both known features, such as word embeddings and semantic relatedness among entities, and several novel features, such as an approximate distance between mentions and entities (which can handle spelling errors). We demonstrate that SMAPH-2 achieves state-of-the-art performance on the ERD@SIGIR2014 benchmark. We also publish GERDAQ (General Entity Recognition, Disambiguation and Annotation in Queries), a novel, public dataset built specifically for web-query entity linking via a crowdsourcing effort. On GERDAQ, SMAPH-2 also outperforms the benchmarks by comparable margins.
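
The candidate-then-link-back idea described in the abstract can be illustrated with a minimal Python sketch. This is not the SMAPH-2 implementation: the supervised joint ranking model is replaced here by a simple greedy threshold on a normalized edit distance, and the candidate list (which SMAPH-2 would obtain by piggybacking on a search engine) is hard-coded. The function names `link_back`, `normalized_distance`, and `f1`, and the `max_dist` value of 0.4, are assumptions made purely for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_distance(mention: str, title: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (case-insensitive)."""
    m, t = mention.lower(), title.lower()
    longest = max(len(m), len(t))
    return edit_distance(m, t) / longest if longest else 0.0


def query_segments(query: str):
    """Yield (start, end, text) for every contiguous token span of the query."""
    tokens = query.split()
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            yield start, end, " ".join(tokens[start:end])


def link_back(query: str, candidates: list[str], max_dist: float = 0.4):
    """Greedy stand-in for link-back (hypothetical, not the paper's ranker).

    Scores every (segment, candidate entity) pair, then accepts pairs in
    order of increasing distance, skipping overlapping segments and
    entities that are already linked.
    """
    scored = []
    for entity in candidates:
        for start, end, text in query_segments(query):
            dist = normalized_distance(text, entity)
            if dist <= max_dist:
                scored.append((dist, start, end, text, entity))
    scored.sort()

    taken_tokens, linked_entities, accepted = set(), set(), []
    for dist, start, end, text, entity in scored:
        span = set(range(start, end))
        if span & taken_tokens or entity in linked_entities:
            continue
        taken_tokens |= span
        linked_entities.add(entity)
        accepted.append((start, text, entity))
    # Return (mention, entity) pairs in query order.
    return [(text, entity) for _, text, entity in sorted(accepted)]


def f1(predicted, gold) -> float:
    """F1 over sets of (mention, entity) annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical candidate set; "mon" is a deliberate misspelling of "moon".
    candidates = ["Neil Armstrong", "Moon landing", "Apollo 11"]
    print(link_back("armstrong mon landing", candidates))
```

In this toy query, the candidate "Apollo 11" has no sufficiently close segment and is dropped, which mirrors how link-back keeps only entities actually mentioned in the query. A fixed threshold is a crude substitute for the learned joint prediction; it is used here only to keep the example self-contained.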
