Google based name search: Resolving mixed entities on the web

2009 Fourth International Conference on Digital Information Management. Pub Date: 2009-12-18. DOI: 10.1109/ICDIM.2009.5356763

Byung-Won On, Ingyu Lee


Citations: 3

Abstract

When non-unique values are used as entity identifiers, homonyms can cause confusion. In particular, when only part of an entity's name is used as its identifier, the problem is often referred to as the mixed entity resolution problem, where the goal is to separate entities that have been erroneously merged because of name homonyms (e.g., if only the last name is used as an identifier, one cannot distinguish “Vannevar Bush” from “George Bush”). The mixed entity resolution problem is especially common in Web data. For instance, a Google search for a product name (e.g., Oracle) returns a mixture of web pages because of name homonyms (e.g., Oracle Database, Oracle Audio, Oracle Academy, etc.). In this paper, we present a practical system for resolving such mixed entities on the Web. To build this system, we propose a web-service-based interface, an unsupervised clustering scheme, and cluster ranking algorithms. In particular, since the correct number of clusters is often unknown, we study a state-of-the-art unsupervised clustering solution based on the propagation of pairwise similarities between entities. Our claim is empirically validated through experiments showing that our approach outperforms the main competing solution.
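To make the idea of clustering without a known number of clusters concrete, here is a minimal sketch. It is not the authors' algorithm: it swaps in a much simpler technique, Jaccard similarity over word sets followed by union-find grouping of any pair above a threshold. The function names, the threshold of 0.25, the sample page titles, and the size-based ranking rule are all assumptions made for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Return the set of lowercase words in a page title or snippet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, query, threshold=0.25):
    """Group pages linked by chains of pairwise similarity >= threshold.

    Union-find merges similar pairs, so the number of clusters is an
    output of the procedure rather than an input.
    """
    shared = tokens(query)  # the ambiguous name itself carries no signal
    bags = [tokens(p) - shared for p in pages]
    parent = list(range(len(pages)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(pages)), 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, page in enumerate(pages):
        groups.setdefault(find(i), []).append(page)
    # Hypothetical ranking rule: larger clusters first.
    return sorted(groups.values(), key=len, reverse=True)


# Made-up result titles for the ambiguous query "Oracle".
pages = [
    "Oracle Database 11g release notes",
    "Oracle Database administration guide release",
    "Oracle Audio speakers and amplifiers",
    "Oracle Audio amplifiers review",
    "Oracle Academy courses for students",
]
clusters = cluster_pages(pages, "Oracle")
for rank, members in enumerate(clusters, start=1):
    print(rank, members)
```

With these sample titles the two database pages end up together, the two audio pages end up together, and the academy page stays alone. A real system would likely need richer features than title words and a more robust similarity propagation scheme than a single threshold.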
