Google based name search: Resolving mixed entities on the web

Byung-Won On, Ingyu Lee
Published in: 2009 Fourth International Conference on Digital Information Management
Publication date: 2009-12-18
DOI: 10.1109/ICDIM.2009.5356763 (https://doi.org/10.1109/ICDIM.2009.5356763)
Citations: 3

Abstract

When non-unique values are used as entity identifiers, homonyms can cause confusion. In particular, when only part of an entity's name is used as its identifier, the problem is often referred to as the mixed entity resolution problem, where the goal is to separate entities that have been erroneously merged because of name homonyms (e.g., if only the last name is used as an identifier, one cannot distinguish “Vannevar Bush” from “George Bush”). The mixed entity resolution problem is especially common in Web data. For instance, a Google search for a product name (e.g., Oracle) returns a mixture of web pages because of name homonyms (e.g., Oracle Database, Oracle Audio, Oracle Academy, etc.). In this paper, we present a practical system for resolving such mixed entities on the Web. To develop such a system, we propose a web-service-based interface, an unsupervised clustering scheme, and cluster ranking algorithms. In particular, since the correct number of clusters is often unknown, we study a state-of-the-art unsupervised clustering solution based on the propagation of pairwise similarities between entities. Our claim is empirically validated through experiments showing that our approach outperforms the main competing solution.
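To make the idea of grouping mixed search results without fixing the number of clusters in advance more concrete, the following is a minimal sketch in Python. It is not the system or the propagation-based clustering algorithm described in the paper, whose details the abstract does not give. As a stand-in, it links any two result snippets whose bag-of-words cosine similarity reaches a threshold and treats the resulting connected components as clusters, then ranks clusters by size. The function names, the `threshold` value, the whitespace tokenizer, and the sample snippets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic words only (an assumption)."""
    return [w for w in text.lower().split() if w.isalpha()]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_snippets(snippets, query, threshold=0.3):
    """Group snippets into clusters without specifying the cluster count.

    Stand-in technique: threshold-linked connected components (union-find),
    not the similarity-propagation method the abstract refers to.
    Returns lists of snippet indices, largest cluster first.
    """
    # Drop the query term itself, since every result contains it.
    q = query.lower()
    vectors = [Counter(t for t in tokenize(s) if t != q) for s in snippets]
    parent = list(range(len(snippets)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(snippets)):
        for j in range(i + 1, len(snippets)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(snippets)):
        groups.setdefault(find(i), []).append(i)
    # Simple ranking assumption: bigger clusters first, ties by earliest result.
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))


if __name__ == "__main__":
    # Hypothetical snippets for the query "Oracle"; not real search results.
    results = [
        "oracle database sql server software",
        "oracle sql database administration",
        "oracle audio speakers amplifier",
        "oracle audio turntable speakers",
        "oracle academy student courses",
    ]
    for rank, members in enumerate(cluster_snippets(results, "oracle"), start=1):
        print(rank, [results[i] for i in members])
```

The number of clusters here falls out of the threshold rather than being supplied up front, which mirrors the constraint the abstract highlights; a real system would need a more robust similarity measure and ranking criterion.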