Person-Name Parsing for Linking User Web Profiles

G. Das, Xiang Li, Ang Sun, Hakan Kardes, Xin Wang
Proceedings of the 18th International Workshop on Web and Databases, published 2015-05-31
DOI: https://doi.org/10.1145/2767109.2767117
Citations: 5

Abstract

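Person-name parsing identifies the constituent parts of a person's name. Because of multiple writing styles ("John Smith" versus "Smith, John"), extra information ("John Smith, PhD", "Rev. John Smith"), and country-specific last-name prefixes ("Jean van de Velde"), parsing fullname strings from user profiles on Web 2.0 applications is not straightforward. To the best of our knowledge, we are the first to address this problem systematically by proposing machine learning approaches for parsing noisy fullname strings. In this paper, we propose several types of features based on token statistics, surface patterns, and specialized dictionaries, and apply them within a sequence modeling framework to learn a fullname parser. In particular, we propose "bucket" features based on (name-token, label) distributions in lieu of the "term" features frequently used in Natural Language Processing applications, so that the number of learning parameters does not grow with the size of the training data. We experimentally illustrate the generalizability, effectiveness, and efficiency of our proposed features for noisy fullname parsing on fullname strings from the popular professional networking website LinkedIn and on commonly used person names in the United States. On these datasets, our fullname parser significantly outperforms both a parser trained using classification approaches and a commercially available name parsing solution.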
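The contrast between "term" and "bucket" features can be illustrated with a minimal sketch. The abstract does not define how buckets are built, so everything below is an assumption made for illustration: the label set, the bucket count, the function names, and the choice to discretize the empirical P(label | token) into equal-width bins. The point it shows is only that the feature space stays at a fixed size while a term feature would add one entry per distinct token.

```python
from collections import Counter, defaultdict

LABELS = ("PREFIX", "FIRST", "LAST", "SUFFIX")  # hypothetical label set
NUM_BUCKETS = 5  # hypothetical; the abstract gives no bucket count


def count_token_labels(tagged_names):
    """Count how often each lowercased token carries each label."""
    counts = defaultdict(Counter)
    for name in tagged_names:
        for token, label in name:
            counts[token.lower()][label] += 1
    return counts


def bucket_features(token, counts, num_buckets=NUM_BUCKETS):
    """Map a token to one bucket feature per label (assumed scheme).

    Each bucket index discretizes the empirical P(label | token), so there
    are at most len(LABELS) * num_buckets + 1 distinct features, however
    many distinct tokens the training data contains.
    """
    label_counts = counts.get(token.lower())
    if not label_counts:
        return ["UNSEEN"]
    total = sum(label_counts.values())
    features = []
    for label in LABELS:
        p = label_counts[label] / total
        # Clamp so that p == 1.0 falls in the top bucket.
        bucket = min(int(p * num_buckets), num_buckets - 1)
        features.append(f"{label}_bucket={bucket}")
    return features


# Toy training data, invented for this example.
training = [
    [("John", "FIRST"), ("Smith", "LAST")],
    [("Smith", "LAST"), ("John", "FIRST")],
    [("Rev.", "PREFIX"), ("John", "FIRST"), ("Smith", "LAST")],
    [("Smith", "FIRST"), ("Jones", "LAST")],
]
counts = count_token_labels(training)
smith_features = bucket_features("Smith", counts)
john_features = bucket_features("John", counts)
unseen_features = bucket_features("Velde", counts)
```

In a sequence model such as a conditional random field, these per-token feature strings would stand in for a `term=smith` style feature; how the paper actually combines them with surface-pattern and dictionary features is not stated in the abstract.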