Person-Name Parsing for Linking User Web Profiles
G. Das, Xiang Li, Ang Sun, Hakan Kardes, Xin Wang
Proceedings of the 18th International Workshop on Web and Databases, 2015-05-31
DOI: 10.1145/2767109.2767117 (https://doi.org/10.1145/2767109.2767117)
Citations: 5
Abstract
Person-name parsing is the task of identifying the constituent parts of a person's name. Because of multiple writing styles ("John Smith" versus "Smith, John"), extra information ("John Smith, PhD", "Rev. John Smith"), and country-specific last-name prefixes ("Jean van de Velde"), parsing fullname strings from user profiles on Web 2.0 applications is not straightforward. To the best of our knowledge, we are the first to address this problem systematically by proposing machine learning approaches for parsing noisy fullname strings. In this paper, we propose several types of features based on token statistics, surface patterns, and specialized dictionaries, and apply them within a sequence modeling framework to learn a fullname parser. In particular, we propose "bucket" features based on (name-token, label) distributions in place of the "term" features frequently used in Natural Language Processing applications, so that the number of learned parameters does not grow with the size of the training data. We experimentally demonstrate the generalizability, effectiveness, and efficiency of the proposed features for noisy fullname parsing, using fullname strings from the professional networking website LinkedIn and commonly used person names in the United States. On these datasets, our fullname parser significantly outperforms both a parser trained using classification approaches and a commercially available name parsing solution.
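
The abstract does not spell out how the "bucket" features are built, so the following is only a minimal sketch of the general idea, under stated assumptions: estimate a per-token label distribution from training data, then discretize each probability into a small, fixed number of buckets. The label set `LABELS`, the equal-width bucketing, the `n_buckets` parameter, and the function names are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict

# Hypothetical label set; the abstract does not list the labels used.
LABELS = ("PREFIX", "FIRST", "MIDDLE", "LAST", "SUFFIX")


def label_distributions(training_data):
    """Estimate P(label | token) from (tokens, labels) training pairs."""
    counts = defaultdict(Counter)
    for tokens, labels in training_data:
        for token, label in zip(tokens, labels):
            counts[token.lower()][label] += 1
    return {
        token: {label: c[label] / sum(c.values()) for label in LABELS}
        for token, c in counts.items()
    }


def bucket_features(token, dists, n_buckets=4):
    """Map a token to one (label, bucket) feature per label.

    There are at most len(LABELS) * n_buckets + 1 distinct features,
    no matter how many distinct tokens the training data contains.
    """
    dist = dists.get(token.lower())
    if dist is None:
        return ["bucket:UNSEEN"]
    features = []
    for label in LABELS:
        # Equal-width buckets over [0, 1]; probability 1.0 lands in the top bucket.
        bucket = min(int(dist[label] * n_buckets), n_buckets - 1)
        features.append(f"bucket:{label}={bucket}")
    return features


# Toy training data built from the name examples in the abstract.
training_data = [
    (["John", "Smith"], ["FIRST", "LAST"]),
    (["Smith", "John"], ["LAST", "FIRST"]),
    (["Rev.", "John", "Smith"], ["PREFIX", "FIRST", "LAST"]),
    (["John", "Smith", "PhD"], ["FIRST", "LAST", "SUFFIX"]),
    (["Jean", "van", "de", "Velde"], ["FIRST", "LAST", "LAST", "LAST"]),
]

dists = label_distributions(training_data)
print(bucket_features("John", dists))
print(bucket_features("Zebedee", dists))
```

A "term" feature would instead emit something like `term=john`, adding one parameter per label for every distinct token seen in training. In this sketch the feature strings depend only on the label and the bucket index, which is one way the parameter count can stay fixed as the training set grows. The resulting feature lists could then be fed, token by token, to a sequence model of the reader's choice.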