Automatic author name disambiguation by differentiable feature selection

Journal of Information Science (IF 1.8; Zone 4, Management; JCR Q3, Computer Science, Information Systems) · Pub Date: 2023-09-19 · DOI: 10.1177/01655515231193859

ZhiJian Fang, Yue Zhuo, Jinying Xu, Zhechong Tang, Zijie Jia, HuaXiong Zhang



Abstract

Author name disambiguation (AND) is the task of resolving name ambiguity in bibliographic databases, where distinct real-world authors may share the same name or the same author may appear under distinct names. The aim of AND is to partition the name-ambiguous entities (articles) among their corresponding authors. Existing AND algorithms mainly focus on designing different similarity metrics between two ambiguous articles. However, most previous methods select and process the features of entities empirically, then use those features to predict similarity with data-driven models. In this article, we are motivated by two natural questions: Which features are most useful for splitting name-ambiguous entities? Can they be determined automatically by an optimisation approach rather than by heuristic feature engineering? We therefore propose a novel end-to-end differentiable feature selection algorithm that automatically searches for the optimal features for the AND task (AAND). AAND optimises the discrete feature selection with the differentiable Gumbel-Softmax, which allows the feature selection policy and the similarity prediction model to be learned jointly. The experiments are conducted on a benchmark data set, S2AND, which harmonises eight different AND data sets. The results show that our proposal outperforms advanced AND methods and feature selection algorithms. We also give deeper insights into AND features.
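
To make the Gumbel-Softmax idea in the abstract concrete, the following is a minimal sketch of relaxed feature selection, not the authors' AAND implementation. It assumes one learnable selection logit per feature and a toy logistic similarity model; the names `relaxed_feature_mask`, `masked_similarity`, `select_logits` and all numeric values are hypothetical. It uses only the Python standard library and shows the forward pass only; in an automatic differentiation framework the same computation would let gradients reach both the selection logits and the model weights.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (soft) one-hot sample from the categorical given by `logits`."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_feature_mask(select_logits, tau, rng):
    """Per feature, sample a soft keep weight from a two-way (drop, keep) Gumbel-Softmax."""
    # Assumption: the "drop" option has logit 0 and the "keep" option has the learnable logit.
    return [gumbel_softmax([0.0, logit], tau, rng)[1] for logit in select_logits]


def masked_similarity(features, weights, mask):
    """Hypothetical similarity model: logistic score over mask-gated pairwise features."""
    z = sum(w * m * x for w, m, x in zip(weights, mask, features))
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
select_logits = [3.0, -3.0, 0.0]   # hypothetical: feature 0 favoured, feature 1 disfavoured
pair_features = [0.9, 0.2, 0.5]    # hypothetical similarity features for one article pair
weights = [1.5, 0.7, -0.4]         # hypothetical model weights

mask = relaxed_feature_mask(select_logits, tau=0.5, rng=rng)
score = masked_similarity(pair_features, weights, mask)
print(mask, score)
```

Lowering `tau` pushes each soft keep weight toward 0 or 1, so the relaxed mask approaches a discrete feature subset while remaining a smooth function of the logits.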




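Source journal

Journal of Information Science (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 6.80
Self-citation rate: 8.30%
Articles published: 121
Review time: 4 months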

About the journal: The Journal of Information Science is a peer-reviewed international journal of high repute covering topics of interest to all those researching and working in the sciences of information and knowledge management. The Editors welcome material on any aspect of information science theory, policy, application or practice that will advance thinking in the field.