Siamese Neural Network for Unstructured Data Linkage
Anna Jurek-Loughrey
Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-11-30
DOI: 10.1145/3428757.3429106 (https://doi.org/10.1145/3428757.3429106)
Citations: 1
Abstract
Data integration is one of the key problems in the era of Big Data analytics. Its central challenge is identifying records that represent the same entity (e.g. a person), a task referred to as Record Linkage. Different data sources rarely share a unique identifier, so records must be matched by comparing their corresponding values. Most existing methods assume that records across different sources are structured and represented by the same set of attributes (e.g. name, date of birth). However, the majority of data today comes without structure (e.g. from social media sites). We propose a new approach to Record Linkage based on a Siamese Neural Network. The model can be applied to structured, semi-structured and unstructured records, and it does not assume a common format across different data sources. We demonstrate that the model performs on par with other approaches, which make constraining assumptions regarding the data.
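The abstract does not describe the network architecture, so the following is only a minimal sketch of the general Siamese idea, not the author's model. It assumes a hashed character-trigram representation of free text and a single shared, untrained linear layer; the names `featurize`, `encode`, `similarity`, `DIM_IN` and `DIM_OUT` are hypothetical and chosen for illustration. What it shows is that both records pass through the same encoder with the same weights and are then compared in embedding space, which is why no common attribute schema is required.

```python
import math
import random
import zlib

DIM_IN = 64   # size of the hashed trigram vector (assumed value)
DIM_OUT = 16  # size of the embedding (assumed value)

# A single weight matrix shared by both branches. Sharing the weights is
# what makes the comparison "Siamese". The weights here are random and
# untrained; a real model would learn them from labelled record pairs.
_rng = random.Random(0)
WEIGHTS = [
    [_rng.gauss(0.0, 1.0) / math.sqrt(DIM_IN) for _ in range(DIM_IN)]
    for _ in range(DIM_OUT)
]


def featurize(record: str) -> list:
    """Hash the character trigrams of a free-text record into a count vector."""
    text = "  " + record.lower() + "  "
    vec = [0.0] * DIM_IN
    for i in range(len(text) - 2):
        trigram = text[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM_IN] += 1.0
    return vec


def encode(record: str) -> list:
    """Shared encoder: the same function and weights for every record."""
    x = featurize(record)
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in WEIGHTS]


def similarity(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two records."""
    ea, eb = encode(a), encode(b)
    dot = sum(p * q for p, q in zip(ea, eb))
    norm = math.sqrt(sum(p * p for p in ea)) * math.sqrt(sum(q * q for q in eb))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Two records with no shared schema: a free-text post and a form entry.
    print(similarity("Anna Smith, Belfast, born 1980",
                     "name: anna smith; city: belfast"))
```

Because the weights are untrained, the printed score carries no linkage meaning. In a typical Siamese setup the shared encoder would be trained on matching and non-matching record pairs (for example with a contrastive loss) so that matching records end up close together in the embedding space.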