{"title":"有效的光谱邻域块实体分辨率","authors":"Liangcai Shu, Aiyou Chen, Ming Xiong, W. Meng","doi":"10.1109/ICDE.2011.5767835","DOIUrl":null,"url":null,"abstract":"In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1)We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":"{\"title\":\"Efficient SPectrAl Neighborhood blocking for entity resolution\",\"authors\":\"Liangcai Shu, Aiyou Chen, Ming Xiong, W. Meng\",\"doi\":\"10.1109/ICDE.2011.5767835\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1)We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.\",\"PeriodicalId\":332374,\"journal\":{\"name\":\"2011 IEEE 27th International Conference on Data Engineering\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"42\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE 27th International Conference on Data Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2011.5767835\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 27th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2011.5767835","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Efficient SPectrAl Neighborhood blocking for entity resolution
In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1)We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.