{"title":"基于hmm的地址解析:在MapReduce上高效解析数十亿个地址","authors":"Xiang Li, Hakan Kardes, Xin Wang, Ang Sun","doi":"10.1145/2666310.2666471","DOIUrl":null,"url":null,"abstract":"Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and address is one of the most commonly used fields in databases. Hence, segmentation of the raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches to build models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses with Hadoop.","PeriodicalId":153031,"journal":{"name":"Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"HMM-based address parsing: efficiently parsing billions of addresses on MapReduce\",\"authors\":\"Xiang Li, Hakan Kardes, Xin Wang, Ang Sun\",\"doi\":\"10.1145/2666310.2666471\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and address is one of the most commonly used fields in databases. Hence, segmentation of the raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches to build models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses with Hadoop.\",\"PeriodicalId\":153031,\"journal\":{\"name\":\"Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2666310.2666471\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2666310.2666471","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HMM-based address parsing: efficiently parsing billions of addresses on MapReduce
Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and address is one of the most commonly used fields in databases. Hence, segmentation of the raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches to build models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses with Hadoop.