{"title":"基于hmm的大规模综合训练数据生成的地址解析","authors":"Xiang Li, Hakan Kardes, Xin Wang, Ang Sun","doi":"10.1145/2663713.2664430","DOIUrl":null,"url":null,"abstract":"Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and address is one of the most commonly used fields in databases. Hence, segmentation of the raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches of synthetic training data generation to build robust models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses.","PeriodicalId":320466,"journal":{"name":"International Workshop on Location and the Web","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"HMM-based Address Parsing with Massive Synthetic Training Data Generation\",\"authors\":\"Xiang Li, Hakan Kardes, Xin Wang, Ang Sun\",\"doi\":\"10.1145/2663713.2664430\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and address is one of the most commonly used fields in databases. Hence, segmentation of the raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches of synthetic training data generation to build robust models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses.\",\"PeriodicalId\":320466,\"journal\":{\"name\":\"International Workshop on Location and the Web\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Workshop on Location and the Web\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2663713.2664430\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Workshop on Location and the Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2663713.2664430","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HMM-based Address Parsing with Massive Synthetic Training Data Generation
Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and address is one of the most commonly used fields in databases. Hence, segmentation of the raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches of synthetic training data generation to build robust models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses.