Chinese Unknown Words Extraction Based on Word-Level Characteristics
Wenbo Pang, Xiaozhong Fan, Yijun Gu, Jiangde Yu
2009 Ninth International Conference on Hybrid Intelligent Systems, 2009-08-12
DOI: 10.1109/HIS.2009.77 (https://doi.org/10.1109/HIS.2009.77)
The automatic recognition of unknown words is an important problem in Chinese information processing. Based on the characteristics of words, this paper proposes a method for recognizing new words from high-frequency strings. First, the high-frequency strings in each individual document are extracted as candidate strings. Next, candidates that do not satisfy the distributional characteristics of words, or that are not used independently as words, are removed. Finally, the entire corpus is segmented with the remaining candidate strings, and word frequencies are counted for further filtering. Experimental results show that, on basketball-related documents downloaded from the Zaobao newspaper, the method achieves an F-score of 79.39%.
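
The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the filtering criteria or the segmentation method, so the sketch assumes a neighbor-variety heuristic as a stand-in for the "independent usage" test, forward maximum matching for segmentation, and arbitrary thresholds. It does not model the word-distribution criterion at all. All function names and parameters are hypothetical.

```python
from collections import Counter


def candidate_strings(doc, min_len=2, max_len=4, min_freq=3):
    """Stage 1: substrings of one document that occur at least min_freq times."""
    counts = Counter(
        doc[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(doc) - n + 1)
    )
    return {s for s, c in counts.items() if c >= min_freq}


def neighbor_variety(doc, s):
    """Count distinct characters seen immediately left and right of s."""
    left, right = set(), set()
    start = doc.find(s)
    while start != -1:
        if start > 0:
            left.add(doc[start - 1])
        end = start + len(s)
        if end < len(doc):
            right.add(doc[end])
        start = doc.find(s, start + 1)
    return len(left), len(right)


def filter_independent(doc, candidates, min_variety=2):
    """Stage 2 (assumed heuristic): keep strings that appear in varied contexts."""
    kept = set()
    for s in candidates:
        n_left, n_right = neighbor_variety(doc, s)
        if n_left >= min_variety and n_right >= min_variety:
            kept.add(s)
    return kept


def segment(text, lexicon, max_len=4):
    """Forward maximum matching; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def extract_unknown_words(docs, min_freq=3, min_corpus_freq=3):
    """Run all three stages and return surviving candidates with corpus counts."""
    lexicon = set()
    for doc in docs:
        candidates = candidate_strings(doc, min_freq=min_freq)
        lexicon |= filter_independent(doc, candidates)
    # Stage 3: segment the whole corpus and count how often each candidate is used.
    counts = Counter(
        tok for doc in docs for tok in segment(doc, lexicon) if tok in lexicon
    )
    return {w: c for w, c in counts.items() if c >= min_corpus_freq}


if __name__ == "__main__":
    sample = "我看姚明打球。他说姚明很高。你和姚明去吗？"
    print(extract_unknown_words([sample]))
```

In the sample sentence, the name 姚明 is the only string that repeats three times and is surrounded by different characters each time, so it is the only candidate that survives both filters. A real system would need thresholds tuned on a corpus and a filter that reflects the paper's actual word-level criteria.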