Wei Wei, Wei Liu, Beibei Zhang, Rafał Scherer, Robertas Damaševičius
Journal: 網際網路技術學刊 (Journal of Internet Technology). Published: 2023-07-01. Type: Journal Article. DOI: https://doi.org/10.53106/160792642023072404010
Discovery of New Words in Tax-related Fields Based on Word Vector Representation
New word detection, a fundamental task in natural language processing, has attracted wide attention from academia and industry. When existing Chinese word segmentation technology is applied to the tax-related finance domain, it fails to identify domain-specific new words correctly, which degrades subsequent information extraction and entity recognition. To address this problem, this paper proposes a new word detection method that combines statistical features, namely an internal cohesion measure and branch entropy, with word vector representation. First, the corpus is preprocessed by word segmentation, the internal cohesion of candidate words is computed from the mutual information of the scattered strings, and candidate two-tuples are selected, then filtered and expanded. Next, the boundaries of new words are fixed by computing the branch entropy. Finally, the new-word dictionary is expanded according to the cosine similarity of the word vector representations. The proposed unsupervised method allows the neologism lexicon to grow automatically, and experimental results on a large-scale corpus verify its effectiveness.
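The three signals named in the abstract (mutual information for internal cohesion, branch entropy for word boundaries, and cosine similarity for lexicon expansion) can be illustrated with a minimal sketch. This is not the authors' implementation: pointwise mutual information stands in for the cohesion statistic, and the toy corpus, the toy word vectors, the thresholds, and all function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information of every adjacent token pair (cohesion)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def entropy(counter):
    """Shannon entropy (in bits) of a neighbor-frequency counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def branch_entropy(tokens, pair):
    """Entropy of the tokens seen to the left and to the right of `pair`."""
    left, right = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) == pair:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[tokens[i + 2]] += 1
    return entropy(left), entropy(right)


def candidate_new_words(tokens, pmi_min=1.0, entropy_min=1.0):
    """Keep pairs that are cohesive inside and varied at both boundaries."""
    found = []
    for pair, pmi in pmi_scores(tokens).items():
        left_h, right_h = branch_entropy(tokens, pair)
        if pmi >= pmi_min and min(left_h, right_h) >= entropy_min:
            found.append("".join(pair))
    return found


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_lexicon(seeds, vectors, sim_min=0.8):
    """Add every word whose vector is close to some seed word's vector."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        for seed in seeds:
            if seed in vectors and cosine_similarity(vec, vectors[seed]) >= sim_min:
                expanded.add(word)
    return expanded


# Toy segmented corpus (assumption): the segmenter split "增值税" into two tokens.
tokens = ["缴纳", "增值", "税", "的", "发票", "申报", "增值", "税",
          "和", "计算", "增值", "税", "是"]
# Toy 2-dimensional word vectors (assumption); real ones would be trained.
vectors = {"增值税": [0.9, 0.1], "消费税": [0.8, 0.2], "天气": [0.1, 0.9]}

new_words = candidate_new_words(tokens)
lexicon = expand_lexicon(new_words, vectors)
print(new_words)
print(sorted(lexicon))
```

On this toy input only the pair "增值" + "税" passes both thresholds, because it is the only pair that recurs with varied neighbors on both sides. In practice the thresholds would need tuning on the target corpus, and the vectors would come from a trained embedding model rather than hand-written values.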