Discovery of New Words in Tax-related Fields Based on Word Vector Representation

Wei Wei, Wei Liu, Beibei Zhang, Rafał Scherer, Robertas Damaševičius
DOI: 10.53106/160792642023072404010 (https://doi.org/10.53106/160792642023072404010)
Journal: 網際網路技術學刊
Publication date: 2023-07-01
Publication type: Journal Article
Citations: 0

Abstract

New word detection, a basic research task in natural language processing, has attracted wide attention from the academic and business communities. When existing Chinese word segmentation technology is applied to the specialized field of tax-related finance, it cannot correctly identify the field's new words, which affects subsequent information extraction and entity recognition. To address these problems, this paper proposes a new word detection method that uses statistical features based on an internal measure and branch entropy, combined with word vector representation. First, the corpus is preprocessed by word segmentation, the internal cohesion of words is calculated from the mutual information of scattered strings, and candidate two-tuples are selected, then filtered and expanded. Next, the boundaries of new words are fixed by calculating branch entropy. Finally, the new-word dictionary is expanded according to the cosine similarity of word vector representations. The proposed unsupervised method allows the new-word lexicon to grow automatically, and experimental results on a large-scale corpus verify its effectiveness.
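
The three statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so the code assumes the textbook definitions of pointwise mutual information (for internal cohesion of a two-tuple), Shannon entropy over neighbouring tokens (for branch entropy), and cosine similarity between vectors. The function names and the toy corpus are hypothetical, and the filtering thresholds and the two-tuple expansion step are left out because the abstract does not specify them.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Pointwise mutual information of every adjacent token pair.

    Assumption: standard PMI stands in for the paper's cohesion measure.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): math.log2(
            (count / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni))
        )
        for (a, b), count in bigrams.items()
    }


def _entropy(counter):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def branch_entropy(sentences, pair):
    """Return (left, right) entropy of the tokens adjacent to `pair`.

    High entropy on both sides suggests the pair has free boundaries.
    """
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - 1):
            if (tokens[i], tokens[i + 1]) == pair:
                if i > 0:
                    left[tokens[i - 1]] += 1
                if i + 2 < len(tokens):
                    right[tokens[i + 2]] += 1
    return _entropy(left), _entropy(right)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical pre-segmented toy corpus (not from the paper).
sentences = [
    ["企业", "申请", "留抵", "退税", "政策"],
    ["办理", "留抵", "退税", "手续"],
    ["完成", "留抵", "退税", "审核"],
    ["企业", "办理", "手续"],
]

pmi = pmi_scores(sentences)
candidate = ("留抵", "退税")
print(pmi[candidate], branch_entropy(sentences, candidate))
```

In this toy corpus the pair ("留抵", "退税") always occurs together and is surrounded by varied neighbours, so it scores high on both cohesion and branch entropy, which is the pattern a new-word candidate is expected to show. In a full pipeline, `cosine_similarity` would be applied to trained word vectors to pull in terms close to the accepted new words; the vectors themselves would come from a separate embedding model that is outside this sketch.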