Contextual Word2Vec Model for Understanding Chinese Out of Vocabularies on Online Social Media
Authors: Jiakai Gu, Gen Li, Nam D. Vo, Jason J. Jung
Journal: International Journal on Semantic Web and Information Systems, pages 1-14 (JCR Q2, Computer Science, Artificial Intelligence; impact factor 4.1)
Published: 2022-01-01
DOI: https://doi.org/10.4018/ijswis.309428
Citations: 5
Abstract
In this chapter, the authors propose a contextual Word2Vec model for understanding out-of-vocabulary (OOV) words. OOV words are extracted using left-right entropy and point information entropy. Word2Vec is used to construct the word vector space, and CBOW (continuous bag of words) is used to capture the contextual information of each word. If a known word has contextual information similar to that of an OOV word, the known word can be used to interpret the OOV word. The experiments use a Weibo corpus as the dataset. The results show that the proposed model achieves 97.10% accuracy, which is 8.53% higher than Skip-Gram.
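The abstract does not spell out how left-right entropy is computed, so the following is only a minimal sketch of one common formulation: for a candidate string, collect the characters that appear immediately to its left and right in the corpus and take the Shannon entropy of each distribution. The function names `shannon_entropy` and `left_right_entropy` and the toy corpus are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(corpus, candidate):
    """Return (left_entropy, right_entropy) for `candidate` in `corpus`.

    Assumption: neighbors are single characters, which suits unsegmented
    Chinese text; the paper may use a different unit or smoothing.
    """
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[corpus[start - 1]] += 1   # character just before the match
        if end < len(corpus):
            right[corpus[end]] += 1        # character just after the match
        start = corpus.find(candidate, start + 1)
    return shannon_entropy(left), shannon_entropy(right)


if __name__ == "__main__":
    toy_corpus = "aXYbcXYd"  # hypothetical toy text, not Weibo data
    print(left_right_entropy(toy_corpus, "XY"))
    print(left_right_entropy(toy_corpus, "X"))
```

In this toy text, "XY" is preceded and followed by varying characters, so both entropies are high, whereas "X" is always followed by "Y", so its right entropy is zero. Under this formulation, a candidate with high entropy on both sides behaves like a free-standing unit and is a plausible new word, while low entropy on one side suggests it is a fragment of a longer word.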
About the journal:
The International Journal on Semantic Web and Information Systems (IJSWIS) promotes a knowledge transfer channel where academics, practitioners, and researchers can discuss, analyze, criticize, synthesize, communicate, elaborate, and simplify the more-than-promising technology of the semantic Web in the context of information systems. The journal aims to establish value-adding knowledge transfer and personal development channels in three distinctive areas: academia, industry, and government.