Sentiment Classification: Review of Text Vectorization Methods: Bag of Words, Tf-Idf, Word2vec and Doc2vec
Haisal Dauda Abubakar, M. Umar
SLU Journal of Science and Technology, published 2022-08-20
DOI: 10.56471/slujst.v4i.266 (https://doi.org/10.56471/slujst.v4i.266)
Citations: 4
Abstract
Sentiment Classification: Review of Text Vectorization Methods: Bag of Words, Tf-Idf, Word2vec and Doc2vec
In sentiment analysis there are three (3) approaches, namely machine learning, lexicon-based and rule-based approaches. This study investigates machine learning approaches, which involve text vectorization or word embedding, an essential step in natural language processing tasks because most machine learning algorithms work with numerical input. Text vectorization is the representation, or mapping, of the words or documents of a corpus as vectors of real numbers. The literature offers several approaches to document/text representation; this study focuses on the commonly used ones, namely bag of words, TF-IDF, word2vec and doc2vec, and tries to identify the reasons behind their use, as a review and recommendation for researchers in a hurry. The review shows that TF-IDF feature vector representations generally outperform the word2vec and doc2vec vectorization methods, specifically in book review sentiment classification. TF-IDF is therefore recommended for future studies on book review data sets.
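To illustrate the mapping from documents to numerical vectors that the abstract describes, the following is a minimal sketch of bag-of-words and TF-IDF vectorization in plain Python. It is not the authors' implementation: the function names `tokenize`, `bag_of_words` and `tf_idf`, the whitespace tokenizer, and the toy reviews are assumptions made for illustration. The weighting used here is the plain textbook form, term frequency times log(N / document frequency); libraries often use smoothed or normalized variants, so their numbers may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors), one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    # Sorted vocabulary gives every document the same column order.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) using tf * log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = []
    for row in counts:
        total = sum(row)  # document length in tokens
        vectors.append([(c / total) * w if total else 0.0
                        for c, w in zip(row, idf)])
    return vocab, vectors


# Hypothetical toy "book reviews", not data from the study.
reviews = ["good book", "bad book", "good good plot"]
vocab, counts = bag_of_words(reviews)
_, weights = tf_idf(reviews)
print(vocab)
print(counts)
print(weights)
```

Each row of `counts` or `weights` is the feature vector a classifier would receive for one review. Terms that appear in many documents, such as "book" here, get a lower TF-IDF weight than rarer terms such as "plot".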