预处理策略对利用TF-IDF和BOW特征的ML分类器性能的影响

IF 1.7 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal Pub Date : 2020-06-18 DOI:10.14201/adcaij2020924968
Amit Pimpalkar, R. Raj
{"title":"预处理策略对利用TF-IDF和BOW特征的ML分类器性能的影响","authors":"Amit Pimpalkar, R. Raj","doi":"10.14201/adcaij2020924968","DOIUrl":null,"url":null,"abstract":"Data analytics and its associated applications have recently become impor-tant fields of study. The subject of concern for researchers now-a-days is a massive amount of data produced every minute and second as people con-stantly sharing thoughts, opinions about things that are associated with them. Social media info, however, is still unstructured, disseminated and hard to handle and need to be developed a strong foundation so that they can be utilized as valuable information on a particular topic. Processing such unstructured data in this area in terms of noise, co-relevance, emoticons, folksonomies and slangs is really quite challenging and therefore requires proper data pre-processing before getting the right sentiments. The dataset is extracted from Kaggle and Twitter, pre-processing performed using NLTK and Scikit-learn and features selection and extraction is done for Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) scheme. \nFor polarity identification, we evaluated five different Machine Learning (ML) algorithms viz Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We have performed a comparative analysis of the success for these algorithms in order to decide which algorithm works best for the given data-set in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets; one with domain and other not. It is demonstrated that SVM classifier outperformed the other classifiers with superior evaluations of 73.12% and 94.91% for accuracy and precision respectively. It is also highlighted in this research that the selection and representation of features along with various pre-processing techniques have a positive impact on the performance of the classification.  The ultimate outcome indicates an improvement in sentiment classification and we noted that pre-processing approaches obviously suggest an improvement in the efficiency of the classifiers.","PeriodicalId":42597,"journal":{"name":"ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2020-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features\",\"authors\":\"Amit Pimpalkar, R. Raj\",\"doi\":\"10.14201/adcaij2020924968\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data analytics and its associated applications have recently become impor-tant fields of study. The subject of concern for researchers now-a-days is a massive amount of data produced every minute and second as people con-stantly sharing thoughts, opinions about things that are associated with them. Social media info, however, is still unstructured, disseminated and hard to handle and need to be developed a strong foundation so that they can be utilized as valuable information on a particular topic. Processing such unstructured data in this area in terms of noise, co-relevance, emoticons, folksonomies and slangs is really quite challenging and therefore requires proper data pre-processing before getting the right sentiments. The dataset is extracted from Kaggle and Twitter, pre-processing performed using NLTK and Scikit-learn and features selection and extraction is done for Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) scheme. \\nFor polarity identification, we evaluated five different Machine Learning (ML) algorithms viz Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We have performed a comparative analysis of the success for these algorithms in order to decide which algorithm works best for the given data-set in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets; one with domain and other not. It is demonstrated that SVM classifier outperformed the other classifiers with superior evaluations of 73.12% and 94.91% for accuracy and precision respectively. It is also highlighted in this research that the selection and representation of features along with various pre-processing techniques have a positive impact on the performance of the classification.  The ultimate outcome indicates an improvement in sentiment classification and we noted that pre-processing approaches obviously suggest an improvement in the efficiency of the classifiers.\",\"PeriodicalId\":42597,\"journal\":{\"name\":\"ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2020-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14201/adcaij2020924968\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14201/adcaij2020924968","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 21

摘要

数据分析及其相关应用近年来已成为重要的研究领域。如今,研究人员关注的主题是每分每秒产生的大量数据,因为人们不断地分享对与他们相关的事物的想法和观点。然而,社交媒体信息仍然是非结构化的、分散的、难以处理的,需要建立一个坚实的基础,这样它们才能被用作特定主题的有价值的信息。在噪声、关联、表情符号、大众分类法和俚语等方面处理这一领域的非结构化数据确实非常具有挑战性,因此需要在获得正确的情感之前进行适当的数据预处理。从Kaggle和Twitter中提取数据集,使用NLTK和Scikit-learn进行预处理,并对Words Bag (BOW)、Term Frequency (TF)和Inverse Document Frequency (IDF)方案进行特征选择和提取。对于极性识别,我们评估了五种不同的机器学习(ML)算法,即多项式朴素贝叶斯(MNB)、逻辑回归(LR)、决策树(DT)、XGBoost (XGB)和支持向量机(SVM)。我们对这些算法的成功进行了比较分析,以确定哪种算法在召回率、准确性、f1分数和精度方面最适合给定的数据集。我们评估了各种预处理技术对两个数据集的影响;一个有定义域,另一个没有。结果表明,SVM分类器的准确率和精密度分别达到73.12%和94.91%,优于其他分类器。本研究还强调了特征的选择和表示以及各种预处理技术对分类性能的积极影响。最终结果表明情绪分类有所改善,我们注意到预处理方法明显表明分类器的效率有所提高。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features
Data analytics and its associated applications have recently become impor-tant fields of study. The subject of concern for researchers now-a-days is a massive amount of data produced every minute and second as people con-stantly sharing thoughts, opinions about things that are associated with them. Social media info, however, is still unstructured, disseminated and hard to handle and need to be developed a strong foundation so that they can be utilized as valuable information on a particular topic. Processing such unstructured data in this area in terms of noise, co-relevance, emoticons, folksonomies and slangs is really quite challenging and therefore requires proper data pre-processing before getting the right sentiments. The dataset is extracted from Kaggle and Twitter, pre-processing performed using NLTK and Scikit-learn and features selection and extraction is done for Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) scheme. For polarity identification, we evaluated five different Machine Learning (ML) algorithms viz Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We have performed a comparative analysis of the success for these algorithms in order to decide which algorithm works best for the given data-set in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets; one with domain and other not. It is demonstrated that SVM classifier outperformed the other classifiers with superior evaluations of 73.12% and 94.91% for accuracy and precision respectively. It is also highlighted in this research that the selection and representation of features along with various pre-processing techniques have a positive impact on the performance of the classification.  The ultimate outcome indicates an improvement in sentiment classification and we noted that pre-processing approaches obviously suggest an improvement in the efficiency of the classifiers.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
1.40
自引率
0.00%
发文量
22
审稿时长
4 weeks
期刊最新文献
Enhancing Energy Efficiency in Cluster Based WSN using Grey Wolf Optimization Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation Healthcare Data Collection Using Internet of Things and Blockchain Based Decentralized Data Storage Development of an Extended Medical Diagnostic System for Typhoid and Malaria Fever Comparison of Swarm-based Metaheuristic and Gradient Descent-based Algorithms in Artificial Neural Network Training
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1