A Supervised Framework for Review Spam Detection in the Persian Language

2019 5th International Conference on Web Research (ICWR), pp. 203-207. Pub Date: 2019-04-01. DOI: 10.1109/ICWR.2019.8765275

Mohammad Ehsan Basiri, Neshat Safarian, Hadi Khosravi Farsani


Citations: 6

Abstract



Sentiment analysis of online reviews has attracted increasing attention from both academia and industry. Although online reviews are valuable sources of information for detecting public opinion towards different aspects of products, they may be written by spammers with different purposes. Several methods have been proposed to detect such spam reviews in English, but no study has been reported on Persian spam detection so far. In the current study, Persian reviews of cell phones are investigated to find spam of type 1 and type 2, which are fake reviews and reviews written only about brands, respectively. In the proposed framework, a labeled dataset, SpamPer, is first created using majority voting on the answers to 11 questions previously designed for spam detection by human annotators. Then several preprocessing steps for the Persian language are performed to refine the training data. Finally, review-based and metadata features are extracted. The results obtained on 3000 reviews from SpamPer show that the highest accuracy is obtained using the decision tree, with an F1-measure of 0.78. Moreover, the results reveal that SVM for unbalanced data and decision tree for balanced data achieve better performance when they are trained on the combination of metadata and review-based features.
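The labeling and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the voting scheme, so the sketch assumes that each review receives 11 yes/no answers and is labeled spam when more than half of them are "yes". The function names `majority_vote` and `f1_measure` and the toy data are hypothetical and are not taken from the paper or from SpamPer.

```python
from typing import Sequence


def majority_vote(answers: Sequence[bool]) -> bool:
    """Label a review as spam (True) when strictly more than half of the
    answers are True. ASSUMPTION: one yes/no answer per question."""
    if not answers:
        raise ValueError("at least one answer is required")
    return 2 * sum(answers) > len(answers)


def f1_measure(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for the positive (spam) class: the harmonic mean of precision
    and recall, 2 * P * R / (P + R)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answers to the 11 questions for a single review.
    answers = [True] * 6 + [False] * 5
    print("spam" if majority_vote(answers) else "not spam")

    # Hypothetical gold labels and classifier predictions for four reviews.
    gold = [True, True, False, False]
    predicted = [True, False, True, False]
    print(f1_measure(gold, predicted))
```

Because 11 is odd, a binary vote of this kind cannot tie. A real pipeline would more likely rely on an existing library implementation of the F1-measure; the hand-written version only makes the definition explicit.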
