MalCov: Covid-19 Fake News Dataset in the Malay Language

N. H. A. Rahim, M. Basri
Published in: 2022 International Visualization, Informatics and Technology Conference (IVIT), 2022-11-01
DOI: https://doi.org/10.1109/IVIT55443.2022.10033374
Citation count: 1

Abstract

The COVID-19 pandemic has drastically changed the world. As the virus spread worldwide, misinformation about COVID-19 spread with it and created chaos in society. Inaccurate use of infodemic terminology produced misleading information about the disease, causing panic and confusion among the public and miscommunication between the government and the public. Several recent attempts have used automated classification with machine learning models to curb the spread of this fake news. These methods require labeled data, but the scarcity of corpora for predictive modeling, particularly in languages other than English, is a major barrier in this area. To date, this research may be the first step toward an extensive study of fake news detection in the Malay language. For this purpose we introduce MalCov (Malaysia Covid), a fake news dataset. Fake articles, gathered from the main social media platforms, make up 79.5% of MalCov, or approximately 171 statements. The remaining statements are valid articles that have been checked and manually validated by the local authorities. All articles were collected from a single portal, "Sebenarnya.my". Because the dataset is in a non-English language, the data is separated into contents and titles, and the most frequently used words are then analyzed. Several machine learning models, such as Naïve Bayes, SVM, and Logistic Regression, are used to build the classifiers. The decision tree achieves the highest performance, 93.48%.

Keywords: dataset; fake news; fake news detection; machine learning classification; Malay language.
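The class split and the word-frequency step described in the abstract can be illustrated with a minimal sketch. It assumes the 79.5% share and the count of 171 refer to the same set of fake statements, so the total and the number of valid statements are only estimates. The function names and the sample strings are hypothetical placeholders, not records from MalCov, and the tokenizer is a simple stand-in for whatever preprocessing the authors used.

```python
import re
from collections import Counter

# Figures quoted in the abstract.
FAKE_SHARE = 0.795  # fraction of statements labeled fake
FAKE_COUNT = 171    # approximate number of fake statements


def estimate_class_counts(fake_count, fake_share):
    """Estimate (total, valid) statement counts from the fake count and share."""
    total = round(fake_count / fake_share)
    return total, total - fake_count


def top_words(texts, n=3):
    """Return the n most frequent lowercase tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        # Keep letters, digits, and hyphens so "covid-19" stays one token.
        counts.update(re.findall(r"[a-z0-9-]+", text.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    total, valid = estimate_class_counts(FAKE_COUNT, FAKE_SHARE)
    print("estimated total statements:", total)
    print("estimated valid statements:", valid)

    # Hypothetical placeholder titles, analyzed separately from contents
    # as the abstract describes.
    sample_titles = ["contoh tajuk berita covid-19", "contoh tajuk vaksin"]
    print(top_words(sample_titles, n=2))
```

Under that assumption the dataset would hold roughly 215 statements, about 44 of them valid. The abstract does not name the metric behind the 93.48% figure. If it is accuracy, a classifier that labels every statement as fake would already score close to 79.5% on a split this imbalanced, which is useful context for reading the result.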