Clickbait Detection in Indonesian News Title with Gray Unbalanced Class Based on BERT

Pub Date: 2023-01-01 DOI: 10.12720/jait.14.2.233-241
P. Andono, Pieter Santoso Hadi, Muljono Muljono, Catur Supriyanto
Citations: 0

Abstract

Bahasa Indonesia is spoken by about 263 million people worldwide, yet it is classified as an under-resourced language. Clickbait in news has gained attention in recent years, but resources for clickbait tasks in Indonesian remain scarce. Clickbait attracts readers' attention even though the content is uninformative and misleading. Clickbait datasets are imbalanced: the classes are unequally distributed, which affects the classification result. This research proposes focal loss to improve classification accuracy without reducing the amount of original data. Clickbait data are normally separated into two classes, clickbait and non-clickbait. However, some titles are difficult to categorize, even for humans. This study therefore categorizes titles into three classes: clickbait, non-clickbait, and gray-clickbait. The proposed method achieves an accuracy of 93.4% on two-class classification, which is better than previous studies, but only 73.3% on three-class classification. Our research shows a high similarity between gray-clickbait and clickbait data, which makes classification more challenging. Moreover, titles alone are not enough to provide better classification performance in the three-class setting.
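
The abstract names focal loss as the remedy for class imbalance but gives no formula or hyperparameters. As a rough illustration only, the sketch below implements the standard multi-class focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. The values `gamma=2.0` and the optional per-class weights `alpha` are assumptions chosen for illustration, not settings reported by the authors, and the function names are hypothetical. In the paper the logits would come from a BERT classifier; here they are hand-written numbers.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one example.

    logits: raw scores, one per class
    target: index of the true class
    gamma:  focusing parameter (assumed value; gamma=0 gives cross-entropy)
    alpha:  optional per-class weights (assumed; None means no weighting)
    """
    p_t = softmax(logits)[target]          # probability of the true class
    weight = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma shrinks the loss of well-classified examples
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)


if __name__ == "__main__":
    # Hypothetical class order: 0 = non-clickbait, 1 = clickbait, 2 = gray
    easy = [5.0, 0.0, 0.0]   # confident and correct for class 0
    hard = [0.0, 5.0, 0.0]   # confident and wrong for class 0
    for name, logits in (("easy", easy), ("hard", hard)):
        ce = cross_entropy(logits, 0)
        fl = focal_loss(logits, 0)
        print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f}")
```

Relative to cross-entropy, the focusing term reduces the contribution of examples the model already classifies confidently, so training is not dominated by the many easy examples of the larger class. This is how the loss can address imbalance without resampling or discarding data.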