Clickbait Detection of Indonesian News Headlines using Fine-Tune Bidirectional Encoder Representations from Transformers (BERT)

Diyah Utami Kusumaning Putri, Dinar Nugroho Pratomo
{"title":"利用变形金刚(BERT)的微调双向编码器表征检测印尼新闻标题标题","authors":"Diyah Utami Kusumaning Putri, Dinar Nugroho Pratomo","doi":"10.25139/inform.v7i2.4686","DOIUrl":null,"url":null,"abstract":"The problem of the existence of news article that does not match with content, called clickbait, has seriously interfered readers from getting the information they expect. The number of clickbait news continues significantly increased in recent years. According to this problem, a clickbait detector is required to automatically identify news article headlines that include clickbait and non-clickbait. Additionally, many currently existing solutions use handcrafted features and traditional machine learning methods, which limit the generalization. Therefore, this study fine-tunes the Bidirectional Encoder Representations from Transformers (BERT) and uses the Indonesian news headlines dataset CLICK-ID to predict clickbait (BERT). In this research, we use IndoBERT as the pre-trained model, a state-of-the-art BERT-based language model for Indonesian. Then, the usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers with different pre-trained models with that of two word-vectors-based approaches (i.e., bag-of-words and TF-IDF) and five machine learning classifiers (i.e., NB, KNN, SVM, DT, and RF). The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vectors-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. The IndoBERTBASE using the two training phases model gets the highest accuracy of 0.8247, which is 0.064 (6%), outperforming the SVM classifier's accuracy with the bag-of-words model 0.7607.","PeriodicalId":52760,"journal":{"name":"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi","volume":"17 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Clickbait Detection of Indonesian News Headlines using Fine-Tune Bidirectional Encoder Representations from Transformers (BERT)\",\"authors\":\"Diyah Utami Kusumaning Putri, Dinar Nugroho Pratomo\",\"doi\":\"10.25139/inform.v7i2.4686\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The problem of the existence of news article that does not match with content, called clickbait, has seriously interfered readers from getting the information they expect. The number of clickbait news continues significantly increased in recent years. According to this problem, a clickbait detector is required to automatically identify news article headlines that include clickbait and non-clickbait. Additionally, many currently existing solutions use handcrafted features and traditional machine learning methods, which limit the generalization. Therefore, this study fine-tunes the Bidirectional Encoder Representations from Transformers (BERT) and uses the Indonesian news headlines dataset CLICK-ID to predict clickbait (BERT). In this research, we use IndoBERT as the pre-trained model, a state-of-the-art BERT-based language model for Indonesian. Then, the usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers with different pre-trained models with that of two word-vectors-based approaches (i.e., bag-of-words and TF-IDF) and five machine learning classifiers (i.e., NB, KNN, SVM, DT, and RF). 
The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vectors-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. The IndoBERTBASE using the two training phases model gets the highest accuracy of 0.8247, which is 0.064 (6%), outperforming the SVM classifier's accuracy with the bag-of-words model 0.7607.\",\"PeriodicalId\":52760,\"journal\":{\"name\":\"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi\",\"volume\":\"17 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.25139/inform.v7i2.4686\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.25139/inform.v7i2.4686","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

News articles whose headlines do not match their content, known as clickbait, seriously interfere with readers getting the information they expect, and the number of clickbait news items has increased significantly in recent years. This problem calls for a clickbait detector that automatically identifies whether a news headline is clickbait or non-clickbait. In addition, many existing solutions rely on handcrafted features and traditional machine learning methods, which limits their ability to generalize. This study therefore fine-tunes Bidirectional Encoder Representations from Transformers (BERT) on the Indonesian news headlines dataset CLICK-ID to predict clickbait. We use IndoBERT, a state-of-the-art BERT-based language model for Indonesian, as the pre-trained model. The usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers built from different pre-trained models against two word-vector-based approaches (bag-of-words and TF-IDF) combined with five machine learning classifiers (NB, KNN, SVM, DT, and RF). The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vector-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. IndoBERT-BASE with the two-training-phases model achieves the highest accuracy of 0.8247, outperforming the SVM classifier with the bag-of-words model (0.7607) by 0.064 (about 6 percentage points).
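As a concrete illustration of the fine-tuning approach described in the abstract, the sketch below shows how an IndoBERT checkpoint could be fine-tuned for binary clickbait classification with the Hugging Face Transformers library. This is not the authors' code: the checkpoint name (indobenchmark/indobert-base-p1), the CSV file names, the column names, and the hyperparameters are assumptions for illustration only.

```python
# Minimal sketch (not the paper's code) of fine-tuning an IndoBERT model for
# binary clickbait classification of Indonesian headlines.
# Checkpoint, file names, column names, and hyperparameters are assumed placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed IndoBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Assume CSV files with columns "title" (headline text) and
# "label" (0 = non-clickbait, 1 = clickbait).
data = load_dataset("csv", data_files={"train": "click_id_train.csv",
                                       "test": "click_id_test.csv"})

def tokenize(batch):
    # Headlines are short, so a small max_length keeps training cheap.
    return tokenizer(batch["title"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)
data = data.rename_column("label", "labels")
data.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

args = TrainingArguments(
    output_dir="indobert-clickbait",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a compute_metrics fn for accuracy
```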
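For comparison, one of the word-vector baselines mentioned in the abstract (TF-IDF features with an SVM) can be reproduced in a few lines of scikit-learn. Again, this is only a sketch: the file name, column names, and train/test split are assumptions, not details taken from the paper.

```python
# Minimal sketch of a word-vector baseline: TF-IDF features + linear SVM (scikit-learn).
# File name, column names, and the train/test split are assumed placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("click_id.csv")  # assumed columns: "title", "label"
X_train, X_test, y_train, y_test = train_test_split(
    df["title"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

# Swap TfidfVectorizer for CountVectorizer to get the bag-of-words variant.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
baseline.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```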
Source journal: Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi
Self-citation rate: 0.00%
Articles published: 31
Review time: 10 weeks