Clickbait Detection of Indonesian News Headlines using Fine-Tune Bidirectional Encoder Representations from Transformers (BERT)

Diyah Utami Kusumaning Putri, Dinar Nugroho Pratomo
{"title":"利用变形金刚(BERT)的微调双向编码器表征检测印尼新闻标题标题","authors":"Diyah Utami Kusumaning Putri, Dinar Nugroho Pratomo","doi":"10.25139/inform.v7i2.4686","DOIUrl":null,"url":null,"abstract":"The problem of the existence of news article that does not match with content, called clickbait, has seriously interfered readers from getting the information they expect. The number of clickbait news continues significantly increased in recent years. According to this problem, a clickbait detector is required to automatically identify news article headlines that include clickbait and non-clickbait. Additionally, many currently existing solutions use handcrafted features and traditional machine learning methods, which limit the generalization. Therefore, this study fine-tunes the Bidirectional Encoder Representations from Transformers (BERT) and uses the Indonesian news headlines dataset CLICK-ID to predict clickbait (BERT). In this research, we use IndoBERT as the pre-trained model, a state-of-the-art BERT-based language model for Indonesian. Then, the usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers with different pre-trained models with that of two word-vectors-based approaches (i.e., bag-of-words and TF-IDF) and five machine learning classifiers (i.e., NB, KNN, SVM, DT, and RF). The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vectors-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. The IndoBERTBASE using the two training phases model gets the highest accuracy of 0.8247, which is 0.064 (6%), outperforming the SVM classifier's accuracy with the bag-of-words model 0.7607.","PeriodicalId":52760,"journal":{"name":"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi","volume":"17 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Clickbait Detection of Indonesian News Headlines using Fine-Tune Bidirectional Encoder Representations from Transformers (BERT)\",\"authors\":\"Diyah Utami Kusumaning Putri, Dinar Nugroho Pratomo\",\"doi\":\"10.25139/inform.v7i2.4686\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The problem of the existence of news article that does not match with content, called clickbait, has seriously interfered readers from getting the information they expect. The number of clickbait news continues significantly increased in recent years. According to this problem, a clickbait detector is required to automatically identify news article headlines that include clickbait and non-clickbait. Additionally, many currently existing solutions use handcrafted features and traditional machine learning methods, which limit the generalization. Therefore, this study fine-tunes the Bidirectional Encoder Representations from Transformers (BERT) and uses the Indonesian news headlines dataset CLICK-ID to predict clickbait (BERT). In this research, we use IndoBERT as the pre-trained model, a state-of-the-art BERT-based language model for Indonesian. Then, the usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers with different pre-trained models with that of two word-vectors-based approaches (i.e., bag-of-words and TF-IDF) and five machine learning classifiers (i.e., NB, KNN, SVM, DT, and RF). 
The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vectors-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. The IndoBERTBASE using the two training phases model gets the highest accuracy of 0.8247, which is 0.064 (6%), outperforming the SVM classifier's accuracy with the bag-of-words model 0.7607.\",\"PeriodicalId\":52760,\"journal\":{\"name\":\"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi\",\"volume\":\"17 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.25139/inform.v7i2.4686\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.25139/inform.v7i2.4686","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

News articles whose headlines do not match their content, known as clickbait, seriously interfere with readers getting the information they expect, and the number of clickbait news items has increased significantly in recent years. This problem calls for a clickbait detector that automatically identifies whether a news headline is clickbait or non-clickbait. In addition, many existing solutions rely on handcrafted features and traditional machine learning methods, which limits their ability to generalize. This study therefore fine-tunes Bidirectional Encoder Representations from Transformers (BERT) on the Indonesian news headlines dataset CLICK-ID to predict clickbait. We use IndoBERT, a state-of-the-art BERT-based language model for Indonesian, as the pre-trained model. The usefulness of BERT-based classifiers is then assessed by comparing the performance of IndoBERT classifiers built from different pre-trained models against two word-vector-based approaches (bag-of-words and TF-IDF) combined with five machine learning classifiers (NB, KNN, SVM, DT, and RF). The evaluation results indicate that all fine-tuned IndoBERT classifiers outperform all word-vector-based machine learning classifiers in classifying clickbait and non-clickbait Indonesian news headlines. IndoBERT-BASE with the two-training-phases model achieves the highest accuracy of 0.8247, outperforming the SVM classifier with the bag-of-words model (0.7607) by 0.064 (about 6 percentage points).
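As a concrete illustration of the fine-tuning approach described in the abstract, the sketch below shows how an IndoBERT checkpoint could be fine-tuned for binary clickbait classification with the Hugging Face Transformers library. This is not the authors' code: the checkpoint name (indobenchmark/indobert-base-p1), the CSV file names, the column names, and the hyperparameters are assumptions for illustration only.

```python
# Minimal sketch (not the paper's code) of fine-tuning an IndoBERT model for
# binary clickbait classification of Indonesian headlines.
# Checkpoint, file names, column names, and hyperparameters are assumed placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed IndoBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Assume CSV files with columns "title" (headline text) and
# "label" (0 = non-clickbait, 1 = clickbait).
data = load_dataset("csv", data_files={"train": "click_id_train.csv",
                                       "test": "click_id_test.csv"})

def tokenize(batch):
    # Headlines are short, so a small max_length keeps training cheap.
    return tokenizer(batch["title"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)
data = data.rename_column("label", "labels")
data.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

args = TrainingArguments(
    output_dir="indobert-clickbait",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a compute_metrics fn for accuracy
```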
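For comparison, one of the word-vector baselines mentioned in the abstract (TF-IDF features with an SVM) can be reproduced in a few lines of scikit-learn. Again, this is only a sketch: the file name, column names, and train/test split are assumptions, not details taken from the paper.

```python
# Minimal sketch of a word-vector baseline: TF-IDF features + linear SVM (scikit-learn).
# File name, column names, and the train/test split are assumed placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("click_id.csv")  # assumed columns: "title", "label"
X_train, X_test, y_train, y_test = train_test_split(
    df["title"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

# Swap TfidfVectorizer for CountVectorizer to get the bag-of-words variant.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
baseline.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```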
Source journal: Inform Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi
Self-citation rate: 0.00%
Articles published: 31
Review time: 10 weeks