The Impact of Data Preparation and Model Complexity on the Natural Language Classification of Chinese News Headlines

Algorithms. Pub Date: 2024-03-22. DOI: 10.3390/a17040132
Torrey Wagner, Dennis Guhl, Brent Langhals

Abstract

Given the emergence of China as a political and economic power in the 21st century, there is increased interest in analyzing Chinese news articles to better understand developing trends in China. Because of the volume of material, automatically categorizing Chinese-language news articles by their headlines can be an effective way to sort them for efficient review. A 383,000-headline dataset from the Toutiao website, labeled with 15 categories, was evaluated via natural language processing to predict topic categories. The influence of six data preparation variations on the predictive accuracy of four algorithms was studied. The simplest model (Naïve Bayes) achieved 85.1% accuracy on a holdout dataset, while the most complex model (a neural network using BERT) achieved 89.3% accuracy. The most useful data preparation steps were identified, and the underlying complexity and computational costs of automating the categorization process were also examined. The BERT model required 170x more time to train, was slower to predict by a factor of 18,600, and required 27x more disk space to save, indicating it may be the best choice for low-volume applications when the highest accuracy is needed. However, for larger-scale operations where a slight performance degradation is tolerated, the Naïve Bayes algorithm could be the best choice. Nearly one in four records in the Toutiao dataset are duplicates, and this is the first published analysis with duplicates removed.
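To make the simplest end of the comparison concrete, the sketch below removes duplicate headlines and then trains a multinomial Naïve Bayes classifier on them. It is an illustration only, not the authors' pipeline: the abstract does not say how headlines were tokenized or which implementation was used, so the character unigram and bigram features, the Laplace smoothing value, the helper names (`char_ngrams`, `deduplicate`, `NaiveBayes`), and the five toy headlines with their two labels are all assumptions made here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=2):
    """Split a headline into character 1..n_max-grams (assumed feature scheme)."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


def deduplicate(records):
    """Keep only the first record for each distinct headline text, in order."""
    seen = set()
    unique = []
    for headline, label in records:
        if headline not in seen:
            seen.add(headline)
            unique.append((headline, label))
    return unique


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is an assumed default

    def fit(self, records):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for headline, label in records:
            self.class_counts[label] += 1
            for gram in char_ngrams(headline):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, headline):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus log likelihood of each known n-gram.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(headline):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; the third record repeats the first.
records = [
    ("股市今日大涨", "finance"),
    ("央行宣布降息", "finance"),
    ("股市今日大涨", "finance"),
    ("球队赢得总决赛", "sports"),
    ("足球联赛今晚开赛", "sports"),
]
unique = deduplicate(records)
model = NaiveBayes().fit(unique)
print(len(records), len(unique))
print(model.predict("股市下跌"))
print(model.predict("足球比赛"))
```

Deduplicating before any train/holdout split matters because a headline that appears in both partitions would be scored on text the model has already seen, which can inflate holdout accuracy.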