The Impact of Data Preparation and Model Complexity on the Natural Language Classification of Chinese News Headlines

Algorithms. Pub Date: 2024-03-22. DOI: 10.3390/a17040132

Torrey Wagner, Dennis Guhl, Brent Langhals



Abstract

Given the emergence of China as a political and economic power in the 21st century, there is increased interest in analyzing Chinese news articles to better understand developing trends in China. Because of the volume of material, automatically categorizing Chinese-language news articles by their headlines can be an effective way to sort them for efficient review. A 383,000-headline dataset from the Toutiao website, labeled with 15 categories, was evaluated via natural language processing to predict topic categories. The influence of six data preparation variations on the predictive accuracy of four algorithms was studied. The simplest model (Naïve Bayes) achieved 85.1% accuracy on a holdout dataset, while the most complex model (a neural network using BERT) achieved 89.3% accuracy. The most useful data preparation steps were identified, and the underlying complexity and computational costs of automating the categorization process were also examined. The BERT model required 170x more time to train, was slower to predict by a factor of 18,600, and required 27x more disk space to save, indicating that it may be the best choice for low-volume applications when the highest accuracy is needed. However, for larger-scale operations where a slight performance degradation is tolerated, the Naïve Bayes algorithm could be the best choice. Nearly one in four records in the Toutiao dataset are duplicates, and this is the first published analysis with duplicates removed.
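
To make the simplest end of this comparison concrete, the following is a minimal sketch of duplicate removal followed by a multinomial Naïve Bayes headline classifier. It is not the authors' pipeline: the function and class names (`deduplicate`, `char_ngrams`, `NaiveBayes`), the use of character unigrams and bigrams as features, the add-one smoothing, and the four toy headlines with `sports` and `finance` labels are all assumptions made for illustration, and the toy data does not come from the Toutiao dataset.

```python
import math
from collections import Counter, defaultdict


def deduplicate(records):
    """Keep the first occurrence of each headline; drop exact repeats."""
    seen = set()
    unique = []
    for headline, label in records:
        if headline not in seen:
            seen.add(headline)
            unique.append((headline, label))
    return unique


def char_ngrams(text, n=2):
    """Character unigrams plus n-grams (assumed features; no word segmenter)."""
    text = "".join(text.split())  # strip whitespace
    grams = list(text)
    grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, records):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for headline, label in records:
            self.class_counts[label] += 1
            for gram in char_ngrams(headline):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, headline):
        grams = char_ngrams(headline)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / self.total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for gram in grams:
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy records (headline, label); the last one is a duplicate.
records = [
    ("国足世界杯预选赛今晚开战", "sports"),
    ("篮球联赛总决赛球队夺冠", "sports"),
    ("央行宣布下调贷款利率", "finance"),
    ("股市大涨银行股领涨", "finance"),
    ("央行宣布下调贷款利率", "finance"),
]

unique = deduplicate(records)
model = NaiveBayes().fit(unique)

print(len(records), "records,", len(unique), "after duplicate removal")
print(model.predict("球队赢得联赛冠军"))
print(model.predict("银行下调利率"))
```

Removing duplicates before fitting keeps repeated headlines from inflating the class priors and feature counts, and in a real evaluation it would also keep the same headline from appearing in both the training and holdout sets. A model of this kind stores only count tables, which is consistent with the small training, prediction, and storage costs the abstract reports for Naïve Bayes relative to BERT.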


