Performance Evaluation of Text Augmentation Methods with BERT on Small-sized, Imbalanced Datasets

Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang
{"title":"Performance Evaluation of Text Augmentation Methods with BERT on Small-sized, Imbalanced Datasets","authors":"Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang","doi":"10.1109/CogMI56440.2022.00027","DOIUrl":null,"url":null,"abstract":"Recently deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area—how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods—and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significantly (15.6–40.4% F1 increase compared to the base model and 2.8%–10.4% F1 increase compared to the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by the BERT augmentation becomes smaller or insignificant. Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.","PeriodicalId":211430,"journal":{"name":"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CogMI56440.2022.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recently, deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced across classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area—how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods—and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significant (a 15.6%–40.4% F1 increase over the base model and a 2.8%–10.4% F1 increase over the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by BERT augmentation becomes smaller or insignificant. Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.
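The paper itself does not include code, but the "BERT augmentation" it evaluates is commonly implemented as masked-language-model word substitution: tokens in a minority-class document are masked and replaced with the model's predictions to synthesize additional labeled examples. The sketch below illustrates that idea with the Hugging Face transformers fill-mask pipeline; the checkpoint (bert-base-uncased), the masking rate, and the augment helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of masked-LM ("BERT") text augmentation for a minority class.
# Assumes the Hugging Face `transformers` package; model name, masking rate,
# and function names are illustrative, not taken from the paper.
import random

from transformers import pipeline

# Fill-mask pipeline with a standard pretrained BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]"


def augment(text: str, mask_rate: float = 0.15, seed: int = 0) -> str:
    """Replace a random subset of words with BERT's top fill-mask prediction."""
    random.seed(seed)
    words = text.split()
    out = list(words)
    for i in range(len(words)):
        if random.random() >= mask_rate:
            continue
        # Mask one position at a time and ask the masked LM for a substitute.
        masked = " ".join(out[:i] + [MASK] + out[i + 1:])
        candidates = fill_mask(masked)
        out[i] = candidates[0]["token_str"].strip()
    return " ".join(out)


if __name__ == "__main__":
    minority_doc = "the service was slow and the staff seemed uninterested"
    print(augment(minority_doc))
```

In the setting described by the abstract, such synthetic documents would be added only to the minority class before BERT fine-tuning, which distinguishes the approach from plain oversampling, where existing minority-class documents are duplicated verbatim.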