自然语言处理中数据增强对下游性能的影响

First Workshop on Insights from Negative Results in NLP Pub Date : 1900-01-01 DOI:10.18653/v1/2022.insights-1.12

Itsuki Okimura, Machel Reid, M. Kawano, Yutaka Matsuo

{"title":"自然语言处理中数据增强对下游性能的影响","authors":"Itsuki Okimura, Machel Reid, M. Kawano, Yutaka Matsuo","doi":"10.18653/v1/2022.insights-1.12","DOIUrl":null,"url":null,"abstract":"With in the broader scope of machine learning, data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in the NLP has been been comparably rather limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task.Furthermore, we find a glaring lack of consistently performant data augmentations. This all alludes to the difficulty of data augmentations for NLP tasks and we are inclined to believe that static data augmentations are not broadly applicable given these properties.","PeriodicalId":441528,"journal":{"name":"First Workshop on Insights from Negative Results in NLP","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing\",\"authors\":\"Itsuki Okimura, Machel Reid, M. Kawano, Yutaka Matsuo\",\"doi\":\"10.18653/v1/2022.insights-1.12\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With in the broader scope of machine learning, data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in the NLP has been been comparably rather limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task.Furthermore, we find a glaring lack of consistently performant data augmentations. This all alludes to the difficulty of data augmentations for NLP tasks and we are inclined to believe that static data augmentations are not broadly applicable given these properties.\",\"PeriodicalId\":441528,\"journal\":{\"name\":\"First Workshop on Insights from Negative Results in NLP\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"First Workshop on Insights from Negative Results in NLP\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2022.insights-1.12\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"First Workshop on Insights from Negative Results in NLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.insights-1.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

在机器学习的广泛范围内，数据增强是提高机器学习模型泛化和鲁棒性的常用策略。虽然数据增强在计算机视觉中得到了广泛的应用，但它在自然语言处理中的应用却相当有限。这是因为在NLP内部，提出的数据增强方法对性能的影响尚未得到统一的评估，有效的数据增强方法尚不清楚。在本文中，我们希望通过在调整预训练语言模型时评估12种数据增强方法对多个数据集的影响来解决这个问题。我们发现，当数据大小限制在几千个以内时，性能的改善微乎其微，而当数据大小增加时，性能会下降。我们还使用各种方法来量化数据增强的强度，并发现这些值虽然与下游性能弱相关，但根据任务负相关或正相关。此外，我们发现明显缺乏一致的性能数据增强。这些都暗示了NLP任务中数据增强的困难，我们倾向于认为静态数据增强在给定这些属性的情况下并不广泛适用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing

With in the broader scope of machine learning, data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in the NLP has been been comparably rather limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task.Furthermore, we find a glaring lack of consistently performant data augmentations. This all alludes to the difficulty of data augmentations for NLP tasks and we are inclined to believe that static data augmentations are not broadly applicable given these properties.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

First Workshop on Insights from Negative Results in NLP

自引率

0.00%

发文量

期刊最新文献

What GPT Knows About Who is Who Pathologies of Pre-trained Language Models in Few-shot Fine-tuning Can Question Rewriting Help Conversational Question Answering? Extending the Scope of Out-of-Domain: Examining QA models in multiple subdomains Do Data-based Curricula Work?