Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

William Huang, Haokun Liu, Samuel R. Bowman
{"title":"Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data","authors":"William Huang, Haokun Liu, Samuel R. Bowman","doi":"10.18653/v1/2020.insights-1.13","DOIUrl":null,"url":null,"abstract":"A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks—datasets collected from crowdworkers to create an evaluation task—while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data—data built by minimally editing a set of seed examples to yield counterfactual labels—to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than unaugmented datasets of similar size and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data and further innovation is required to make this general line of work viable.","PeriodicalId":441528,"journal":{"name":"First Workshop on Insights from Negative Results in NLP","volume":"60 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"First Workshop on Insights from Negative Results in NLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.insights-1.13","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30

Abstract

A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks—datasets collected from crowdworkers to create an evaluation task—while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data—data built by minimally editing a set of seed examples to yield counterfactual labels—to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness, and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data, and further innovation is required to make this general line of work viable.
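To make the augmentation protocol concrete, below is a minimal sketch of what a counterfactual minimal pair for NLI looks like. The sentences and the `augmented_training_set` structure are hypothetical illustrations invented for exposition; they are not drawn from SNLI or from the counterfactually-augmented dataset described in the paper.

```python
# Hypothetical illustration of counterfactual augmentation for NLI.
# These sentences are invented for exposition, not taken from SNLI
# or from the counterfactually-augmented dataset itself.

# A seed example as it might appear in the original training data.
seed = {
    "premise": "A man in a blue shirt is playing guitar on a stage.",
    "hypothesis": "A musician is performing.",
    "label": "entailment",
}

# A crowdworker minimally edits the hypothesis so the gold label flips.
# Only the label-relevant span changes; the rest of the text is untouched.
counterfactual = {
    "premise": "A man in a blue shirt is playing guitar on a stage.",
    "hypothesis": "A musician is sleeping backstage.",
    "label": "contradiction",
}

# The augmented training set pairs each seed with its revised counterpart,
# giving the model minimal pairs whose surface forms differ only where the
# label-determining evidence differs.
augmented_training_set = [seed, counterfactual]
```

The paper's negative result concerns exactly this setup: when the comparison controls for dataset size and collection cost, training on such minimal pairs did not improve out-of-domain generalization over training on ordinary unaugmented SNLI examples, and in some cases reduced robustness to challenge examples.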