Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

William Huang, Haokun Liu, Samuel R. Bowman
{"title":"Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data","authors":"William Huang, Haokun Liu, Samuel R. Bowman","doi":"10.18653/v1/2020.insights-1.13","DOIUrl":null,"url":null,"abstract":"A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks—datasets collected from crowdworkers to create an evaluation task—while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data—data built by minimally editing a set of seed examples to yield counterfactual labels—to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than unaugmented datasets of similar size and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data and further innovation is required to make this general line of work viable.","PeriodicalId":441528,"journal":{"name":"First Workshop on Insights from Negative Results in NLP","volume":"60 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"First Workshop on Insights from Negative Results in NLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.insights-1.13","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30

Abstract

A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks—datasets collected from crowdworkers to create an evaluation task—while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data—data built by minimally editing a set of seed examples to yield counterfactual labels—to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness, and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data, and further innovation is required to make this general line of work viable.
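To make the augmentation protocol concrete, below is a minimal sketch of what a counterfactual minimal pair for NLI looks like. The sentences and the `augmented_training_set` structure are hypothetical illustrations invented for exposition; they are not drawn from SNLI or from the counterfactually-augmented dataset described in the paper.

```python
# Hypothetical illustration of counterfactual augmentation for NLI.
# These sentences are invented for exposition, not taken from SNLI
# or from the counterfactually-augmented dataset itself.

# A seed example as it might appear in the original training data.
seed = {
    "premise": "A man in a blue shirt is playing guitar on a stage.",
    "hypothesis": "A musician is performing.",
    "label": "entailment",
}

# A crowdworker minimally edits the hypothesis so the gold label flips.
# Only the label-relevant span changes; the rest of the text is untouched.
counterfactual = {
    "premise": "A man in a blue shirt is playing guitar on a stage.",
    "hypothesis": "A musician is sleeping backstage.",
    "label": "contradiction",
}

# The augmented training set pairs each seed with its revised counterpart,
# giving the model minimal pairs whose surface forms differ only where the
# label-determining evidence differs.
augmented_training_set = [seed, counterfactual]
```

The paper's negative result concerns exactly this setup: when the comparison controls for dataset size and collection cost, training on such minimal pairs did not improve out-of-domain generalization over training on ordinary unaugmented SNLI examples, and in some cases reduced robustness to challenge examples.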