A resampling-based method to evaluate NLI models

Natural Language Engineering; Pub Date: 2023-06-09; DOI: 10.1017/s1351324923000268; IF 1.9; Zone 3 (Computer Science); JCR Q3 (Computer Science, Artificial Intelligence)

Felipe Salvatore, M. Finger, R. Hirata, A. G. Patriota


Citations: 0

Abstract

The recent progress of deep learning techniques has produced models capable of achieving high scores on traditional Natural Language Inference (NLI) datasets. To understand the generalization limits of these powerful models, an increasing number of adversarial evaluation schemes have appeared. These works use a similar evaluation method: they construct a new NLI test set based on sentences with known logic and semantic properties (the adversarial set), train a model on a benchmark NLI dataset, and evaluate it in the new set. Poor performance on the adversarial set is identified as a model limitation. The problem with this evaluation procedure is that it may only indicate a sampling problem. A machine learning model can perform poorly on a new test set because the text patterns presented in the adversarial set are not well represented in the training sample. To address this problem, we present a new evaluation method, the Invariance under Equivalence test (IE test). The IE test trains a model with sufficient adversarial examples and checks the model’s performance on two equivalent datasets. As a case study, we apply the IE test to the state-of-the-art NLI models using synonym substitution as the form of adversarial examples. The experiment shows that, despite their high predictive power, these models usually produce different inference outputs for equivalent inputs, and, more importantly, this deficiency cannot be solved by adding adversarial observations in the training data.
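
The comparison on two equivalent datasets can be illustrated with a small resampling check. This is a minimal sketch, not the procedure from the paper: it assumes that per-example correctness (1 or 0) is already available for one model on an original test set and on a synonym-substituted copy of the same examples, and it uses a generic paired bootstrap as a stand-in. The function name `paired_bootstrap_pvalue` and its parameters are hypothetical.

```python
import random


def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap test of H0: equal accuracy on two equivalent test sets.

    correct_a[i] and correct_b[i] are 1/0 correctness flags for the same
    underlying example in the original and the transformed dataset.
    Returns (observed accuracy difference, two-sided p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("the two datasets must be paired example by example")
    if not correct_a:
        raise ValueError("at least one example is required")

    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    # Shift the differences to mean zero so that resampling simulates H0.
    centered = [d - observed for d in diffs]

    extreme = 0
    for _ in range(n_resamples):
        resampled_mean = sum(rng.choices(centered, k=n)) / n
        if abs(resampled_mean) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Toy data (made up for illustration): the model is always right on the
    # original set and right on only half of the transformed set.
    original = [1] * 200
    transformed = [1] * 100 + [0] * 100
    diff, p = paired_bootstrap_pvalue(original, transformed, n_resamples=500)
    print(f"accuracy difference = {diff:.3f}, p-value = {p:.4f}")
```

A small p-value here would suggest that the model's accuracy is not invariant across the two versions of the data. Whether that reflects a model limitation or a sampling problem depends on the training setup, which this sketch does not cover.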

Source journal

Natural Language Engineering (Computer Science, Artificial Intelligence)

CiteScore: 5.90
Self-citation rate: 12.00%
Articles published: not listed
Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.