用于测试NLI的各种逻辑推理能力的可扩展框架

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Language Resources and Evaluation Pub Date : 2023-11-04 DOI:10.1007/s10579-023-09691-y

Ishan Tarunesh, Somak Aditya, Monojit Choudhury

{"title":"用于测试NLI的各种逻辑推理能力的可扩展框架","authors":"Ishan Tarunesh, Somak Aditya, Monojit Choudhury","doi":"10.1007/s10579-023-09691-y","DOIUrl":null,"url":null,"abstract":"Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models – supporting and extending previous observations; thus showing the utility of the proposed testbench.","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"11 8","pages":"0"},"PeriodicalIF":1.7000,"publicationDate":"2023-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI\",\"authors\":\"Ishan Tarunesh, Somak Aditya, Monojit Choudhury\",\"doi\":\"10.1007/s10579-023-09691-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models – supporting and extending previous observations; thus showing the utility of the proposed testbench.\",\"PeriodicalId\":49927,\"journal\":{\"name\":\"Language Resources and Evaluation\",\"volume\":\"11 8\",\"pages\":\"0\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2023-11-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Language Resources and Evaluation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s10579-023-09691-y\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s10579-023-09691-y","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 4

摘要

自然语言推理(NLI)被认为是测试自然语言理解能力的代表性任务。在这项工作中，我们提出了一个可扩展的框架，以集体但分类地测试NLI(以及通过扩展，NLU)所需的各种逻辑推理能力。在行为测试的激励下，我们创建了一个半合成的大型测试台(363个模板，363k个示例)和一个相关框架，提供以下实用工具:(1)在17个推理维度(包括实用推理)上单独测试和分析推理能力;(2)设计实验，研究跨能力的信息内容(留一项或加一项);(3)综合性质使我们能够控制人为因素和偏见。我们从自由形式的自然语言模板(CheckList)和定义良好的功能分类中扩展了一个公开可用的自动化测试用例实例化框架，以覆盖范围越来越广的越来越难的测试用例，同时改变自然语言的复杂性。通过对最先进的NLI系统的分析，我们观察到我们的基准测试确实很难(即使使用额外资源进行训练也很重要)。有些功能比较难。此外，细粒度分析和微调实验揭示了关于这些能力和模型的更多见解——支持和扩展了以前的观察结果;由此可见所提出的试验台的实用性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI

Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models – supporting and extending previous observations; thus showing the utility of the proposed testbench.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.