{"title":"利用聚类和提示自动生成自然语言处理行为测试用例","authors":"Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto","doi":"arxiv-2408.00161","DOIUrl":null,"url":null,"abstract":"Recent work in behavioral testing for natural language processing (NLP)\nmodels, such as Checklist, is inspired by related paradigms in software\nengineering testing. They allow evaluation of general linguistic capabilities\nand domain understanding, hence can help evaluate conceptual soundness and\nidentify model weaknesses. However, a major challenge is the creation of test\ncases. The current packages rely on semi-automated approach using manual\ndevelopment which requires domain expertise and can be time consuming. This\npaper introduces an automated approach to develop test cases by exploiting the\npower of large language models and statistical techniques. It clusters the text\nrepresentations to carefully construct meaningful groups and then apply\nprompting techniques to automatically generate Minimal Functionality Tests\n(MFT). The well-known Amazon Reviews corpus is used to demonstrate our\napproach. We analyze the behavioral test profiles across four different\nclassification algorithms and discuss the limitations and strengths of those\nmodels.","PeriodicalId":501168,"journal":{"name":"arXiv - CS - Emerging Technologies","volume":"35 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting\",\"authors\":\"Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto\",\"doi\":\"arxiv-2408.00161\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent work in behavioral testing for natural language processing (NLP)\\nmodels, such as Checklist, is inspired by related paradigms in software\\nengineering testing. They allow evaluation of general linguistic capabilities\\nand domain understanding, hence can help evaluate conceptual soundness and\\nidentify model weaknesses. However, a major challenge is the creation of test\\ncases. The current packages rely on semi-automated approach using manual\\ndevelopment which requires domain expertise and can be time consuming. This\\npaper introduces an automated approach to develop test cases by exploiting the\\npower of large language models and statistical techniques. It clusters the text\\nrepresentations to carefully construct meaningful groups and then apply\\nprompting techniques to automatically generate Minimal Functionality Tests\\n(MFT). The well-known Amazon Reviews corpus is used to demonstrate our\\napproach. 
We analyze the behavioral test profiles across four different\\nclassification algorithms and discuss the limitations and strengths of those\\nmodels.\",\"PeriodicalId\":501168,\"journal\":{\"name\":\"arXiv - CS - Emerging Technologies\",\"volume\":\"35 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Emerging Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.00161\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Emerging Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00161","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting
Recent work in behavioral testing for natural language processing (NLP) models, such as Checklist, is inspired by related paradigms in software engineering testing. Behavioral tests allow evaluation of general linguistic capabilities and domain understanding, and hence can help assess conceptual soundness and identify model weaknesses. A major challenge, however, is the creation of test cases. Current packages rely on a semi-automated approach built on manual development, which requires domain expertise and can be time-consuming. This paper introduces an automated approach to developing test cases that exploits the power of large language models and statistical techniques. It clusters text representations to carefully construct meaningful groups and then applies prompting techniques to automatically generate Minimal Functionality Tests (MFT). The well-known Amazon Reviews corpus is used to demonstrate our approach. We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
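A minimal sketch of the pipeline the abstract describes: embed texts, cluster the representations into meaningful groups, then prompt a large language model per cluster to draft MFT cases. The library choices (sentence-transformers, scikit-learn, the OpenAI client), the model names, the cluster count, and the prompt wording are all illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of clustering-then-prompting for MFT generation.
# All models, parameters, and prompt text below are assumptions, not the
# paper's actual configuration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from openai import OpenAI

reviews = [
    "Battery died after two days, very disappointed.",
    "Fast shipping and the fabric feels great.",
    # ... more Amazon reviews ...
]

# 1) Embed the texts into dense vector representations.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(reviews)

# 2) Cluster the representations into candidate capability groups.
n_clusters = 2  # chosen per corpus; hypothetical value here
labels = KMeans(n_clusters=n_clusters, n_init="auto",
                random_state=0).fit_predict(embeddings)

# 3) Prompt an LLM with each cluster's examples to generate MFT cases.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
for cluster_id in range(n_clusters):
    examples = [r for r, l in zip(reviews, labels) if l == cluster_id][:5]
    prompt = (
        "Here are example product reviews sharing a common theme:\n"
        + "\n".join(f"- {e}" for e in examples)
        + "\nWrite 5 short test sentences probing the same theme, "
          "one per line in the form 'label: text', where label is "
          "positive or negative sentiment."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Cluster {cluster_id} MFT candidates:")
    print(response.choices[0].message.content)
```

In this sketch, each cluster acts as a proxy for a linguistic capability or domain theme, and the generated labeled sentences serve as MFT cases whose expected labels can be checked against a classifier's predictions.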