{"title":"利用聚类和提示自动生成自然语言处理行为测试用例","authors":"Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto","doi":"arxiv-2408.00161","DOIUrl":null,"url":null,"abstract":"Recent work in behavioral testing for natural language processing (NLP)\nmodels, such as Checklist, is inspired by related paradigms in software\nengineering testing. They allow evaluation of general linguistic capabilities\nand domain understanding, hence can help evaluate conceptual soundness and\nidentify model weaknesses. However, a major challenge is the creation of test\ncases. The current packages rely on semi-automated approach using manual\ndevelopment which requires domain expertise and can be time consuming. This\npaper introduces an automated approach to develop test cases by exploiting the\npower of large language models and statistical techniques. It clusters the text\nrepresentations to carefully construct meaningful groups and then apply\nprompting techniques to automatically generate Minimal Functionality Tests\n(MFT). The well-known Amazon Reviews corpus is used to demonstrate our\napproach. We analyze the behavioral test profiles across four different\nclassification algorithms and discuss the limitations and strengths of those\nmodels.","PeriodicalId":501168,"journal":{"name":"arXiv - CS - Emerging Technologies","volume":"35 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting\",\"authors\":\"Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto\",\"doi\":\"arxiv-2408.00161\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent work in behavioral testing for natural language processing (NLP)\\nmodels, such as Checklist, is inspired by related paradigms in software\\nengineering testing. They allow evaluation of general linguistic capabilities\\nand domain understanding, hence can help evaluate conceptual soundness and\\nidentify model weaknesses. However, a major challenge is the creation of test\\ncases. The current packages rely on semi-automated approach using manual\\ndevelopment which requires domain expertise and can be time consuming. This\\npaper introduces an automated approach to develop test cases by exploiting the\\npower of large language models and statistical techniques. It clusters the text\\nrepresentations to carefully construct meaningful groups and then apply\\nprompting techniques to automatically generate Minimal Functionality Tests\\n(MFT). The well-known Amazon Reviews corpus is used to demonstrate our\\napproach. 
We analyze the behavioral test profiles across four different\\nclassification algorithms and discuss the limitations and strengths of those\\nmodels.\",\"PeriodicalId\":501168,\"journal\":{\"name\":\"arXiv - CS - Emerging Technologies\",\"volume\":\"35 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Emerging Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.00161\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Emerging Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00161","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting
Recent work in behavioral testing for natural language processing (NLP) models, such as Checklist, is inspired by related paradigms in software engineering testing. Behavioral tests allow evaluation of general linguistic capabilities and domain understanding, and hence can help assess conceptual soundness and identify model weaknesses. A major challenge, however, is the creation of test cases. Current packages rely on a semi-automated approach built on manual development, which requires domain expertise and can be time-consuming. This paper introduces an automated approach to developing test cases that exploits the power of large language models and statistical techniques. It clusters text representations to carefully construct meaningful groups and then applies prompting techniques to automatically generate Minimal Functionality Tests (MFT). The well-known Amazon Reviews corpus is used to demonstrate our approach. We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
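A minimal sketch of the pipeline the abstract describes: embed texts, cluster the representations into meaningful groups, then prompt a large language model per cluster to draft MFT cases. The library choices (sentence-transformers, scikit-learn, the OpenAI client), the model names, the cluster count, and the prompt wording are all illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of clustering-then-prompting for MFT generation.
# All models, parameters, and prompt text below are assumptions, not the
# paper's actual configuration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from openai import OpenAI

reviews = [
    "Battery died after two days, very disappointed.",
    "Fast shipping and the fabric feels great.",
    # ... more Amazon reviews ...
]

# 1) Embed the texts into dense vector representations.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(reviews)

# 2) Cluster the representations into candidate capability groups.
n_clusters = 2  # chosen per corpus; hypothetical value here
labels = KMeans(n_clusters=n_clusters, n_init="auto",
                random_state=0).fit_predict(embeddings)

# 3) Prompt an LLM with each cluster's examples to generate MFT cases.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
for cluster_id in range(n_clusters):
    examples = [r for r, l in zip(reviews, labels) if l == cluster_id][:5]
    prompt = (
        "Here are example product reviews sharing a common theme:\n"
        + "\n".join(f"- {e}" for e in examples)
        + "\nWrite 5 short test sentences probing the same theme, "
          "one per line in the form 'label: text', where label is "
          "positive or negative sentiment."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Cluster {cluster_id} MFT candidates:")
    print(response.choices[0].message.content)
```

In this sketch, each cluster acts as a proxy for a linguistic capability or domain theme, and the generated labeled sentences serve as MFT cases whose expected labels can be checked against a classifier's predictions.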