Exploring the ability of LLMs to classify written proficiency levels

IF 3.1 | Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Computer Speech and Language | Pub Date: 2024-10-29 | DOI: 10.1016/j.csl.2024.101745

Susanne DeVore

Computer Speech and Language, Volume 90, Article 101745. Full text: https://www.sciencedirect.com/science/article/pii/S0885230824001281


Abstract

This paper tests the ability of LLMs to classify the proficiency ratings of texts written by learners of English and Mandarin, using a benchmarking research design. First, the impact of five variables (LLM, prompt version, prompt language, grading scale, and temperature) on rating accuracy is tested using a basic instruction-only prompt. Second, the consistency of results is tested. Third, the top-performing consistent conditions from the first and second tests are used to test how rating accuracy is affected by adding examples and/or proficiency guidelines and by zero-, one-, and few-shot chain-of-thought prompting. While performance does not meet the levels necessary for real-world use, the results can inform ongoing development of LLMs and prompting techniques to improve accuracy. The paper highlights recent research on prompt engineering outside the field of linguistics and selects prompt variables and techniques that are theoretically relevant to proficiency rating. Finally, it discusses key takeaways from these tests that can inform future development, and why approaches that have been effective in other contexts were less effective for proficiency rating.
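The three-stage design described in the abstract can be sketched in code. The following is a minimal illustration only, not the paper's implementation: the variable values in `GRID`, the function names, and the agreement measure used in `consistency` are all assumptions made for the example, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product

# Hypothetical values for the five variables named in the abstract.
# The actual models, prompts, scales, and temperatures are not given there.
GRID = {
    "model": ["model_a", "model_b"],
    "prompt_version": ["v1", "v2"],
    "prompt_language": ["English", "Mandarin"],
    "grading_scale": ["scale_x", "scale_y"],
    "temperature": [0.0, 1.0],
}


def conditions(grid):
    """Yield one dict per combination of variable values (full factorial)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def accuracy(predicted, gold):
    """Share of texts whose predicted level equals the human rating."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def consistency(runs):
    """Assumed agreement measure: for each text, the share of repeated runs
    that return the most common label, averaged over all texts."""
    if not runs or not runs[0]:
        raise ValueError("runs must contain at least one non-empty run")
    shares = [
        Counter(labels).most_common(1)[0][1] / len(labels)
        for labels in zip(*runs)  # regroup labels by text across runs
    ]
    return sum(shares) / len(shares)
```

Under these assumptions, each condition from `conditions(GRID)` would be scored with `accuracy` against human ratings, repeated runs of a condition would be compared with `consistency`, and only the conditions that do well on both would be carried into the third stage (examples, guidelines, and chain-of-thought prompts).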

Source journal

Computer Speech and Language (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.