Prompting large language models for user simulation in task-oriented dialogue systems

Computer Speech and Language | IF 3.4 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-26 | DOI: 10.1016/j.csl.2024.101697

Atheer Algherairy, Moataz Ahmed

Volume 89, Article 101697. Article page: https://www.sciencedirect.com/science/article/pii/S0885230824000809


Abstract

Large Language Models (LLMs) have gained widespread popularity due to their instruction-following abilities. In this study, we evaluate their ability to simulate user interactions for task-oriented dialogue (TOD) systems. Our findings show that prompted LLMs are promising for training and testing dialogue policies, eliminating the need for domain expertise in crafting complex rules or for large annotated datasets, as required by traditional simulators. The dialogue system trained with the ChatGPT simulator achieves a success rate of 59%, comparable to the 62% success rate of the dialogue system trained with the manual-rules, agenda-based user simulator (ABUS). Furthermore, the dialogue system trained with the ChatGPT simulator generalizes better than the one trained with the ABUS: its success rate is higher by 4% on GenTUS, 5% on the ChatGPT simulator, and 3% on the Llama simulator. In addition, LLM-based user simulators provide a challenging environment with lexically rich, diverse, and random responses. The Llama simulator outperforms the human reference on all lexical diversity metrics, with margins of 0.66 in SE, 0.39 in CE, 0.01 in MSTTR, 0.04 in HDD, and 0.55 in MTLD, while the ChatGPT simulator achieves comparable results. This ultimately enhances the system’s ability to generalize.
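To make two of the lexical diversity metrics above more concrete, the following is a minimal sketch, assuming SE denotes unigram Shannon entropy and MSTTR denotes the mean segmental type-token ratio, as is common in the user-simulation literature. The function names, the whitespace tokenization, and the segment size of 50 are illustrative assumptions; the abstract does not state the exact settings the authors used.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Unigram Shannon entropy (in bits) of a token sequence.

    Assumption: SE in the abstract refers to this quantity.
    """
    counts = Counter(tokens)
    total = len(tokens)
    # Sum of -p * log2(p) over the empirical unigram distribution.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def msttr(tokens, segment_size=50):
    """Mean type-token ratio over consecutive, non-overlapping full segments.

    Assumption: segment_size=50 is a placeholder, not the paper's setting.
    Trailing tokens that do not fill a whole segment are ignored.
    """
    segments = [
        tokens[i:i + segment_size]
        for i in range(0, len(tokens) - segment_size + 1, segment_size)
    ]
    if not segments:
        raise ValueError("need at least one full segment")
    # Type-token ratio = distinct tokens / total tokens, averaged over segments.
    return sum(len(set(s)) / len(s) for s in segments) / len(segments)


if __name__ == "__main__":
    # Hypothetical simulated user utterances, tokenized on whitespace.
    utterances = ["i need a cheap hotel", "i need a taxi to the hotel"]
    tokens = [tok for utt in utterances for tok in utt.split()]
    print("SE:", shannon_entropy(tokens))
    print("MSTTR:", msttr(tokens, segment_size=4))
```

Higher values of either metric indicate a more varied vocabulary, which is the sense in which the abstract describes the LLM-based simulators as lexically richer than the human reference.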




Source journal

Computer Speech and Language (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.