InCA: Rethinking In-Car Conversational System Assessment Leveraging Large Language Models

arXiv (Cornell University). Pub Date: 2023-11-13. DOI: 10.48550/arxiv.2311.07469

Friedl, Ken E.; Khan, Abbas Goher; Sahoo, Soumya Ranjan; Rony, Md Rashad Al Hasan; Germies, Jana; Süß, Christian


Citations: 0


InCA: Rethinking In-Car Conversational System Assessment Leveraging Large Language Models

The assessment of advanced generative large language models (LLMs) poses a significant challenge, given their heightened complexity in recent developments. Furthermore, evaluating the performance of LLM-based applications in various industries, as indicated by Key Performance Indicators (KPIs), is a complex undertaking. This task necessitates a profound understanding of industry use cases and the anticipated system behavior. Within the context of the automotive industry, existing evaluation metrics prove inadequate for assessing in-car conversational question answering (ConvQA) systems. The unique demands of these systems, where answers may relate to driver or car safety and are confined within the car domain, highlight the limitations of current metrics. To address these challenges, this paper introduces a set of KPIs tailored for evaluating the performance of in-car ConvQA systems, along with datasets specifically designed for these KPIs. A preliminary and comprehensive empirical evaluation substantiates the efficacy of our proposed approach. Furthermore, we investigate the impact of employing varied personas in prompts and find that doing so enhances the model's capacity to simulate diverse viewpoints in assessments, mirroring how individuals with different backgrounds perceive a topic.
