An evaluation on large language model outputs: Discourse and memorization

Natural Language Processing Journal. Pub Date: 2023-09-01. DOI: 10.1016/j.nlp.2023.100024

Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, Si-Qing Chen

Citations: 7

Abstract

We present an empirical evaluation of various outputs generated by nine of the most widely available large language models (LLMs). Our analysis uses off-the-shelf, readily available tools. We find a correlation between the percentage of memorized text, the percentage of unique text, and overall output quality, when quality is measured with respect to output pathologies such as counterfactual and logically flawed statements, and general failures such as not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, but outputs containing the most memorized content were also more likely to be considered of high quality. We discuss and evaluate mitigation strategies, showing that, in the models evaluated, they reduce the rate at which memorized text is output.

Source journal

Natural Language Processing Journal

Self-citation rate

0.00%

Number of published articles