Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

SeongYeub Chu, JongWoo Kim, MunYong Yi
{"title":"Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation","authors":"SeongYeub Chu, JongWoo Kim, MunYong Yi","doi":"arxiv-2409.07355","DOIUrl":null,"url":null,"abstract":"This study introduces \\textbf{InteractEval}, a framework that integrates\nhuman expertise and Large Language Models (LLMs) using the Think-Aloud (TA)\nmethod to generate attributes for checklist-based text evaluation. By combining\nhuman flexibility and reasoning with LLM consistency, InteractEval outperforms\ntraditional non-LLM-based and LLM-based baselines across four distinct\ndimensions, consisting of Coherence, Fluency, Consistency, and Relevance. The\nexperiment also investigates the effectiveness of the TA method, showing that\nit promotes divergent thinking in both humans and LLMs, leading to the\ngeneration of a wider range of relevant attributes and enhance text evaluation\nperformance. Comparative analysis reveals that humans excel at identifying\nattributes related to internal quality (Coherence and Fluency), but LLMs\nperform better at those attributes related to external alignment (Consistency\nand Relevance). Consequently, leveraging both humans and LLMs together produces\nthe best evaluation outcomes. In other words, this study emphasizes the\nnecessity of effectively combining humans and LLMs in an automated\nchecklist-based text evaluation framework. The code is available at\n\\textbf{\\url{https://github.com/BBeeChu/InteractEval.git}}.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07355","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four distinct dimensions: Coherence, Fluency, Consistency, and Relevance. The experiments also investigate the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, leading to a wider range of relevant attributes and improved text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), while LLMs perform better at attributes related to external alignment (Consistency and Relevance). Consequently, leveraging humans and LLMs together produces the best evaluation outcomes. In short, this study underscores the need to combine humans and LLMs effectively in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
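The abstract does not spell out how the merged attributes are applied, so the sketch below is purely illustrative: it shows one way a checklist-based evaluator of this kind could be wired up, with attributes elicited from human and LLM think-aloud sessions deduplicated into a per-dimension checklist and an LLM judge scoring the target text against each item. Every name here (merge_attributes, query_llm, the prompt template, the 1-5 scale) is a hypothetical stand-in, not the actual InteractEval implementation; see the linked repository for the authors' code.

```python
# Hypothetical sketch of checklist-based text evaluation in the spirit of
# InteractEval (not the paper's implementation): human- and LLM-generated
# attributes are merged into a checklist, then an LLM judge scores the text
# against each attribute and the scores are averaged per dimension.

from statistics import mean


def merge_attributes(human_attrs, llm_attrs):
    """Combine human- and LLM-generated attributes, dropping near-duplicates."""
    seen, merged = set(), []
    for attr in human_attrs + llm_attrs:
        key = attr.lower().strip()
        if key not in seen:
            seen.add(key)
            merged.append(attr)
    return merged


def query_llm(prompt):
    """Placeholder for a real LLM call (e.g., a hosted or local model API).
    Stubbed with a constant so the sketch runs standalone."""
    return 3.0  # assumed 1-5 rating scale


def evaluate(text, checklist, dimension):
    """Score `text` on one dimension by averaging per-attribute LLM judgments."""
    scores = []
    for attr in checklist:
        prompt = (
            f"Dimension: {dimension}\n"
            f"Attribute: {attr}\n"
            f"Text: {text}\n"
            "On a 1-5 scale, how well does the text satisfy this attribute?"
        )
        scores.append(query_llm(prompt))
    return mean(scores)


if __name__ == "__main__":
    # Toy attributes for the Coherence dimension; real ones would come from
    # human and LLM think-aloud sessions.
    human_attrs = ["Ideas follow a logical order", "Transitions connect paragraphs"]
    llm_attrs = ["Ideas follow a logical order", "No abrupt topic shifts"]
    checklist = merge_attributes(human_attrs, llm_attrs)
    print(evaluate("Sample essay text...", checklist, "Coherence"))
```

In a real setting, query_llm would wrap an actual model API, and the per-attribute scores would be aggregated separately for each of the four dimensions (Coherence, Fluency, Consistency, Relevance) described in the abstract.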