Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

SeongYeub Chu, JongWoo Kim, MunYong Yi
{"title":"Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation","authors":"SeongYeub Chu, JongWoo Kim, MunYong Yi","doi":"arxiv-2409.07355","DOIUrl":null,"url":null,"abstract":"This study introduces \\textbf{InteractEval}, a framework that integrates\nhuman expertise and Large Language Models (LLMs) using the Think-Aloud (TA)\nmethod to generate attributes for checklist-based text evaluation. By combining\nhuman flexibility and reasoning with LLM consistency, InteractEval outperforms\ntraditional non-LLM-based and LLM-based baselines across four distinct\ndimensions, consisting of Coherence, Fluency, Consistency, and Relevance. The\nexperiment also investigates the effectiveness of the TA method, showing that\nit promotes divergent thinking in both humans and LLMs, leading to the\ngeneration of a wider range of relevant attributes and enhance text evaluation\nperformance. Comparative analysis reveals that humans excel at identifying\nattributes related to internal quality (Coherence and Fluency), but LLMs\nperform better at those attributes related to external alignment (Consistency\nand Relevance). Consequently, leveraging both humans and LLMs together produces\nthe best evaluation outcomes. In other words, this study emphasizes the\nnecessity of effectively combining humans and LLMs in an automated\nchecklist-based text evaluation framework. The code is available at\n\\textbf{\\url{https://github.com/BBeeChu/InteractEval.git}}.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07355","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four distinct dimensions: Coherence, Fluency, Consistency, and Relevance. The experiments also investigate the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, leading to a wider range of relevant attributes and improved text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), while LLMs perform better at attributes related to external alignment (Consistency and Relevance). Consequently, leveraging humans and LLMs together produces the best evaluation outcomes. In short, this study underscores the need to combine humans and LLMs effectively in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
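The abstract does not spell out how the merged attributes are applied, so the sketch below is purely illustrative: it shows one way a checklist-based evaluator of this kind could be wired up, with attributes elicited from human and LLM think-aloud sessions deduplicated into a per-dimension checklist and an LLM judge scoring the target text against each item. Every name here (merge_attributes, query_llm, the prompt template, the 1-5 scale) is a hypothetical stand-in, not the actual InteractEval implementation; see the linked repository for the authors' code.

```python
# Hypothetical sketch of checklist-based text evaluation in the spirit of
# InteractEval (not the paper's implementation): human- and LLM-generated
# attributes are merged into a checklist, then an LLM judge scores the text
# against each attribute and the scores are averaged per dimension.

from statistics import mean


def merge_attributes(human_attrs, llm_attrs):
    """Combine human- and LLM-generated attributes, dropping near-duplicates."""
    seen, merged = set(), []
    for attr in human_attrs + llm_attrs:
        key = attr.lower().strip()
        if key not in seen:
            seen.add(key)
            merged.append(attr)
    return merged


def query_llm(prompt):
    """Placeholder for a real LLM call (e.g., a hosted or local model API).
    Stubbed with a constant so the sketch runs standalone."""
    return 3.0  # assumed 1-5 rating scale


def evaluate(text, checklist, dimension):
    """Score `text` on one dimension by averaging per-attribute LLM judgments."""
    scores = []
    for attr in checklist:
        prompt = (
            f"Dimension: {dimension}\n"
            f"Attribute: {attr}\n"
            f"Text: {text}\n"
            "On a 1-5 scale, how well does the text satisfy this attribute?"
        )
        scores.append(query_llm(prompt))
    return mean(scores)


if __name__ == "__main__":
    # Toy attributes for the Coherence dimension; real ones would come from
    # human and LLM think-aloud sessions.
    human_attrs = ["Ideas follow a logical order", "Transitions connect paragraphs"]
    llm_attrs = ["Ideas follow a logical order", "No abrupt topic shifts"]
    checklist = merge_attributes(human_attrs, llm_attrs)
    print(evaluate("Sample essay text...", checklist, "Coherence"))
```

In a real setting, query_llm would wrap an actual model API, and the per-attribute scores would be aggregated separately for each of the four dimensions (Coherence, Fluency, Consistency, Relevance) described in the abstract.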