
arXiv - CS - Computation and Language: Latest Publications

Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs
Pub Date: 2024-09-11 | DOI: arxiv-2409.07246
Firoj Alam, Md. Rafiul Biswas, Uzair Shah, Wajdi Zaghouani, Georgios Mikros
In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse and fine-grained hate labels. Our finding suggests that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community.
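The abstract describes the multi-agent pipeline only at a high level. As a rough illustration of how role-prompted LLM "agents" could produce coarse and fine-grained labels for a meme, here is a minimal Python sketch; `call_llm`, the role prompts, and the label sets are hypothetical stand-ins, not the authors' actual setup.

```python
# A minimal sketch of a multi-agent meme-labeling pipeline, assuming a generic
# chat-completion callable. Prompts and label vocabularies are illustrative.
from typing import Callable

LLM = Callable[[str], str]

def label_meme(image_description: str, overlaid_text: str, call_llm: LLM) -> dict:
    """Ask specialist 'agents' (role-prompted LLM calls) for labels, then adjudicate."""
    meme = f"Image: {image_description}\nText: {overlaid_text}\n"
    propaganda = call_llm(
        "You detect propaganda techniques in memes.\n" + meme +
        "Answer 'propagandistic' or 'not propagandistic'."
    )
    hate_coarse = call_llm(
        "You detect hate speech in memes.\n" + meme +
        "Answer 'hateful' or 'not hateful'."
    )
    hate_fine = call_llm(
        "If the meme is hateful, name the fine-grained category "
        "(e.g. dehumanization, slur, incitement); otherwise answer 'none'.\n" + meme
    )
    verdict = call_llm(
        "You are an adjudicator. Given these agent opinions, return a final label set.\n"
        f"propaganda={propaganda}\nhate={hate_coarse}\nfine={hate_fine}"
    )
    return {"propaganda": propaganda, "hate": hate_coarse,
            "hate_fine": hate_fine, "final": verdict}
```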
Citations: 0
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
Pub Date: 2024-09-11 | DOI: arxiv-2409.07123
Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt
Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
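The generate-critique-refine loop is concrete enough to sketch. Below is a minimal illustration of one Cross-Refine-style pass; `generator` and `critic` stand for any two chat-completion callables, and the prompt wording is an assumption rather than the paper's.

```python
# A minimal sketch of one generator/critic refinement pass, assuming two
# generic chat-completion callables for the NLI explanation task.
from typing import Callable

LLM = Callable[[str], str]

def cross_refine(premise: str, hypothesis: str, label: str,
                 generator: LLM, critic: LLM) -> str:
    """One generate -> critique -> refine pass for a natural language explanation."""
    task = (f"Premise: {premise}\nHypothesis: {hypothesis}\nLabel: {label}\n"
            "Explain why the label holds.")
    first_nle = generator(task)
    feedback = critic(
        "You are a critic. Give feedback and concrete suggestions for improving "
        f"this explanation.\n{task}\nExplanation: {first_nle}"
    )
    refined = generator(
        f"{task}\nYour previous explanation: {first_nle}\n"
        f"Critic feedback: {feedback}\nRewrite the explanation accordingly."
    )
    return refined
```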
Citations: 0
Understanding Knowledge Drift in LLMs through Misinformation
Pub Date: 2024-09-11 | DOI: arxiv-2409.07085
Alina Fastowski, Gjergji Kasneci
Large Language Models (LLMs) have revolutionized numerous applications, making them an integral part of our digital ecosystem. However, their reliability becomes critical, especially when these models are exposed to misinformation. We primarily analyze the susceptibility of state-of-the-art LLMs to factual inaccuracies when they encounter false information in a QnA scenario, an issue that can lead to a phenomenon we refer to as *knowledge drift*, which significantly undermines the trustworthiness of these models. We evaluate the factuality and the uncertainty of the models' responses relying on Entropy, Perplexity, and Token Probability metrics. Our experiments reveal that an LLM's uncertainty can increase up to 56.6% when the question is answered incorrectly due to the exposure to false information. At the same time, repeated exposure to the same false information can decrease the model's uncertainty again (-52.8% w.r.t. the answers on the untainted prompts), potentially manipulating the underlying model's beliefs and introducing a drift from its original knowledge. These findings provide insights into LLMs' robustness and vulnerability to adversarial inputs, paving the way for developing more reliable LLM applications across various domains. The code is available at https://github.com/afastowski/knowledge_drift.
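The three uncertainty signals named in the abstract (entropy, perplexity, token probability) can be computed directly from a causal LM's per-step logits. The sketch below shows one plausible way to do so in PyTorch; the mean-pooling over generation steps is an assumption, as the paper may aggregate differently.

```python
# A minimal sketch of the three uncertainty metrics, computed from the
# per-step logits of a causal LM over a generated answer.
import torch
import torch.nn.functional as F

def answer_uncertainty(logits: torch.Tensor, answer_ids: torch.Tensor) -> dict:
    """logits: [T, V] scores at each generation step; answer_ids: [T] sampled tokens."""
    log_probs = F.log_softmax(logits, dim=-1)                         # [T, V]
    token_logp = log_probs.gather(1, answer_ids[:, None]).squeeze(1)  # [T]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()   # mean per-step entropy
    perplexity = torch.exp(-token_logp.mean())          # exp of the mean NLL
    token_prob = token_logp.exp().mean()                # mean sampled-token probability
    return {"entropy": entropy.item(),
            "perplexity": perplexity.item(),
            "token_probability": token_prob.item()}
```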
Citations: 0
AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
Pub Date: 2024-09-11 | DOI: arxiv-2409.07394
Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters. This can hurt performance when using standard decoding techniques, which tend to ignore the context. Existing test-time contrastive methods seek to address this by comparing the LLM's output distribution with and without the context and adjust the model according to the contrast between them. However, we find that these methods frequently misjudge the degree of conflict and struggle to handle instances that vary in their amount of conflict, with static methods over-adjusting when conflict is absent. We propose a fine-grained, instance-level approach called AdaCAD, which dynamically infers the weight of adjustment based on the degree of conflict, as measured by the Jensen-Shannon divergence between distributions representing contextual and parametric knowledge. Our experiments across four models on six diverse question-answering (QA) datasets and three summarization tasks demonstrate that our training-free adaptive method consistently outperforms other decoding methods on QA, with average accuracy gains of 14.21% (absolute) over a static contrastive baseline, and improves the factuality of summaries by 5.59 (AlignScore). Furthermore, our analysis shows that while decoding with contrastive baselines hurts performance when conflict is absent, AdaCAD mitigates these losses, making it more applicable to real-world datasets in which some examples have conflict and others do not.
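The core mechanism is explicit: the adjustment weight is the Jensen-Shannon divergence between the context-conditioned and context-free next-token distributions. A minimal sketch of one decoding step follows; setting the weight to the JSD comes from the abstract, but the combination rule here mirrors the common context-aware-decoding form and the remaining details are assumptions.

```python
# A minimal sketch of JSD-weighted contrastive decoding for one step, on
# unbatched next-token logit vectors of shape [V].
import torch
import torch.nn.functional as F

def jensen_shannon(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """JSD in nats; ranges over [0, ln 2], ~0 when the distributions agree."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adaptive_contrastive_step(logits_with_ctx: torch.Tensor,
                              logits_without_ctx: torch.Tensor) -> torch.Tensor:
    """Return next-token log-probabilities with a conflict-dependent adjustment."""
    p_ctx = F.softmax(logits_with_ctx, dim=-1)     # contextual knowledge
    p_par = F.softmax(logits_without_ctx, dim=-1)  # parametric knowledge
    alpha = jensen_shannon(p_ctx, p_par)           # degree of knowledge conflict
    # Large alpha -> lean on the context; alpha ~ 0 -> near-standard decoding.
    adjusted = (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx
    return F.log_softmax(adjusted, dim=-1)
```

Because alpha shrinks toward zero when the two distributions agree, this step degrades gracefully to ordinary decoding on conflict-free examples, which is exactly the failure mode of static contrastive baselines the abstract describes.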
Citations: 0
Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model
Pub Date: 2024-09-11 | DOI: arxiv-2409.07088
Daehee Kim, Deokhyung Kang, Sangwon Ryu, Gary Geunbae Lee
Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages a Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that a PLM fine-tuned on WikiOFGraph outperforms those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
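As a rough illustration of the pipeline shape (linearize triples, verbalize with an LLM, filter for graph-text consistency), here is a hedged sketch; `call_llm` and `consistency_score` are hypothetical stand-ins, the latter only emulating the filtering role the paper assigns to Data-QuestEval.

```python
# A minimal sketch of LLM-driven graph-to-text pair synthesis with a
# consistency filter. All callables are placeholders, not the paper's code.
from typing import Callable, Optional

def linearize(triples: list[tuple[str, str, str]]) -> str:
    """Flatten (subject, predicate, object) triples into a tagged string."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

def synthesize_pair(triples: list[tuple[str, str, str]],
                    call_llm: Callable[[str], str],
                    consistency_score: Callable[[str, str], float],
                    threshold: float = 0.5) -> Optional[tuple[str, str]]:
    graph = linearize(triples)
    text = call_llm(f"Verbalize this knowledge graph as fluent text:\n{graph}")
    # Keep only pairs whose text is judged consistent with the source graph.
    return (graph, text) if consistency_score(graph, text) >= threshold else None
```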
Citations: 0
Enhancing adversarial robustness in Natural Language Inference using explanations
Pub Date: 2024-09-11 | DOI: arxiv-2409.07423
Alexandros Koulakos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation: only by fine-tuning a classifier on the explanation rather than premise-hypothesis inputs, robustness under various adversarial attacks is achieved in comparison to explanation-free baselines. Moreover, since there is no standard strategy of testing the semantic validity of the generated explanations, we research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
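The defence itself is a small change to a standard fine-tuning recipe: the classifier sees the generated explanation instead of the premise-hypothesis pair. A minimal sketch with Hugging Face Transformers follows; the model choice and the toy example are illustrative assumptions.

```python
# A minimal sketch of fine-tuning an NLI classifier on explanations rather
# than premise-hypothesis inputs. Model and data here are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / neutral / contradiction

# Each training input is a generated explanation, not the premise-hypothesis pair.
explanations = ["The hypothesis restates the premise, so it follows."]
labels = torch.tensor([0])  # hypothetical label id for 'entailment'

batch = tok(explanations, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # standard cross-entropy training step
loss.backward()
```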
Citations: 0
Agent Workflow Memory
Pub Date: 2024-09-11 | DOI: arxiv-2409.07429
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
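A minimal sketch of the idea, induce workflows from successful trajectories and prepend relevant ones to the agent's prompt, is given below; the induction prompt and the word-overlap retrieval are assumptions, not the paper's method.

```python
# A minimal sketch of a workflow memory: distill successful trajectories into
# reusable routines, then retrieve a few to augment the agent's prompt.
from typing import Callable

LLM = Callable[[str], str]

class WorkflowMemory:
    def __init__(self, induce_llm: LLM):
        self.induce_llm = induce_llm
        self.workflows: list[str] = []

    def induce(self, trajectory: str) -> None:
        """Distill a successful action trajectory into a reusable workflow."""
        workflow = self.induce_llm(
            "Summarize this successful web-navigation trajectory as a reusable, "
            f"step-by-step workflow:\n{trajectory}"
        )
        self.workflows.append(workflow)

    def augment_prompt(self, task: str, k: int = 3) -> str:
        # Naive retrieval: keep the k workflows sharing the most words with the task.
        ranked = sorted(self.workflows,
                        key=lambda w: len(set(w.split()) & set(task.split())),
                        reverse=True)
        return "\n\n".join(ranked[:k]) + f"\n\nTask: {task}"
```

In the online setting the abstract describes, `induce` would be called on the agent's own successful test-time episodes, so the memory grows as the agent works.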
Citations: 0
Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
Pub Date: 2024-09-11 | DOI: arxiv-2409.07355
SeongYeub Chu, JongWoo Kim, MunYong Yi
This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four distinct dimensions, consisting of Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, leading to the generation of a wider range of relevant attributes and enhanced text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), but LLMs perform better at those attributes related to external alignment (Consistency and Relevance). Consequently, leveraging both humans and LLMs together produces the best evaluation outcomes. In other words, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
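As a rough sketch of checklist-based evaluation over a merged human/LLM attribute pool, consider the following; the yes/no judging prompt and the unweighted average are assumptions, not the paper's scoring rule.

```python
# A minimal sketch of checklist-based text scoring over attributes gathered
# from human and LLM think-aloud sessions. The judge callable is a placeholder.
from typing import Callable

LLM = Callable[[str], str]

def evaluate(text: str, human_attrs: list[str], llm_attrs: list[str],
             judge: LLM) -> float:
    checklist = list(dict.fromkeys(human_attrs + llm_attrs))  # merge, deduplicate
    scores = []
    for attr in checklist:
        reply = judge(
            f"Question: does the text satisfy '{attr}'? Answer yes or no.\n{text}"
        )
        scores.append(1.0 if reply.strip().lower().startswith("yes") else 0.0)
    return sum(scores) / len(scores)  # fraction of checklist items satisfied
```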
Citations: 0
Learning Efficient Recursive Numeral Systems via Reinforcement Learning
Pub Date: 2024-09-11 | DOI: arxiv-2409.07170
Jonathan D. Thomas, Andrea Silvi, Devdatt Dubhashi, Emil Carlsson, Moa Johansson
The emergence of mathematical concepts, such as number systems, is an understudied area in AI for mathematics and reasoning. It has previously been shown by Carlsson et al. (2021) that by using reinforcement learning (RL), agents can derive simple approximate and exact-restricted numeral systems. However, it is a major challenge to show how more complex recursive numeral systems, similar to the one utilised in English, could arise via a simple learning mechanism such as RL. Here, we introduce an approach towards deriving a mechanistic explanation of the emergence of recursive number systems where we consider an RL agent which directly optimizes a lexicon under a given meta-grammar. Utilising a slightly modified version of the seminal meta-grammar of Hurford (1975), we demonstrate that our RL agent can effectively modify the lexicon towards Pareto-optimal configurations which are comparable to those observed within human numeral systems.
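To make the optimization target concrete, here is a toy sketch: a lexicon of atomic numeral words is searched to trade off lexicon size against average expression length under a simple base-like recursive rule. The grammar, reward weights, and the greedy hill-climbing stand-in for RL are all illustrative simplifications, not Hurford's meta-grammar or the paper's agent.

```python
# A toy sketch of the lexicon-optimization objective: smaller lexicons vs.
# shorter recursive expressions. Atoms 1..10 stay fixed; extras are toggled.
import random

def express(n: int, lexicon: dict[int, str]) -> int:
    """Length (in morphemes) of n under a base-like rule: n = q * base + r."""
    if n in lexicon:
        return 1
    base = max(k for k in lexicon if 1 < k <= n)   # largest usable atom
    q, r = divmod(n, base)
    return express(q, lexicon) + 1 + (express(r, lexicon) if r else 0)

def reward(lexicon: dict[int, str], numbers: range, w: float = 0.5) -> float:
    avg_len = sum(express(n, lexicon) for n in numbers) / len(numbers)
    return -(w * len(lexicon) + (1 - w) * avg_len)  # Pareto-style trade-off

lex = {n: str(n) for n in range(1, 11)}             # atomic words for 1..10
numbers = range(1, 100)
for _ in range(500):                                # greedy stand-in for RL
    cand = dict(lex)
    k = random.randint(11, 99)                      # toggle an extra atom
    if k in cand:
        del cand[k]
    else:
        cand[k] = str(k)
    if reward(cand, numbers) > reward(lex, numbers):
        lex = cand
```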
Citations: 0
Native vs Non-Native Language Prompting: A Comparative Analysis
Pub Date: 2024-09-11 | DOI: arxiv-2409.07054
Mohamed Bayan Kmainasi, Rakif Khan, Ali Ezzat Shahroor, Boushra Bendou, Maram Hasanain, Firoj Alam
Large language models (LLMs) have shown remarkable abilities in different fields, including standard Natural Language Processing (NLP) tasks. To elicit knowledge from LLMs, prompts play a key role, consisting of natural language instructions. Most open and closed source LLMs are trained on available labeled and unlabeled resources--digital content such as text, images, audio, and videos. Hence, these models have better knowledge for high-resourced languages but struggle with low-resourced languages. Since prompts play a crucial role in understanding their capabilities, the language used for prompts remains an important research question. Although there has been significant research in this area, it is still limited, and less has been explored for medium to low-resourced languages. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 12 different Arabic datasets (9.7K data points). In total, we conducted 197 experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our findings suggest that, on average, the non-native prompt performs the best, followed by mixed and native prompts.
Citations: 0