Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars, yet the logical connections between reasoning steps can also help depict the problem-solving process. In this paper, we propose a novel method named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries the LLM to generate an initial response, then expresses the intermediate problem-solving steps as a graph structure. After that, it employs a graph kernel to select exemplars with both semantic and structural similarity. Extensive experiments demonstrate that structural relationships help align queries with candidate exemplars. The efficacy of RGER on math and logical reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches. Our code is released at https://github.com/Yukang-Lin/RGER.
{"title":"Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning","authors":"Yukang Lin, Bingchen Zhong, Shuoran Jiang, Joanna Siebert, Qingcai Chen","doi":"arxiv-2409.11147","DOIUrl":"https://doi.org/arxiv-2409.11147","url":null,"abstract":"Large language models(LLMs) have exhibited remarkable few-shot learning\u0000capabilities and unified the paradigm of NLP tasks through the in-context\u0000learning(ICL) technique. Despite the success of ICL, the quality of the\u0000exemplar demonstrations can significantly influence the LLM's performance.\u0000Existing exemplar selection methods mainly focus on the semantic similarity\u0000between queries and candidate exemplars. On the other hand, the logical\u0000connections between reasoning steps can be beneficial to depict the\u0000problem-solving process as well. In this paper, we proposes a novel method\u0000named Reasoning Graph-enhanced Exemplar Retrieval(RGER). RGER first quires LLM\u0000to generate an initial response, then expresses intermediate problem-solving\u0000steps to a graph structure. After that, it employs graph kernel to select\u0000exemplars with semantic and structural similarity. Extensive experiments\u0000demonstrate the structural relationship is helpful to the alignment of queries\u0000and candidate exemplars. The efficacy of RGER on math and logit reasoning tasks\u0000showcases its superiority over state-of-the-art retrieval-based approaches. Our\u0000code is released at https://github.com/Yukang-Lin/RGER.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting key findings on their behavior in a non-English environment. First, we discover that English evaluation capabilities significantly influence language-specific capabilities, often more than language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset containing 5,012 human annotations in Korean.
{"title":"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do","authors":"Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong","doi":"arxiv-2409.11239","DOIUrl":"https://doi.org/arxiv-2409.11239","url":null,"abstract":"LLM-as-a-Judge and reward models are widely used alternatives of\u0000multiple-choice questions or human annotators for large language model (LLM)\u0000evaluation. Their efficacy shines in evaluating long-form responses, serving a\u0000critical role as evaluators of leaderboards and as proxies to align LLMs via\u0000reinforcement learning. However, despite their popularity, their effectiveness\u0000outside of English remains largely unexplored. In this paper, we conduct a\u0000comprehensive analysis on automated evaluators, reporting key findings on their\u0000behavior in a non-English environment. First, we discover that English\u0000evaluation capabilities significantly influence language-specific capabilities,\u0000often more than the language proficiency itself, enabling evaluators trained in\u0000English to easily transfer their skills to other languages. Second, we identify\u0000critical shortcomings, where LLMs fail to detect and penalize errors, such as\u0000factual inaccuracies, cultural misrepresentations, and the presence of unwanted\u0000language. Finally, we release Kudge, the first non-English meta-evaluation\u0000dataset containing 5,012 human annotations in Korean.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
{"title":"Egalitarian Language Representation in Language Models: It All Begins with Tokenizers","authors":"Menan Velayuthan, Kengatharaiyer Sarveswaran","doi":"arxiv-2409.11501","DOIUrl":"https://doi.org/arxiv-2409.11501","url":null,"abstract":"Tokenizers act as a bridge between human language and the latent space of\u0000language models, influencing how language is represented in these models. Due\u0000to the immense popularity of English-Centric Large Language Models (LLMs),\u0000efforts are being made to adapt them for other languages. However, we\u0000demonstrate that, from a tokenization standpoint, not all tokenizers offer fair\u0000representation for complex script languages such as Tamil, Sinhala, and Hindi,\u0000primarily due to the choice of pre-tokenization methods. We go further to show\u0000that pre-tokenization plays a more critical role than the tokenization\u0000algorithm itself in achieving an egalitarian representation of these complex\u0000script languages. To address this, we introduce an improvement to the Byte Pair\u0000Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme\u0000Pair Encoding (GPE). Our experiments show that grapheme-based character\u0000extraction outperforms byte-level tokenizers for complex scripts. We validate\u0000this approach through experiments on Tamil, Sinhala, and Hindi.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
{"title":"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models","authors":"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven","doi":"arxiv-2409.11353","DOIUrl":"https://doi.org/arxiv-2409.11353","url":null,"abstract":"Hallucination, the generation of factually incorrect content, is a growing\u0000challenge in Large Language Models (LLMs). Existing detection and mitigation\u0000methods are often isolated and insufficient for domain-specific needs, lacking\u0000a standardized pipeline. This paper introduces THaMES (Tool for Hallucination\u0000Mitigations and EvaluationS), an integrated framework and library addressing\u0000this gap. THaMES offers an end-to-end solution for evaluating and mitigating\u0000hallucinations in LLMs, featuring automated test set generation, multifaceted\u0000benchmarking, and adaptable mitigation strategies. It automates test set\u0000creation from any corpus, ensuring high data quality, diversity, and\u0000cost-efficiency through techniques like batch processing, weighted sampling,\u0000and counterfactual validation. THaMES assesses a model's ability to detect and\u0000reduce hallucinations across various tasks, including text generation and\u0000binary classification, applying optimal mitigation strategies like In-Context\u0000Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\u0000Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base\u0000of academic papers, political news, and Wikipedia reveal that commercial models\u0000like GPT-4o benefit more from RAG than ICL, while open-weight models like\u0000Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\u0000significantly enhances the performance of Llama-3.1-8B-Instruct in both\u0000evaluation tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar
This paper introduces the concept of an education tool that utilizes Generative Artificial Intelligence (GenAI) to enhance storytelling for children. The system combines GenAI-driven narrative co-creation, text-to-speech conversion, and text-to-video generation to produce an engaging experience for learners. We describe the co-creation process, the adaptation of narratives into spoken words using text-to-speech models, and the transformation of these narratives into contextually relevant visuals through text-to-video technology. Our evaluation covers the linguistics of the generated stories, the text-to-speech conversion quality, and the accuracy of the generated visuals.
{"title":"The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives","authors":"Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar","doi":"arxiv-2409.11261","DOIUrl":"https://doi.org/arxiv-2409.11261","url":null,"abstract":"This paper introduces the concept of an education tool that utilizes\u0000Generative Artificial Intelligence (GenAI) to enhance storytelling for\u0000children. The system combines GenAI-driven narrative co-creation,\u0000text-to-speech conversion, and text-to-video generation to produce an engaging\u0000experience for learners. We describe the co-creation process, the adaptation of\u0000narratives into spoken words using text-to-speech models, and the\u0000transformation of these narratives into contextually relevant visuals through\u0000text-to-video technology. Our evaluation covers the linguistics of the\u0000generated stories, the text-to-speech conversion quality, and the accuracy of\u0000the generated visuals.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn, document-grounded benchmark test sets.
{"title":"Multi-Document Grounded Multi-Turn Synthetic Dialog Generation","authors":"Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian","doi":"arxiv-2409.11500","DOIUrl":"https://doi.org/arxiv-2409.11500","url":null,"abstract":"We introduce a technique for multi-document grounded multi-turn synthetic\u0000dialog generation that incorporates three main ideas. First, we control the\u0000overall dialog flow using taxonomy-driven user queries that are generated with\u0000Chain-of-Thought (CoT) prompting. Second, we support the generation of\u0000multi-document grounded dialogs by mimicking real-world use of retrievers to\u0000update the grounding documents after every user-turn in the dialog. Third, we\u0000apply LLM-as-a-Judge to filter out queries with incorrect answers. Human\u0000evaluation of the synthetic dialog data suggests that the data is diverse,\u0000coherent, and includes mostly correct answers. Both human and automatic\u0000evaluations of answerable queries indicate that models fine-tuned on synthetic\u0000dialogs consistently out-perform those fine-tuned on existing human generated\u0000training data across four publicly available multi-turn document grounded\u0000benchmark test sets.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
{"title":"Linear Recency Bias During Training Improves Transformers' Fit to Reading Times","authors":"Christian Clark, Byung-Doh Oh, William Schuler","doi":"arxiv-2409.11250","DOIUrl":"https://doi.org/arxiv-2409.11250","url":null,"abstract":"Recent psycholinguistic research has compared human reading times to\u0000surprisal estimates from language models to study the factors shaping human\u0000sentence processing difficulty. Previous studies have shown a strong fit\u0000between surprisal values from Transformers and reading times. However, standard\u0000Transformers work with a lossless representation of the entire previous\u0000linguistic context, unlike models of human language processing that include\u0000memory decay. To bridge this gap, this paper evaluates a modification of the\u0000Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to\u0000attention scores. Surprisal estimates with ALiBi show an improved fit to human\u0000reading times compared to a standard Transformer baseline. A subsequent\u0000analysis of attention heads suggests that ALiBi's mixture of slopes -- which\u0000determine the rate of memory decay in each attention head -- may play a role in\u0000the improvement by helping models with ALiBi to track different kinds of\u0000linguistic dependencies.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1243 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven
Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labeled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias over time within model families.
{"title":"HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection","authors":"Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven","doi":"arxiv-2409.11579","DOIUrl":"https://doi.org/arxiv-2409.11579","url":null,"abstract":"Stereotypes are generalised assumptions about societal groups, and even\u0000state-of-the-art LLMs using in-context learning struggle to identify them\u0000accurately. Due to the subjective nature of stereotypes, where what constitutes\u0000a stereotype can vary widely depending on cultural, social, and individual\u0000perspectives, robust explainability is crucial. Explainable models ensure that\u0000these nuanced judgments can be understood and validated by human users,\u0000promoting trust and accountability. We address these challenges by introducing\u0000HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text\u0000Stereotype Detection), a framework that enhances model performance, minimises\u0000carbon footprint, and provides transparent, interpretable explanations. We\u0000establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising\u000057,201 labeled texts across six groups, including under-represented\u0000demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm\u0000that BERT models fine-tuned on EMGSD outperform those trained on individual\u0000components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model\u0000using SHAP to generate token-level importance values, ensuring alignment with\u0000human understanding, and calculate explainability confidence scores by\u0000comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess\u0000stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias\u0000over time within model families.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat
Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.
{"title":"Enriching Datasets with Demographics through Large Language Models: What's in a Name?","authors":"Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat","doi":"arxiv-2409.11491","DOIUrl":"https://doi.org/arxiv-2409.11491","url":null,"abstract":"Enriching datasets with demographic information, such as gender, race, and\u0000age from names, is a critical task in fields like healthcare, public policy,\u0000and social sciences. Such demographic insights allow for more precise and\u0000effective engagement with target populations. Despite previous efforts\u0000employing hidden Markov models and recurrent neural networks to predict\u0000demographics from names, significant limitations persist: the lack of\u0000large-scale, well-curated, unbiased, publicly available datasets, and the lack\u0000of an approach robust across datasets. This scarcity has hindered the\u0000development of traditional supervised learning approaches. In this paper, we\u0000demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can\u0000perform as well as, if not better than, bespoke models trained on specialized\u0000data. We apply these LLMs to a variety of datasets, including a real-life,\u0000unlabelled dataset of licensed financial professionals in Hong Kong, and\u0000critically assess the inherent demographic biases in these models. Our work not\u0000only advances the state-of-the-art in demographic enrichment but also opens\u0000avenues for future research in mitigating biases in LLMs.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan
Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs), but its performance is still underwhelming due to the large amount of noisy preference data yielded in the loop. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework to make the LLM self-evolve with reliable feedback. The key idea is to mitigate the noisy preference data derived from the current policy and reward models by performing pair-wise uncertainty estimation and judiciously reliable feedback sampling. To reach this goal, we introduce an estimator model, which incorporates Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation for the preference data derived from the LLM policy. Compared to existing methods that directly filter generated responses based on the reward score, the estimator focuses on the model uncertainty in a pair-wise manner and effectively bypasses the confirmation bias problem of the reward model. Additionally, we propose an uncertainty-enhanced self-evolution algorithm to improve the robustness of preference optimization and encourage the LLM to generate responses with both high reward and certainty. Extensive experiments over multiple benchmarks demonstrate that our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
{"title":"Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization","authors":"Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan","doi":"arxiv-2409.11212","DOIUrl":"https://doi.org/arxiv-2409.11212","url":null,"abstract":"Iterative preference optimization has recently become one of the de-facto\u0000training paradigms for large language models (LLMs), but the performance is\u0000still underwhelming due to too much noisy preference data yielded in the loop.\u0000To combat this issue, we present an textbf{U}ncertainty-enhanced\u0000textbf{P}reference textbf{O}ptimization (UPO) framework to make the LLM\u0000self-evolve with reliable feedback. The key idea is mitigating the noisy\u0000preference data derived from the current policy and reward models by performing\u0000pair-wise uncertainty estimation and judiciously reliable feedback sampling. To\u0000reach this goal, we thus introduce an estimator model, which incorporates Monte\u0000Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty\u0000estimation for the preference data derived from the LLM policy. Compared to the\u0000existing methods that directly filter generated responses based on the reward\u0000score, the estimator focuses on the model uncertainty in a pair-wise manner and\u0000effectively bypasses the confirmation bias problem of the reward model.\u0000Additionally, we also propose an uncertainty-enhanced self-evolution algorithm\u0000to improve the robustness of preference optimization and encourage the LLM to\u0000generate responses with both high reward and certainty. Extensive experiments\u0000over multiple benchmarks demonstrate that our framework substantially\u0000alleviates the noisy problem and improves the performance of iterative\u0000preference optimization.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}