arXiv - CS - Computation and Language: Latest Publications

LLMs + Persona-Plug = Personalized LLMs
Pub Date: 2024-09-18 DOI: arxiv-2409.11901
Jiongnan Liu, Yutao Zhu, Shuting Wang, Xiaochi Wei, Erxue Min, Yu Lu, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou
Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, Persona-Plug. It constructs a user-specific embedding for each individual by modeling all of their historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.
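To make the mechanism concrete, here is a minimal Python/PyTorch sketch of the plug-in idea as the abstract describes it: pool the user's entire history into one embedding and prepend it to the (frozen) LLM's input embeddings as a soft token. All names and shapes are illustrative assumptions, not the paper's actual code.

import torch
import torch.nn as nn

class UserEmbedder(nn.Module):
    # Lightweight plug-in module: pools encoded history into one user vector.
    def __init__(self, hist_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(hist_dim, llm_dim)

    def forward(self, hist_encodings: torch.Tensor) -> torch.Tensor:
        # hist_encodings: (num_history_docs, hist_dim), e.g. produced by a
        # small frozen encoder run over all of the user's historical texts.
        pooled = hist_encodings.mean(dim=0)   # aggregate the whole history
        return self.proj(pooled)              # (llm_dim,)

def personalize_inputs(task_embeds: torch.Tensor, user_vec: torch.Tensor) -> torch.Tensor:
    # Prepend the user embedding as one extra soft token; the LLM itself
    # stays frozen and only the embedder above is trained.
    return torch.cat([user_vec.unsqueeze(0), task_embeds], dim=0)
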
Citations: 0
Enabling Real-Time Conversations with Minimal Training Costs
Pub Date: 2024-09-18 DOI: arxiv-2409.11727
Wang Xu, Shuo Wang, Weilin Zhao, Xu Han, Yukun Yan, Yudi Zhang, Zhe Tao, Zhiyuan Liu, Wanxiang Che
Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the ability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of queries and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
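One way to picture the channel-division-multiplexing idea is a decoding loop that alternates between ingesting newly arrived user tokens and emitting response tokens. The toy simulation below is a sketch under that assumption; generate_step stands in for the model's next-token function and none of this is the paper's implementation.

from collections import deque

def duplex_decode(incoming: deque, generate_step, max_steps: int = 20) -> list:
    # Interleave the input channel (user tokens arriving mid-response) with
    # the output channel (response tokens) inside one decoding loop.
    context, response = [], []
    for _ in range(max_steps):
        if incoming:                                  # input-channel slot
            context.append(incoming.popleft())
        token = generate_step(context + response)     # output-channel slot
        if token == "<eos>":
            break
        response.append(token)
    return response

# Toy next-token function so the loop runs end to end.
toy_step = lambda ctx: "<eos>" if len(ctx) > 6 else f"tok{len(ctx)}"
print(duplex_decode(deque(["hi", "there"]), toy_step))
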
Citations: 0
Measuring Human and AI Values based on Generative Psychometrics with Large Language Models
Pub Date: 2024-09-18 DOI: arxiv-2409.12106
Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, Guojie Song
Human values and their measurement are a long-standing interdisciplinary inquiry. Recent advances in AI have sparked renewed interest in this area, with large language models (LLMs) emerging as both tools and subjects of value measurement. This work introduces Generative Psychometrics for Values (GPV), an LLM-based, data-driven value measurement paradigm, theoretically grounded in text-revealed selective perceptions. We begin by fine-tuning an LLM for accurate perception-level value measurement and verifying the capability of LLMs to parse texts into perceptions, forming the core of the GPV pipeline. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools. Then, extending GPV to LLM value measurement, we advance the current art with 1) a psychometric methodology that measures LLM values based on their scalable and free-form outputs, enabling context-specific measurement; 2) a comparative analysis of measurement paradigms, indicating response biases of prior methods; and 3) an attempt to bridge LLM values and their safety, revealing the predictive power of different value systems and the impacts of various values on LLM safety. Through interdisciplinary efforts, we aim to leverage AI for next-generation psychometrics and psychometrics for value-aligned AI.
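As a rough operational reading of the pipeline (parse texts into perceptions, then score perceptions against value dimensions), a hedged Python sketch might look as follows. The llm argument is a hypothetical text-in/text-out callable and the prompts are invented for illustration, not taken from GPV.

def parse_perceptions(llm, text: str) -> list[str]:
    # Step 1: have an LLM decompose free text into value-revealing perceptions.
    prompt = ("List the value-revealing perceptions expressed in this text, "
              "one per line:\n" + text)
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def score_perception(llm, perception: str, value: str) -> float:
    # Step 2: rate how strongly one perception supports one value dimension.
    prompt = (f"On a scale from -1 to 1, how strongly does this perception "
              f"support the value '{value}'? Answer with a number only.\n"
              f"Perception: {perception}")
    return float(llm(prompt))

def measure_values(llm, document: str, values: list[str]) -> dict[str, float]:
    # Aggregate perception-level scores into one score per value dimension.
    perceptions = parse_perceptions(llm, document)
    return {v: sum(score_perception(llm, p, v) for p in perceptions)
               / max(len(perceptions), 1)
            for v in values}
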
Citations: 0
Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation
Pub Date: 2024-09-18 DOI: arxiv-2409.11703
Chunliang Tao, Xiaojing Fan, Yahe Yang
As Large Language Models (LLMs) advance in natural language processing, there is growing interest in leveraging their capabilities to simplify software interactions. In this paper, we propose a novel system that integrates LLMs for both classifying natural language inputs into corresponding API calls and automating the creation of sample datasets tailored to specific API functions. By classifying natural language commands, our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization. Our dataset generation approach also enables the efficient and systematic evaluation of different LLMs in classifying API calls, offering a practical tool for developers or business owners to assess the suitability of LLMs for customized API management. We conduct experiments on several prominent LLMs using generated sample datasets for various API functions. The results show that GPT-4 achieves a high classification accuracy of 0.996, while LLaMA-3-8B performs much worse at 0.759. These findings highlight the potential of LLMs to transform API management and validate the effectiveness of our system in guiding model testing and selection across diverse applications.
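A minimal sketch of the two components the abstract names, classification of commands into API calls and synthetic sample generation, might look like this in Python. The API registry, the prompts, and the llm callable are all assumptions for illustration, not the paper's framework.

API_FUNCTIONS = ["create_user", "delete_user", "list_orders"]  # toy registry

def classify_command(llm, command: str) -> str:
    # Map a natural-language command onto exactly one known API function.
    prompt = (f"Map the user command to exactly one API name from "
              f"{API_FUNCTIONS}. Answer with the name only.\nCommand: {command}")
    answer = llm(prompt).strip()
    return answer if answer in API_FUNCTIONS else "unknown"

def synthesize_samples(llm, api: str, n: int = 5) -> list[tuple[str, str]]:
    # Generate labeled (command, api) pairs for evaluating classifiers.
    prompt = f"Write {n} distinct user commands that should invoke '{api}', one per line."
    return [(line.strip(), api) for line in llm(prompt).splitlines() if line.strip()]

def accuracy(llm, samples: list[tuple[str, str]]) -> float:
    # Score one LLM on a synthetic test set, as in the GPT-4 vs LLaMA-3-8B comparison.
    return sum(classify_command(llm, cmd) == api for cmd, api in samples) / len(samples)
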
Citations: 0
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Pub Date: 2024-09-18 DOI: arxiv-2409.12122
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang
In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it's possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
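The abstract does not spell out how the RM guides sampling at inference; one common realization of that idea is best-of-N reranking, sketched below purely as an assumption. Here generate and reward are hypothetical stand-ins for the policy model and the reward model.

import random

def best_of_n(generate, reward, problem: str, n: int = 8) -> str:
    # Draw n candidate solutions and keep the one the RM scores highest.
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: reward(problem, solution))

# Toy stand-ins so the control flow is runnable end to end.
toy_generate = lambda p: f"answer={random.randint(0, 9)}"
toy_reward = lambda p, s: -abs(int(s.split("=")[1]) - 7)  # prefers answers near 7
print(best_of_n(toy_generate, toy_reward, "toy problem"))
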
Citations: 0
Finetuning Language Models to Emit Linguistic Expressions of Uncertainty
Pub Date: 2024-09-18 DOI: arxiv-2409.12180
Arslan Chaudhry, Sridhar Thiagarajan, Dilan Gorur
Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks. Despite their broad utility, LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing. As a result, end-users struggle to consistently align the confidence expressed by LLMs with the accuracy of their predictions, often leading to either blind trust in all outputs or a complete disregard for their reliability. In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty. Specifically, we measure the calibration of pre-trained models and then fine-tune language models to generate calibrated linguistic expressions of uncertainty. Through experiments on various question-answering datasets, we demonstrate that LLMs are well-calibrated in assessing their predictions, and supervised finetuning based on the model's own confidence leads to well-calibrated expressions of uncertainty, particularly for single-claim answers.
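One simple way to build the uncertainty-augmented targets such finetuning needs is to bucket the model's measured confidence into hedging phrases and prepend them to the answer text. The sketch below illustrates only that data-construction step; the phrase table and thresholds are invented, not the paper's.

HEDGES = [
    (0.9, "I'm almost certain that"),
    (0.7, "I'm fairly confident that"),
    (0.5, "I think that"),
    (0.0, "I'm unsure, but possibly"),
]

def hedge_for(confidence: float) -> str:
    # Pick the strongest hedge whose threshold the confidence clears.
    for threshold, phrase in HEDGES:
        if confidence >= threshold:
            return phrase
    return HEDGES[-1][1]

def make_sft_target(answer: str, confidence: float) -> str:
    # Turn (answer, model confidence) into a calibrated training target.
    return f"{hedge_for(confidence)} {answer}"

print(make_sft_target("the capital of Australia is Canberra", 0.93))
print(make_sft_target("the treaty was signed in 1921", 0.41))
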
Citations: 0
MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts
Pub Date: 2024-09-18 DOI: arxiv-2409.11844
Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Yujiu Yang, Yan Teng, Yingchun Wang
Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. LLM Unlearning, a post-hoc approach to remove this information from trained LLMs, offers a promising solution to mitigate these risks. However, previous practices face three key challenges: 1. Utility: successful unlearning often causes catastrophic collapse on unrelated tasks. 2. Efficiency: many methods either involve adding similarly sized models, which slows down unlearning or inference, or require retain data that are difficult to obtain. 3. Robustness: even effective methods may still leak data via extraction techniques. To address these challenges, we propose MEOW, a simple yet effective gradient descent-based unlearning method. Specifically, we use an offline LLM to generate a set of inverted facts. Then, we design a new metric, MEMO, to quantify memorization in LLMs. Finally, based on the signals provided by MEMO, we select the most appropriate set of inverted facts and finetune the model based on them. We evaluate MEOW on the commonly used unlearning benchmark, ToFU, with Llama2-7B-Chat and Phi-1.5B, and test it on both NLU and NLG tasks. Results demonstrate significant improvement of MEOW in forget quality without substantial loss in model utility. Meanwhile, MEOW does not exhibit significant degradation in NLU or NLG capabilities, and there is even a slight improvement in NLU performance.
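Read at face value, the loop is: invert the sensitive facts with an offline LLM, select a subset (guided by MEMO in the paper), and finetune on the inversions with plain gradient descent. The sketch below follows that outline with Hugging Face-style calls; the inversion prompt is invented and MEMO itself is not reproduced here.

def invert_facts(llm, fact: str, k: int = 5) -> list[str]:
    # Offline LLM writes plausible-but-false variants of one sensitive fact.
    prompt = f"Write {k} plausible but false variants of this fact, one per line:\n{fact}"
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def unlearn_on_inversions(model, tokenizer, inverted_facts: list[str], optimizer) -> None:
    # Ordinary causal-LM finetuning steps on the selected inverted facts.
    model.train()
    for text in inverted_facts:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
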
Citations: 0
Dual-Layer Training and Decoding of Large Language Model with Simultaneously Thinking and Speaking
Pub Date: 2024-09-18 DOI: arxiv-2409.12059
Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji
Large Language Models can reasonably understand and generate human expressions but may lack thorough thinking and reasoning mechanisms. Recently there have been several studies which enhance the thinking ability of language models, but most of them are not data-driven or training-based. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model architecture called TaS which first considers the thoughts and then expresses the response based upon the query. We design several pipelines to annotate or generate the thought contents from prompt-response samples, then add language heads in a middle layer which behaves as the thinking layer. We train the language model with the thoughts-augmented data and successfully let the thinking layer automatically generate reasonable thoughts and finally output more reasonable responses. Both qualitative examples and quantitative results validate the effectiveness and performance of TaS. Our code is available at https://anonymous.4open.science/r/TadE.
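Structurally, "language heads in a middle layer" suggests a model with two decoding heads: one reading the hidden state of an intermediate transformer layer (thoughts) and one reading the final layer (response). A hedged PyTorch sketch of that shape, with all names and interfaces assumed rather than taken from the released code:

import torch
import torch.nn as nn

class TaSSketch(nn.Module):
    def __init__(self, layers: nn.ModuleList, vocab_size: int, dim: int, think_at: int):
        super().__init__()
        self.layers = layers                          # transformer blocks
        self.think_at = think_at                      # index of the thinking layer
        self.think_head = nn.Linear(dim, vocab_size)  # decodes thought tokens
        self.speak_head = nn.Linear(dim, vocab_size)  # decodes the response

    def forward(self, hidden: torch.Tensor):
        thought_logits = None
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)
            if i == self.think_at:
                # Supervised with annotated/generated thought contents.
                thought_logits = self.think_head(hidden)
        # The final-layer head is supervised with the response, as usual.
        return thought_logits, self.speak_head(hidden)
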
Citations: 0
Extract-and-Abstract: Unifying Extractive and Abstractive Summarization within Single Encoder-Decoder Framework
Pub Date: 2024-09-18 DOI: arxiv-2409.11827
Yuping Wu, Hao Li, Hongbo Zhu, Goran Nenadic, Xiao-Jun Zeng
Extract-then-Abstract is a naturally coherent paradigm to conduct abstractive summarization with the help of salient information identified by the extractive model. Previous works that adopt this paradigm train the extractor and abstractor separately and introduce extra parameters to highlight the extracted salients to the abstractor, which results in error accumulation and additional training costs. In this paper, we first introduce a parameter-free highlight method into the encoder-decoder framework: replacing the encoder attention mask with a saliency mask in the cross-attention module to force the decoder to focus only on salient parts of the input. A preliminary analysis compares different highlight methods, demonstrating the effectiveness of our saliency mask. We further propose the novel extract-and-abstract paradigm, ExtAbs, which jointly and seamlessly performs extractive and abstractive summarization tasks within a single encoder-decoder model to reduce error accumulation. In ExtAbs, the vanilla encoder is augmented to extract salients, and the vanilla decoder is modified with the proposed saliency mask to generate summaries. Built upon BART and PEGASUS, experiments on three datasets show that ExtAbs can achieve superior performance to baselines on the extractive task and performs comparably to, or even better than, the vanilla models on the abstractive task.
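The parameter-free highlight reduces to one change inside cross-attention: mask out encoder positions the extractor did not mark as salient. A minimal single-head PyTorch sketch, with shapes assumed for illustration rather than taken from the paper:

import torch

def saliency_masked_cross_attention(q, k, v, saliency_mask):
    # q: (tgt_len, d); k, v: (src_len, d)
    # saliency_mask: (src_len,) bool, True where the encoder token is salient;
    # at least one position must be True or the softmax degenerates.
    scores = q @ k.transpose(0, 1) / k.shape[-1] ** 0.5   # (tgt_len, src_len)
    scores = scores.masked_fill(~saliency_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v              # attends only to salients
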
Citations: 0
PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
Pub Date: 2024-09-18 DOI: arxiv-2409.12060
Andrianos Michail, Simon Clematide, Juri Opitz
The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models on a paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we release PARAPHRASUS, a benchmark designed for multi-dimensional assessment of paraphrase detection models and finer model selection. We find that paraphrase detection models under a fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset.
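Operationally, "multi-dimensional assessment" means scoring one detector across several test suites that probe different paraphrase phenomena instead of a single classification set. A minimal sketch of that evaluation loop, with the dataset layout assumed (this is not the released benchmark code):

def evaluate_detector(detector, suites: dict[str, list[tuple[str, str, bool]]]) -> dict[str, float]:
    # detector(a, b) -> bool; each suite holds (text_a, text_b, is_paraphrase).
    results = {}
    for name, pairs in suites.items():
        correct = sum(detector(a, b) == label for a, b, label in pairs)
        results[name] = correct / len(pairs)
    return results  # per-dimension accuracies make the trade-offs visible
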
Citations: 0