
Latest publications in arXiv - CS - Computation and Language

Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
Pub Date: 2024-09-11 | DOI: arxiv-2409.07072
Milad Alshomary, Narutatsu Ri, Marianna Apidianaki, Ajay Patel, Smaranda Muresan, Kathleen McKeown
Recent state-of-the-art authorship attribution methods learn authorship representations of texts in a latent, non-interpretable space, hindering their usability in real-world applications. Our work proposes a novel approach to interpreting these learned embeddings by identifying representative points in the latent space and utilizing LLMs to generate informative natural language descriptions of the writing style of each point. We evaluate the alignment of our interpretable space with the latent one and find that it achieves the best prediction agreement compared to other baselines. Additionally, we conduct a human evaluation to assess the quality of these style descriptions, validating their utility as explanations for the latent space. Finally, we investigate whether human performance on the challenging authorship attribution (AA) task improves when aided by our system's explanations, finding an average improvement of around +20% in accuracy.
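The first step (picking representative points in the latent space, each later described in natural language by an LLM) can be sketched as follows. The k-means clustering, function names, and synthetic data are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def representative_points(embeddings, k, iters=50, seed=0):
    """Cluster the latent space and return one representative index per
    cluster: the member nearest its centroid (a tiny k-means sketch)."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    return dists.argmin(axis=0)  # one representative index per cluster

# Three synthetic "style clusters" in an 8-d latent space.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(loc, 0.1, (20, 8)) for loc in (0.0, 1.0, 2.0)])
reps = representative_points(emb, k=3)
# Each reps[i] indexes a text whose shared style an LLM would then be
# prompted to describe in natural language.
```

The texts nearest each representative point would then be passed to an LLM for a style description.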
Citations: 0
SimulBench: Evaluating Language Models with Creative Simulation Tasks
Pub Date: 2024-09-11 | DOI: arxiv-2409.07641
Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest first using a fixed LLM as a user agent to engage with an LLM to collect dialogues under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.
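The dialogue-collection loop (a fixed user-agent LLM conversing with a target LLM under a task prompt) can be sketched as below; the stub callables stand in for real LLM API calls and are hypothetical names, not part of the benchmark:

```python
def collect_dialogue(task, user_agent, target, turns=3):
    """Run a multi-turn exchange and keep the transcript for later judging."""
    history = [("user", task)]
    for _ in range(turns):
        history.append(("assistant", target(history)))   # target LLM replies
        history.append(("user", user_agent(history)))    # fixed LLM plays the user
    return history

def stub_target(history):   # stand-in for an LLM acting as, e.g., a Linux terminal
    return f"$ output for: {history[-1][1]}"

def stub_user(history):     # stand-in for the fixed user-agent LLM
    return f"follow-up {len(history) // 2}"

script = collect_dialogue("Act as a Linux terminal. First command: ls",
                          stub_user, stub_target)
```

An evaluator model (GPT-4 in the paper) would then score the final response in `script`.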
Citations: 0
Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization
Pub Date: 2024-09-11 | DOI: arxiv-2409.07335
Mehrdad Zakershahrak, Samira Ghodratnama
The rapid advancement of artificial intelligence systems has brought the challenge of AI alignment to the forefront of research, particularly in complex decision-making and task execution. As these systems surpass human-level performance in sophisticated problems, ensuring their alignment with human values, intentions, and ethical guidelines becomes crucial. Building on previous work in explanation generation for human-agent alignment, we address the more complex dynamics of multi-agent systems and human-AI teams. This paper introduces a novel approach to model alignment through weak-to-strong generalization in the context of language models. We present a framework where a strong model facilitates the improvement of a weaker model, bridging the gap between explanation generation and model alignment. Our method, formalized as a facilitation function, allows for the transfer of capabilities from advanced models to less capable ones without direct access to extensive training data. Our results suggest that this facilitation-based approach not only enhances model performance but also provides insights into the nature of model alignment and the potential for scalable oversight of AI systems.
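The facilitation-function idea can be illustrated minimally: the strong model's explanation is supplied to the weak model as extra context, defining an improved model. Both "models" here are toy callables standing in for the paper's systems:

```python
def facilitate(weak, strong):
    """Return an improved model: weak answers with the strong model's
    explanation as a hint (a sketch of the facilitation function)."""
    def improved(x):
        explanation = strong(x)            # strong model explains the input
        return weak(x, hint=explanation)   # weak model answers using the hint
    return improved

def strong_model(x):                       # hypothetical stand-in
    return f"key fact about {x}"

def weak_model(x, hint=None):              # hypothetical stand-in
    return f"answer({x})" if hint is None else f"answer({x}) using {hint}"

improved = facilitate(weak_model, strong_model)
```

The improvement is measured by comparing `improved` against the unaided `weak_model` on the same inputs.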
Citations: 0
You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling
Pub Date: 2024-09-11 | DOI: arxiv-2409.06949
Jaewoo Song, Andrew Zhu, Chris Callison-Burch
Developing a consistent and reliable AI game master for text-based games is a challenging task due to the limitations of large language models (LLMs) and the complexity of the game master's role. This paper presents a novel approach to enhance AI game masters by leveraging function calling in the context of the table-top role-playing game "Jim Henson's Labyrinth: The Adventure Game." Our methodology involves integrating game-specific controls through functions, which we show improves the narrative quality and state-update consistency of the AI game master. The experimental results, based on human evaluations and unit tests, demonstrate the effectiveness of our approach in enhancing gameplay experience and maintaining coherence with the game state. This work contributes to the advancement of game AI and interactive storytelling, offering insights into the design of more engaging and consistent AI-driven game masters.
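The function-calling pattern can be sketched as a registry of game-state functions that the model's tool calls are dispatched to, so state updates happen in code rather than in free-form narration. The state, functions, and the hand-written call dict (standing in for a real model response) are invented for illustration:

```python
GAME_STATE = {"hours_left": 13, "inventory": []}

def advance_time(hours: int):
    GAME_STATE["hours_left"] -= hours
    return f"{GAME_STATE['hours_left']} hours remain."

def take_item(item: str):
    GAME_STATE["inventory"].append(item)
    return f"You take the {item}."

FUNCTIONS = {"advance_time": advance_time, "take_item": take_item}

def dispatch(call):
    """Route a model-emitted tool call to the matching game function."""
    return FUNCTIONS[call["name"]](**call["arguments"])

# A real system would parse this call out of the LLM's response.
msg = dispatch({"name": "advance_time", "arguments": {"hours": 2}})
```

The returned message is fed back to the model, which narrates around the now-authoritative game state.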
Citations: 0
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Pub Date: 2024-09-11 | DOI: arxiv-2409.07146
Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining a compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
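The adaptive-forgetting idea can be illustrated with a generic gated linear-attention recurrence: a 2-D state is decayed by a per-step gate and written with a key-value outer product, then read with the query. This is a simplified single-layer sketch of the GLA family, not GSA's exact two-layer softmax-linked formulation:

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """Recurrent form: S_t = g_t * S_{t-1} + k_t v_t^T, o_t = S_t^T q_t."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # compact recurrent state
    outputs = np.zeros((T, d_v))
    for t in range(T):
        S = g[t][:, None] * S + np.outer(k[t], v[t])  # forget, then write
        outputs[t] = S.T @ q[t]                        # read with the query
    return outputs

T, d_k, d_v = 4, 3, 2
rng = np.random.default_rng(0)
out = gated_linear_attention(rng.normal(size=(T, d_k)),
                             rng.normal(size=(T, d_k)),
                             rng.normal(size=(T, d_v)),
                             rng.uniform(0.8, 1.0, size=(T, d_k)))
```

The state size is fixed at `d_k × d_v` regardless of sequence length, which is what makes inference linear-time.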
Citations: 0
Using Generative Agents to Create Tip Sheets for Investigative Data Reporting
Pub Date: 2024-09-11 | DOI: arxiv-2409.07286
Joris Veerbeek, Nicholas Diakopoulos
This paper introduces a system that uses generative AI agents to create tip sheets for investigative data reporting. Our system employs three specialized agents -- an analyst, a reporter, and an editor -- to collaboratively generate and refine tips from datasets. We validate this approach using real-world investigative stories, demonstrating that our agent-based system generally generates more newsworthy and accurate insights than a baseline model without agents, although some variability was noted between different stories. Our findings highlight the potential of generative AI to provide leads for investigative data reporting.
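The three-agent pipeline can be sketched as a chain of stages; each stage here is a toy callable standing in for an LLM agent, and the strings are invented for illustration:

```python
def analyst(dataset):
    """Stand-in for the analyst agent: surface patterns in the data."""
    return [f"pattern: {col} is skewed" for col in dataset]

def reporter(findings):
    """Stand-in for the reporter agent: turn findings into candidate tips."""
    return [f"TIP: investigate '{f}'" for f in findings]

def editor(tips):
    """Stand-in for the editor agent: keep only tips grounded in a finding."""
    return [t for t in tips if "skewed" in t]

def tip_sheet(dataset):
    return editor(reporter(analyst(dataset)))

tips = tip_sheet(["payments", "contracts"])
```

In the real system each stage is an LLM call, and the editor's refinements feed the final tip sheet given to reporters.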
Citations: 0
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Pub Date: 2024-09-11 | DOI: arxiv-2409.07388
Guimin Hu, Yi Xin, Weimin Lyu, Haojian Huang, Chang Sun, Zhihong Zhu, Lin Gui, Ruichu Cai
Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents the recent trends of multimodal affective computing from an NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across various tasks, offering a comprehensive report on the recent progress in multimodal affective computing from an NLP perspective. This survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. It also briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes, as well as the technical approaches, challenges, and future directions in multimodal affective computing. To support further research, we have released a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community.
Citations: 0
Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games
Pub Date: 2024-09-10 | DOI: arxiv-2409.06518
Juhwan Choi, YoungBin Kim
Large language models (LLMs) have become a dominant approach in natural language processing, yet their internal knowledge structures remain largely unexplored. In this paper, we analyze the internal knowledge structures of LLMs using historical medal tallies from the Olympic Games. We task the models with providing the medal counts for each team and identifying which teams achieved specific rankings. Our results reveal that while state-of-the-art LLMs perform remarkably well in reporting medal counts for individual teams, they struggle significantly with questions about specific rankings. This suggests that the internal knowledge structures of LLMs are fundamentally different from those of humans, who can easily infer rankings from known medal counts. To support further research, we publicly release our code, dataset, and model outputs.
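The probe rests on the fact that rankings are mechanically derivable from medal counts, so a model that reports counts correctly but fails ranking questions exposes a structural gap. A sketch of deriving the ground-truth ranking, using the usual gold-then-silver-then-bronze convention and made-up tallies:

```python
def rank_teams(medals):
    """Order teams by (gold, silver, bronze), best first."""
    return sorted(medals, key=lambda team: medals[team], reverse=True)

def ranking_question(medals, place):
    """Ground truth for 'which team finished in position `place`?'"""
    return rank_teams(medals)[place - 1]

# Hypothetical tallies: (gold, silver, bronze) per team.
tallies = {"A": (10, 5, 3), "B": (10, 4, 7), "C": (8, 9, 9)}
ranking = rank_teams(tallies)
```

Model answers to both count and ranking questions are then scored against these derived ground truths.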
Citations: 0
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Pub Date: 2024-09-10 | DOI: arxiv-2409.06820
Ilya Gusev
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
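The validation step (correlating the judge model's scores with human annotations over the same dialogues) can be sketched with a rank correlation; the scores below are made up, and the tie-free Spearman implementation is a simplification:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (no tie handling, for brevity)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Hypothetical per-dialogue scores from the judge model and human annotators.
judge_scores = [4.5, 3.0, 2.0, 5.0, 1.0]
human_scores = [4.0, 3.5, 2.5, 5.0, 1.5]
rho = spearman(judge_scores, human_scores)  # 1.0 here: identical rankings
```

A high correlation across criteria is what licenses replacing human annotation with the automated judge.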
Citations: 0
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
Pub Date: 2024-09-10 | DOI: arxiv-2409.06624
Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji
Large Language Models (LLMs) often need Continual Pre-Training (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT demands careful choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, or between the experimental scaling law and the actual deployment at full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) at the 8B size, which directly indicates the optimal experimental setup. Through careful choice of hyper-parameters and subsequent fine-tuning, the model capability is improved not only on Chinese-related benchmarks but also in specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM on a real-life chat system and obtain satisfying performance.
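The mixture-ratio hyper-parameter can be illustrated as a sampling rule over two corpora: each training example comes from the additional-language corpus with probability equal to the ALMR. The toy corpora and function name are assumptions for illustration:

```python
import random

def sample_batch(original, extra, almr, batch_size, rng):
    """Draw each example from `extra` with probability `almr`,
    otherwise from `original` (a sketch of ALMR-controlled mixing)."""
    return [rng.choice(extra) if rng.random() < almr else rng.choice(original)
            for _ in range(batch_size)]

rng = random.Random(0)
batch = sample_batch(["en_1", "en_2"], ["zh_1", "zh_2"], almr=0.3,
                     batch_size=1000, rng=rng)
share = sum(x.startswith("zh") for x in batch) / len(batch)  # close to 0.3
```

The paper's study is then a sweep over (ALMR, LR) pairs at the 8B size to pick the setup carried to 70B.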
大型语言模型(LLM)通常需要经过持续预训练(CPT)才能获得陌生语言技能或适应新领域。CPT 高昂的训练成本往往要求对关键超参数(如额外语言或领域语料的混合比)进行谨慎选择。然而,目前还没有系统性的研究来弥合最佳混合比与实际模型性能之间的差距,以及实验缩放规律与实际部署全尺寸模型之间的差距。在本文中,我们对 Llama-3 8B 和 70B 进行了 CPT,以增强其中文能力。我们研究了 8B 大小的附加语言混合比(ALMR)和学习率(LR)之间的最佳相关性,这直接表明了最佳的实验设置。通过对超参数的全面选择和后续的微调,模型的能力不仅在与中文相关的基准测试中得到了提高,而且在数学、编码和情感智能等一些特定领域中也得到了提高。我们在一个真实的聊天系统上部署了最终的 70B 版本的 LLM,并获得了令人满意的性能。
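The ALMR/LR study described in the abstract can be pictured as a hyper-parameter sweep: train at each (ALMR, LR) setting and keep the pair with the lowest validation loss. The snippet below is only a toy sketch under assumed names (`sweep`, `pick_optimal_almr`, `toy_loss` are illustrative inventions, not the authors' code); a real run would launch a full CPT job per setting rather than evaluate a proxy function.

```python
import itertools

def sweep(train_fn, almrs, lrs):
    """Evaluate train_fn(almr, lr) -> validation loss over the full grid."""
    return {(a, l): train_fn(a, l) for a, l in itertools.product(almrs, lrs)}

def pick_optimal_almr(results):
    """Return the (almr, lr) pair with the lowest recorded loss."""
    return min(results, key=results.get)

# Hypothetical proxy loss, minimized at ALMR=0.3 and LR=1e-4,
# used only to exercise the search logic.
def toy_loss(almr, lr):
    return (almr - 0.3) ** 2 + (lr - 1e-4) ** 2 * 1e6

results = sweep(toy_loss, almrs=[0.1, 0.2, 0.3, 0.4], lrs=[5e-5, 1e-4, 2e-4])
best = pick_optimal_almr(results)
print(best)  # -> (0.3, 0.0001)
```

In practice the paper's point is that such a sweep is only affordable at the 8B size, with the resulting ALMR/LR relationship then transferred to the 70B model.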
{"title":"A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio","authors":"Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji","doi":"arxiv-2409.06624","DOIUrl":"https://doi.org/arxiv-2409.06624","url":null,"abstract":"Large Language Models (LLM) often needs to be Continual Pre-Trained (CPT) to\u0000obtain the unfamiliar language skill or adapt into new domains. The huge\u0000training cost of CPT often asks for cautious choice of key hyper-parameters\u0000such as the mixture ratio of extra language or domain corpus. However, there is\u0000no systematic study which bridge the gap between the optimal mixture ratio and\u0000the actual model performance, and the gap between experimental scaling law and\u0000the actual deployment in the full model size. In this paper, we perform CPT on\u0000Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal\u0000correlation between the Additional Language Mixture Ratio (ALMR) and the\u0000Learning Rate (LR) on the 8B size which directly indicate the optimal\u0000experimental set up. By thorough choice of hyper-parameter, and subsequent\u0000fine-tuning, the model capability is improved not only on the Chinese-related\u0000benchmark, but also some specific domains including math, coding and emotional\u0000intelligence. We deploy the final 70B version of LLM on an real-life chat\u0000system which obtain satisfying performance.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0