Recent state-of-the-art authorship attribution (AA) methods learn authorship representations of texts in a latent, non-interpretable space, hindering their usability in real-world applications. Our work proposes a novel approach to interpreting these learned embeddings by identifying representative points in the latent space and using LLMs to generate informative natural language descriptions of the writing style of each point. We evaluate the alignment of our interpretable space with the latent one and find that it achieves the best prediction agreement compared to other baselines. Additionally, we conduct a human evaluation to assess the quality of these style descriptions, validating their utility as explanations for the latent space. Finally, we investigate whether human performance on the challenging AA task improves when aided by our system's explanations, finding an average improvement of around +20% in accuracy.
{"title":"Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution","authors":"Milad Alshomary, Narutatsu Ri, Marianna Apidianaki, Ajay Patel, Smaranda Muresan, Kathleen McKeown","doi":"arxiv-2409.07072","DOIUrl":"https://doi.org/arxiv-2409.07072","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184567"}
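The idea in the entry above of attaching style descriptions to representative points in the latent space can be illustrated with a small lookup. This is only a sketch under assumptions: the two-dimensional vectors, the cluster names, the style descriptions, and the use of cosine similarity are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_representative(embedding, representatives):
    """Return the name of the representative point most similar to the embedding.

    representatives maps a name to a (vector, style_description) pair.
    """
    return max(representatives, key=lambda name: cosine(embedding, representatives[name][0]))

# Hypothetical representative points with made-up style descriptions.
representatives = {
    "cluster_a": ([1.0, 0.0], "short sentences, informal tone"),
    "cluster_b": ([0.0, 1.0], "long sentences, formal vocabulary"),
}

document_embedding = [0.9, 0.2]  # assumed output of some authorship encoder
name = nearest_representative(document_embedding, representatives)
explanation = representatives[name][1]
```

Here the description of the nearest representative point stands in for an explanation of where the document sits in the latent space.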
Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework that tests different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest first using a fixed LLM as a user agent that engages with an LLM to collect dialogues under different tasks. Challenging dialogue scripts are then extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks, with their unique natures, continue to pose a significant challenge, and they show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.
{"title":"SimulBench: Evaluating Language Models with Creative Simulation Tasks","authors":"Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin","doi":"arxiv-2409.07641","DOIUrl":"https://doi.org/arxiv-2409.07641","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184440"}
The rapid advancement of artificial intelligence systems has brought the challenge of AI alignment to the forefront of research, particularly in complex decision-making and task execution. As these systems surpass human-level performance in sophisticated problems, ensuring their alignment with human values, intentions, and ethical guidelines becomes crucial. Building on previous work in explanation generation for human-agent alignment, we address the more complex dynamics of multi-agent systems and human-AI teams. This paper introduces a novel approach to model alignment through weak-to-strong generalization in the context of language models. We present a framework where a strong model facilitates the improvement of a weaker model, bridging the gap between explanation generation and model alignment. Our method, formalized as a facilitation function, allows for the transfer of capabilities from advanced models to less capable ones without direct access to extensive training data. Our results suggest that this facilitation-based approach not only enhances model performance but also provides insights into the nature of model alignment and the potential for scalable oversight of AI systems.
{"title":"Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization","authors":"Mehrdad Zakershahrak, Samira Ghodratnama","doi":"arxiv-2409.07335","DOIUrl":"https://doi.org/arxiv-2409.07335","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"34 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184359"}
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
{"title":"Gated Slot Attention for Efficient Linear-Time Sequence Modeling","authors":"Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu","doi":"arxiv-2409.07146","DOIUrl":"https://doi.org/arxiv-2409.07146","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"34 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184478"}
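The phrases "adaptive forgetting", "context-aware memory reading", and "compact recurrent state size" in the Gated Slot Attention entry above can be made concrete with a toy recurrent step. The sketch below is an assumption about the general shape of such an update and is not the authors' implementation: a fixed number of memory slots is decayed by per-slot gates, the new key and value are written in, and the output is read with a softmax over slot scores. The gate values are passed in by hand here; how they are computed is left open.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_slot_step(K, V, q, k, v, alpha):
    """One recurrent step over a bounded memory of len(K) slots.

    K, V  : slot memories, each a list of slots holding d-dimensional rows
    q,k,v : query, key, value vectors for the current token
    alpha : per-slot forget gates in (0, 1); assumed to be given
    """
    slots = len(K)
    # Adaptive forgetting: decay each slot, then write the new key/value into it.
    K = [[alpha[i] * K[i][j] + (1 - alpha[i]) * k[j] for j in range(len(k))] for i in range(slots)]
    V = [[alpha[i] * V[i][j] + (1 - alpha[i]) * v[j] for j in range(len(v))] for i in range(slots)]
    # Context-aware read: softmax over slot scores, then a weighted sum of value slots.
    weights = softmax([sum(K[i][j] * q[j] for j in range(len(q))) for i in range(slots)])
    out = [sum(weights[i] * V[i][j] for i in range(slots)) for j in range(len(v))]
    return K, V, out

# Two slots, two dimensions, empty initial memory.
K0 = [[0.0, 0.0], [0.0, 0.0]]
V0 = [[0.0, 0.0], [0.0, 0.0]]
K1, V1, out = gated_slot_step(K0, V0, q=[1.0, 0.0], k=[1.0, 0.0], v=[0.0, 2.0], alpha=[0.5, 0.5])
```

The state passed between steps stays at a fixed number of slots regardless of sequence length, which is the property the abstract refers to as a compact recurrent state.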
This paper introduces a system using generative AI agents to create tip sheets for investigative data reporting. Our system employs three specialized agents--an analyst, a reporter, and an editor--to collaboratively generate and refine tips from datasets. We validate this approach using real-world investigative stories, demonstrating that our agent-based system generally generates more newsworthy and accurate insights compared to a baseline model without agents, although some variability was noted between different stories. Our findings highlight the potential of generative AI to provide leads for investigative data reporting.
{"title":"Using Generative Agents to Create Tip Sheets for Investigative Data Reporting","authors":"Joris Veerbeek, Nicholas Diakopoulos","doi":"arxiv-2409.07286","DOIUrl":"https://doi.org/arxiv-2409.07286","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"6 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184402"}
Guimin Hu, Yi Xin, Weimin Lyu, Haojian Huang, Chang Sun, Zhihong Zhu, Lin Gui, Ruichu Cai
Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents recent trends in multimodal affective computing from an NLP perspective through four prominent tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across various tasks, offering a comprehensive report on recent progress in multimodal affective computing from an NLP perspective. The survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. It also briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes, as well as the technical approaches, challenges, and future directions in the field. To support further research, we released a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community.
{"title":"Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective","authors":"Guimin Hu, Yi Xin, Weimin Lyu, Haojian Huang, Chang Sun, Zhihong Zhu, Lin Gui, Ruichu Cai","doi":"arxiv-2409.07388","DOIUrl":"https://doi.org/arxiv-2409.07388","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"23 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184448"}
Developing a consistent and reliable AI game master for text-based games is a challenging task due to the limitations of large language models (LLMs) and the complexity of the game master's role. This paper presents a novel approach to enhance AI game masters by leveraging function calling in the context of the table-top role-playing game "Jim Henson's Labyrinth: The Adventure Game." Our methodology involves integrating game-specific controls through functions, which we show improves the narrative quality and state update consistency of the AI game master. The experimental results, based on human evaluations and unit tests, demonstrate the effectiveness of our approach in enhancing gameplay experience and maintaining coherence with the game state. This work contributes to the advancement of game AI and interactive storytelling, offering insights into the design of more engaging and consistent AI-driven game masters.
{"title":"You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling","authors":"Jaewoo Song, Andrew Zhu, Chris Callison-Burch","doi":"arxiv-2409.06949","DOIUrl":"https://doi.org/arxiv-2409.06949","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"157 1","pages":""},"publicationDate":"2024-09-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184357"}
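The entry above describes routing game-specific controls through functions so that state updates stay consistent. A minimal sketch of that pattern follows; the function names (`add_item`, `advance_clock`), the state fields, and the JSON call format are hypothetical illustrations, not the paper's actual interface. The model is assumed to emit a JSON function call, and only the registered functions are allowed to change the game state.

```python
import json

# Hypothetical game state; the starting value of 13 hours is only illustrative.
game_state = {"hours_left": 13, "inventory": []}

def add_item(state, item):
    """Add an item to the player's inventory."""
    state["inventory"].append(item)
    return f"Added {item}."

def advance_clock(state, hours):
    """Consume in-game time, never going below zero."""
    state["hours_left"] = max(0, state["hours_left"] - hours)
    return f"{state['hours_left']} hours remain."

# Registry of functions the model is allowed to call.
FUNCTIONS = {"add_item": add_item, "advance_clock": advance_clock}

def dispatch(state, call_json):
    """Parse a JSON function call (assumed format) and apply it to the state."""
    call = json.loads(call_json)
    func = FUNCTIONS.get(call["name"])
    if func is None:
        return f"Unknown function: {call['name']}"
    return func(state, **call["arguments"])

result = dispatch(game_state, '{"name": "add_item", "arguments": {"item": "lantern"}}')
```

Because every change goes through `dispatch`, the state can be checked with ordinary unit tests independently of the narrative text the model produces.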
Large language models (LLMs) have become a dominant approach in natural language processing, yet their internal knowledge structures remain largely unexplored. In this paper, we analyze the internal knowledge structures of LLMs using historical medal tallies from the Olympic Games. We task the models with providing the medal counts for each team and identifying which teams achieved specific rankings. Our results reveal that while state-of-the-art LLMs perform remarkably well in reporting medal counts for individual teams, they struggle significantly with questions about specific rankings. This suggests that the internal knowledge structures of LLMs are fundamentally different from those of humans, who can easily infer rankings from known medal counts. To support further research, we publicly release our code, dataset, and model outputs.
{"title":"Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games","authors":"Juhwan Choi, YoungBin Kim","doi":"arxiv-2409.06518","DOIUrl":"https://doi.org/arxiv-2409.06518","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"68 1","pages":""},"publicationDate":"2024-09-10","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184367"}
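The entry above notes that rankings can be inferred directly from known medal counts. The short sketch below shows that inference, assuming the common gold-first ordering (gold, then silver, then bronze); the ordering convention, team names, and medal numbers are made up for illustration and do not come from the paper's dataset.

```python
def rank_teams(medals):
    """Order teams by (gold, silver, bronze), highest first.

    medals maps a team name to a (gold, silver, bronze) tuple.
    Gold-first ordering is an assumption; other conventions rank by total medals.
    """
    return sorted(medals, key=lambda team: medals[team], reverse=True)

def team_at_rank(medals, rank):
    """Return the team holding a given 1-based rank."""
    return rank_teams(medals)[rank - 1]

# Fictional medal table.
medals = {
    "Team A": (10, 5, 2),
    "Team B": (12, 1, 0),
    "Team C": (10, 6, 1),
}

ranking = rank_teams(medals)
second_place = team_at_rank(medals, 2)
```

Answering "which team ranked second" is a sort over the same facts needed to answer "how many medals did each team win", which is why the reported gap between the two question types is notable.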
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
{"title":"PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation","authors":"Ilya Gusev","doi":"arxiv-2409.06820","DOIUrl":"https://doi.org/arxiv-2409.06820","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"publicationDate":"2024-09-10","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184358"}
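The three-component framework in the entry above (player, interrogator, judge) can be sketched as a simple loop. The version below is a hypothetical outline with stub functions standing in for the language models; the function signatures, the turn structure, and the judge's output format are assumptions made for illustration.

```python
def run_dialogue(player, interrogator, judge, turns=2):
    """Alternate interrogator and player messages, then score the dialogue.

    player, interrogator : callables taking the history and returning a message
    judge                : callable taking the history and returning a score object
    """
    history = []
    for _ in range(turns):
        user_message = interrogator(history)   # emulated user speaks first
        history.append(("user", user_message))
        reply = player(history)                 # role-playing model answers in character
        history.append(("character", reply))
    return history, judge(history)

# Stub models; a real setup would call language models here.
def stub_interrogator(history):
    return f"Question {len(history) // 2 + 1}?"

def stub_player(history):
    return "In character: " + history[-1][1]

def stub_judge(history):
    return {"turns": len(history) // 2}

history, score = run_dialogue(stub_player, stub_interrogator, stub_judge, turns=2)
```

Keeping the three roles as interchangeable callables makes it straightforward to swap the player model while holding the interrogator and judge fixed.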
Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji
Large Language Models (LLMs) often need to be Continually Pre-Trained (CPT) to acquire an unfamiliar language skill or adapt to new domains. The high training cost of CPT calls for a careful choice of key hyper-parameters, such as the mixture ratio of the extra language or domain corpus. However, no systematic study bridges the gap between the optimal mixture ratio and actual model performance, or the gap between experimental scaling laws and actual deployment at full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. On the 8B model, we study the correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR), which directly indicates the optimal experimental setup. With careful hyper-parameter selection and subsequent fine-tuning, model capability improves not only on Chinese-related benchmarks but also in specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM in a real-life chat system, which achieves satisfactory performance.
{"title":"A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio","authors":"Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji","doi":"arxiv-2409.06624","DOIUrl":"https://doi.org/arxiv-2409.06624","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"67 1","pages":""},"publicationDate":"2024-09-10","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142184362"}
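To make the mixture-ratio hyper-parameter in the entry above concrete, the sketch below assembles a training batch in which a fixed fraction of samples comes from the additional-language corpus. It assumes ALMR means the fraction of training samples drawn from that corpus, which is one plausible reading of the abstract; the function name, corpus contents, and sampling scheme are hypothetical.

```python
import random

def mix_corpora(base_docs, extra_docs, almr, total, seed=0):
    """Build a batch of `total` documents with a share `almr` from extra_docs.

    almr is assumed to be the fraction of samples from the additional language.
    """
    if not 0.0 <= almr <= 1.0:
        raise ValueError("almr must be between 0 and 1")
    rng = random.Random(seed)          # seeded for reproducible sampling
    n_extra = round(total * almr)      # samples from the additional language
    n_base = total - n_extra           # samples from the original corpus
    batch = [rng.choice(extra_docs) for _ in range(n_extra)]
    batch += [rng.choice(base_docs) for _ in range(n_base)]
    rng.shuffle(batch)                 # interleave the two sources
    return batch

# Toy corpora standing in for English and Chinese documents.
batch = mix_corpora(["en1", "en2"], ["zh1", "zh2"], almr=0.3, total=10)
n_chinese = sum(doc.startswith("zh") for doc in batch)
```

Sweeping `almr` together with the learning rate on a smaller model, as the abstract describes for the 8B size, would then amount to calling a routine like this with different ratios.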