
arXiv - CS - Computation and Language: Latest Publications

RUIE: Retrieval-based Unified Information Extraction using Large Language Model
Pub Date : 2024-09-18 DOI: arxiv-2409.11673
Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang
Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the importance of its key components.
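Since the trained retriever is a standard bi-encoder at inference time, the demonstration-selection step can be pictured in a few lines. A minimal sketch, assuming an embedding model with a sentence-transformers-style encode() method; the paper's actual training signal (LLM preferences, the keyword-enhanced reward model, contrastive learning and distillation) is not reproduced here.

```python
# Minimal sketch of demonstration selection with a trained bi-encoder.
# Assumes `model` exposes a sentence-transformers-style encode() method;
# the training recipe from the paper is not reproduced here.
import numpy as np

def embed(texts, model):
    """Encode texts into L2-normalized vectors."""
    vecs = np.asarray(model.encode(texts), dtype=np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def select_demonstrations(query, candidates, model, k=8):
    """Rank candidate (input, output) demonstrations by cosine similarity
    to the query and return the top-k for the in-context prompt."""
    q = embed([query], model)                                 # (1, d)
    c = embed([cand["input"] for cand in candidates], model)  # (n, d)
    scores = (c @ q.T).squeeze(1)                             # cosine similarities
    top = np.argsort(-scores)[:k]
    return [candidates[i] for i in top]
```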
Citations: 0
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning
Pub Date : 2024-09-18 DOI: arxiv-2409.12147
Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal
Large Language Models' (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among generated samples. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces 3 key challenges: (1) Excessive refinement: Uniformly refining all instances can over-correct and reduce the overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and correct their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non-trivial, and stopping too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained and iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: Solver, Reviewer (which generates targeted feedback based on step-wise RM scores), and the Refiner (which incorporates feedback). To ensure sufficient refinement, we re-evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
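Read as pseudocode, the abstract's control flow is a difficulty router plus a refinement loop. A minimal sketch under the assumption that solve, review, and refine are LLM-backed callables and reward_model returns an overall score plus step-wise scores; none of these names come from the paper's code.

```python
# Sketch of the coarse-to-fine control flow described in the abstract.
# solve/review/refine are hypothetical LLM-backed agents; reward_model is
# assumed to return (overall_score, per_step_scores) for a solution.
from collections import Counter

def magicore(problem, solve, reward_model, review, refine,
             n_samples=8, easy_threshold=0.8, max_rounds=3):
    samples = [solve(problem) for _ in range(n_samples)]
    overall = [reward_model(problem, s)[0] for s in samples]

    # Easy problem: coarse-grained aggregation (majority vote) suffices.
    if max(overall) >= easy_threshold:
        return Counter(samples).most_common(1)[0][0]

    # Hard problem: fine-grained, iterative multi-agent refinement.
    solution = samples[overall.index(max(overall))]
    for _ in range(max_rounds):
        score, step_scores = reward_model(problem, solution)
        if score >= easy_threshold:        # re-evaluation says good enough
            break
        feedback = review(problem, solution, step_scores)  # Reviewer
        solution = refine(problem, solution, feedback)     # Refiner
    return solution
```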
Citations: 0
LLMs in Education: Novel Perspectives, Challenges, and Opportunities
Pub Date : 2024-09-18 DOI: arxiv-2409.11917
Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar
The role of large language models (LLMs) in education is an increasing area of interest today, considering the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that the recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.
Citations: 0
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Pub Date : 2024-09-18 DOI: arxiv-2409.12183
Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
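The selective-application idea in the closing sentences can be made concrete. A sketch under the assumption that a surface cue such as an equals sign or explicit arithmetic flags the math/symbolic questions where CoT pays off; the routing heuristic is ours, not the paper's exact criterion, and llm is a placeholder completion call.

```python
# Sketch of selective CoT: spend chain-of-thought tokens only on questions
# with math/symbolic cues (the abstract highlights the equals sign). The
# heuristic below is illustrative, not the paper's exact criterion.
import re

MATH_CUE = re.compile(r"=|\d\s*[-+*/^]\s*\d|\b(solve|compute|calculate)\b", re.I)

def answer(question, llm):
    """llm(prompt) is a placeholder for any text-completion call."""
    if MATH_CUE.search(question):
        prompt = f"{question}\nLet's think step by step."  # CoT prompting
    else:
        prompt = f"{question}\nAnswer:"                    # direct answer
    return llm(prompt)
```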
Citations: 0
Qwen2.5-Coder Technical Report
Pub Date : 2024-09-18 DOI: arxiv-2409.12186
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretraining on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.
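For readers who want to try the models, a hedged loading sketch using the standard Hugging Face transformers API; the repository id below is an assumption and should be checked against the official Qwen organization on the Hub.

```python
# Hedged usage sketch via the standard Hugging Face transformers API.
# The repo id is assumed; verify the released checkpoint names on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "# Check whether a number is prime\ndef is_prime(n: int) -> bool:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```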
Citations: 0
GRIN: GRadient-INformed MoE
Pub Date : 2024-09-18 DOI: arxiv-2409.12136
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
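To see where the non-differentiability comes from, here is a minimal top-2 MoE layer in the standard baseline formulation; the topk() selection is the discrete routing step the abstract refers to. GRIN's sparse gradient estimator itself is not reproduced here.

```python
# Minimal top-2 MoE layer (PyTorch). The topk() selection is the discrete,
# non-differentiable routing step discussed in the abstract; GRIN's sparse
# gradient estimation for this step is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(2, dim=-1)     # discrete top-2 routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(2):                   # dispatch tokens to experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```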
Citations: 0
A Controlled Study on Long Context Extension and Generalization in LLMs
Pub Date : 2024-09-18 DOI: arxiv-2409.12181
Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
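As a concrete reference point for the perplexity finding, here is the standard strided-window way to score a causal LM on a long document; this follows the common Hugging Face-style evaluation recipe, not the paper's exact protocol.

```python
# Standard strided-window perplexity for a causal LM over a long document.
# Illustrates the general-purpose indicator the study reaffirms; this is
# the common evaluation recipe, not the paper's exact protocol.
import torch

def long_context_ppl(model, input_ids, window=4096, stride=2048):
    """input_ids: (1, seq_len) LongTensor; model: HF-style causal LM."""
    nlls, n_scored, prev_end = [], 0, 0
    for start in range(0, input_ids.size(1), stride):
        end = min(start + window, input_ids.size(1))
        trg_len = end - prev_end            # tokens not yet scored
        ids = input_ids[:, start:end]
        labels = ids.clone()
        labels[:, :-trg_len] = -100         # mask already-scored context
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * trg_len)
        n_scored += trg_len
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / n_scored)
```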
Citations: 0
You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL
Pub Date : 2024-09-18 DOI: arxiv-2409.12172
Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng
While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference cost and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performances with traditional systems on three benchmarks as well as its significant outperformance on large databases. Furthermore, YORO excels in handling questions with challenging value retrievals such as abbreviation.
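The token savings follow directly from what each paradigm puts in the prompt. A sketch with an invented toy schema; the prompt templates are illustrative, not the paper's.

```python
# Why YORO shortens inputs: a conventional text-to-SQL prompt re-encodes
# the schema for every question, while a YORO-style model has internalized
# it during training and reads only the question. Toy schema and templates
# are illustrative, not the paper's.

SCHEMA = """CREATE TABLE singer(id INT, name TEXT, age INT);
CREATE TABLE concert(id INT, singer_id INT, year INT);"""

def conventional_prompt(question: str) -> str:
    # Schema repeated on every call -> long inputs, repeated inference cost.
    return f"-- Database schema:\n{SCHEMA}\n-- Question: {question}\n-- SQL:"

def yoro_prompt(question: str) -> str:
    # Schema knowledge lives in the model's parameters after training.
    return f"-- Question: {question}\n-- SQL:"

q = "How many singers are older than 30?"
print(len(conventional_prompt(q)), "vs", len(yoro_prompt(q)), "characters")
```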
Citations: 0
Human-like Affective Cognition in Foundation Models
Pub Date : 2024-09-18 DOI: arxiv-2409.11733
Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman" -- they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
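The "better predict modal human judgements" comparison reduces to a simple metric. A sketch with an assumed data layout, purely for illustration:

```python
# Sketch of the agreement metric implied by the abstract: how often a
# model's label matches the modal (most common) human judgement per
# scenario. The data layout here is assumed for illustration.
from collections import Counter

def modal_agreement(model_labels, human_labels_per_item):
    hits = 0
    for pred, humans in zip(model_labels, human_labels_per_item):
        modal = Counter(humans).most_common(1)[0][0]
        hits += int(pred == modal)
    return hits / len(model_labels)

# Toy example: the model matches the modal human answer on 2 of 3 items.
print(modal_agreement(
    ["joy", "anger", "fear"],
    [["joy", "joy", "sadness"], ["anger", "anger"], ["sadness"]]))
```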
Citations: 0
TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Pub Date : 2024-09-18 DOI: arxiv-2409.11724
Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at https://github.com/XinyuanLu00/TART.
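The three components compose into a simple pipeline. A minimal sketch in which every callable is a hypothetical placeholder for an LLM-backed module or a sandboxed code runner; the prompts are illustrative, not TART's own.

```python
# Sketch of TART's three-stage pipeline as described in the abstract.
# llm() and execute() are hypothetical placeholders (an LLM call and a
# sandboxed code runner); the prompts are illustrative, not TART's own.

def tart_answer(table: str, question: str, llm, execute):
    # 1. Table formatter: produce an accurate, linearized representation.
    formatted = llm(f"Format this table faithfully:\n{table}")
    # 2. Tool maker: write a small program for the required computation.
    tool_code = llm(f"Table:\n{formatted}\nWrite code to answer: {question}")
    result = execute(tool_code)
    # 3. Explanation generator: keep the reasoning explainable.
    explanation = llm(f"Question: {question}\nResult: {result}\n"
                      f"Explain the steps taken.")
    return result, explanation
```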
Citations: 0