RUIE: Retrieval-based Unified Information Extraction using Large Language Model
Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang
Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the importance of its key components.
{"title":"RUIE: Retrieval-based Unified Information Extraction using Large Language Model","authors":"Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang","doi":"arxiv-2409.11673","DOIUrl":"https://doi.org/arxiv-2409.11673","url":null,"abstract":"Unified information extraction (UIE) aims to complete all information\u0000extraction tasks using a single model or framework. While previous work has\u0000primarily focused on instruction-tuning large language models (LLMs) with\u0000constructed datasets, these methods require significant computational resources\u0000and struggle to generalize to unseen tasks. To address these limitations, we\u0000propose RUIE (Retrieval-based Unified Information Extraction), a framework that\u0000leverages in-context learning to enable rapid generalization while reducing\u0000computational costs. The key challenge in RUIE is selecting the most beneficial\u0000demonstrations for LLMs to effectively handle diverse IE tasks. To achieve\u0000this, we integrate LLM preferences for ranking candidate demonstrations and\u0000design a keyword-enhanced reward model to capture fine-grained relationships\u0000between queries and demonstrations. We then train a bi-encoder retriever for\u0000UIE through contrastive learning and knowledge distillation. To the best of our\u0000knowledge, RUIE is the first trainable retrieval framework for UIE.\u0000Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in\u0000generalizing to unseen tasks, with average F1-score improvements of 19.22 and\u00003.13 compared to instruction-tuning methods and other retrievers, respectively.\u0000Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the\u0000importance of its key components.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning
Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal
Large language model (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among them. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces three key challenges: (1) Excessive refinement: uniformly refining all instances can over-correct and reduce overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and fix their own mistakes. (3) Insufficient refinement: deciding how many iterations of refinement are needed is non-trivial, and stopping too soon can leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained, iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. To ensure effective refinement, we employ a multi-agent loop with three agents: a Solver, a Reviewer (which generates targeted feedback based on the step-wise RM scores), and a Refiner (which incorporates that feedback). To ensure sufficient refinement, we re-evaluate updated solutions and iteratively initiate further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across five math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike the iterative-refinement baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
{"title":"MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning","authors":"Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal","doi":"arxiv-2409.12147","DOIUrl":"https://doi.org/arxiv-2409.12147","url":null,"abstract":"Large Language Models' (LLM) reasoning can be improved using test-time\u0000aggregation strategies, i.e., generating multiple samples and voting among\u0000generated samples. While these improve performance, they often reach a\u0000saturation point. Refinement offers an alternative by using LLM-generated\u0000feedback to improve solution quality. However, refinement introduces 3 key\u0000challenges: (1) Excessive refinement: Uniformly refining all instances can\u0000over-correct and reduce the overall performance. (2) Inability to localize and\u0000address errors: LLMs have a limited ability to self-correct and struggle to\u0000identify and correct their own mistakes. (3) Insufficient refinement: Deciding\u0000how many iterations of refinement are needed is non-trivial, and stopping too\u0000soon could leave errors unaddressed. To tackle these issues, we propose\u0000MAgICoRe, which avoids excessive refinement by categorizing problem difficulty\u0000as easy or hard, solving easy problems with coarse-grained aggregation and hard\u0000ones with fine-grained and iterative multi-agent refinement. To improve error\u0000localization, we incorporate external step-wise reward model (RM) scores.\u0000Moreover, to ensure effective refinement, we employ a multi-agent loop with\u0000three agents: Solver, Reviewer (which generates targeted feedback based on\u0000step-wise RM scores), and the Refiner (which incorporates feedback). To ensure\u0000sufficient refinement, we re-evaluate updated solutions, iteratively initiating\u0000further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5\u0000and show its effectiveness across 5 math datasets. Even one iteration of\u0000MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by\u00004.0% while using less than half the samples. Unlike iterative refinement with\u0000baselines, MAgICoRe continues to improve with more iterations. Finally, our\u0000ablations highlight the importance of MAgICoRe's RMs and multi-agent\u0000communication.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLMs in Education: Novel Perspectives, Challenges, and Opportunities
Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar
The role of large language models (LLMs) in education is an area of increasing interest, given the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.
{"title":"LLMs in Education: Novel Perspectives, Challenges, and Opportunities","authors":"Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar","doi":"arxiv-2409.11917","DOIUrl":"https://doi.org/arxiv-2409.11917","url":null,"abstract":"The role of large language models (LLMs) in education is an increasing area\u0000of interest today, considering the new opportunities they offer for teaching,\u0000learning, and assessment. This cutting-edge tutorial provides an overview of\u0000the educational applications of NLP and the impact that the recent advances in\u0000LLMs have had on this field. We will discuss the key challenges and\u0000opportunities presented by LLMs, grounding them in the context of four major\u0000educational applications: reading, writing, and speaking skills, and\u0000intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for\u0000researchers and practitioners interested in the educational applications of NLP\u0000and the role LLMs have to play in this area. It is the first of its kind to\u0000address this timely topic.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
{"title":"To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning","authors":"Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett","doi":"arxiv-2409.12183","DOIUrl":"https://doi.org/arxiv-2409.12183","url":null,"abstract":"Chain-of-thought (CoT) via prompting is the de facto method for eliciting\u0000reasoning capabilities from large language models (LLMs). But for what kinds of\u0000tasks is this extra ``thinking'' really helpful? To analyze this, we conducted\u0000a quantitative meta-analysis covering over 100 papers using CoT and ran our own\u0000evaluations of 20 datasets across 14 models. Our results show that CoT gives\u0000strong performance benefits primarily on tasks involving math or logic, with\u0000much smaller gains on other types of tasks. On MMLU, directly generating the\u0000answer without CoT leads to almost identical accuracy as CoT unless the\u0000question or model's response contains an equals sign, indicating symbolic\u0000operations and reasoning. Following this finding, we analyze the behavior of\u0000CoT on these problems by separating planning and execution and comparing\u0000against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic\u0000execution, but it underperforms relative to using a symbolic solver. Our\u0000results indicate that CoT can be applied selectively, maintaining performance\u0000while saving inference costs. Furthermore, they suggest a need to move beyond\u0000prompt-based CoT to new paradigms that better leverage intermediate computation\u0000across the whole range of LLM applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qwen2.5-Coder Technical Report
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. The series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continually pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming even larger models. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.
{"title":"Qwen2.5-Coder Technical Report","authors":"Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12186","DOIUrl":"https://doi.org/arxiv-2409.12186","url":null,"abstract":"In this report, we introduce the Qwen2.5-Coder series, a significant upgrade\u0000from its predecessor, CodeQwen1.5. This series includes two models:\u0000Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model,\u0000Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained\u0000on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning,\u0000scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder\u0000demonstrates impressive code generation capabilities while retaining general\u0000versatility. The model has been evaluated on a wide range of code-related\u0000tasks, achieving state-of-the-art (SOTA) performance across more than 10\u0000benchmarks, including code generation, completion, reasoning, and repair,\u0000consistently outperforming larger models of the same model size. We believe\u0000that the release of the Qwen2.5-Coder series will not only push the boundaries\u0000of research in code intelligence but also, through its permissive licensing,\u0000encourage broader adoption by developers in real-world applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GRIN: GRadient-INformed MoE
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
{"title":"GRIN: GRadient-INformed MoE","authors":"Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen","doi":"arxiv-2409.12136","DOIUrl":"https://doi.org/arxiv-2409.12136","url":null,"abstract":"Mixture-of-Experts (MoE) models scale more effectively than dense models due\u0000to sparse computation through expert routing, selectively activating only a\u0000small subset of expert modules. However, sparse computation challenges\u0000traditional training practices, as discrete expert routing hinders standard\u0000backpropagation and thus gradient-based optimization, which are the cornerstone\u0000of deep learning. To better pursue the scaling power of MoE, we introduce GRIN\u0000(GRadient-INformed MoE training), which incorporates sparse gradient estimation\u0000for expert routing and configures model parallelism to avoid token dropping.\u0000Applying GRIN to autoregressive language modeling, we develop a top-2\u000016$times$3.8B MoE model. Our model, with only 6.6B activated parameters,\u0000outperforms a 7B dense model and matches the performance of a 14B dense model\u0000trained on the same data. Extensive evaluations across diverse tasks\u0000demonstrate the potential of GRIN to significantly enhance MoE efficacy,\u0000achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Controlled Study on Long Context Extension and Generalization in LLMs
Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
{"title":"A Controlled Study on Long Context Extension and Generalization in LLMs","authors":"Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush","doi":"arxiv-2409.12181","DOIUrl":"https://doi.org/arxiv-2409.12181","url":null,"abstract":"Broad textual understanding and in-context learning require language models\u0000that utilize full document contexts. Due to the implementation challenges\u0000associated with directly training long-context models, many methods have been\u0000proposed for extending models to handle long contexts. However, owing to\u0000differences in data and model classes, it has been challenging to compare these\u0000approaches, leading to uncertainty as to how to evaluate long-context\u0000performance and whether it differs from standard evaluation. We implement a\u0000controlled protocol for extension methods with a standardized evaluation,\u0000utilizing consistent base models and extension data. Our study yields several\u0000insights into long-context behavior. First, we reaffirm the critical role of\u0000perplexity as a general-purpose performance indicator even in longer-context\u0000tasks. Second, we find that current approximate attention methods\u0000systematically underperform across long-context tasks. Finally, we confirm that\u0000exact fine-tuning based methods are generally effective within the range of\u0000their extension, whereas extrapolation remains challenging. All codebases,\u0000models, and checkpoints will be made available open-source, promoting\u0000transparency and facilitating further research in this critical area of AI\u0000development.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL
Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng
While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference cost and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length, by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performance with traditional systems on three benchmarks, as well as its significant outperformance on large databases. Furthermore, YORO excels at handling questions that require challenging value retrieval, such as abbreviations.
{"title":"You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL","authors":"Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng","doi":"arxiv-2409.12172","DOIUrl":"https://doi.org/arxiv-2409.12172","url":null,"abstract":"While significant progress has been made on the text-to-SQL task, recent\u0000solutions repeatedly encode the same database schema for every question,\u0000resulting in unnecessary high inference cost and often overlooking crucial\u0000database knowledge. To address these issues, we propose You Only Read Once\u0000(YORO), a novel paradigm that directly internalizes database knowledge into the\u0000parametric knowledge of a text-to-SQL model during training and eliminates the\u0000need for schema encoding during inference. YORO significantly reduces the input\u0000token length by 66%-98%. Despite its shorter inputs, our empirical results\u0000demonstrate YORO's competitive performances with traditional systems on three\u0000benchmarks as well as its significant outperformance on large databases.\u0000Furthermore, YORO excels in handling questions with challenging value\u0000retrievals such as abbreviation.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human-like Affective Cognition in Foundation Models
Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, infer situations from emotions, and carry out a variety of other forms of affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show that foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman": they predict modal human judgements better than the average human does. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
{"title":"Human-like Affective Cognition in Foundation Models","authors":"Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman","doi":"arxiv-2409.11733","DOIUrl":"https://doi.org/arxiv-2409.11733","url":null,"abstract":"Understanding emotions is fundamental to human interaction and experience.\u0000Humans easily infer emotions from situations or facial expressions, situations\u0000from emotions, and do a variety of other emph{affective cognition}. How adept\u0000is modern AI at these inferences? We introduce an evaluation framework for\u0000testing affective cognition in foundation models. Starting from psychological\u0000theory, we generate 1,280 diverse scenarios exploring relationships between\u0000appraisals, emotions, expressions, and outcomes. We evaluate the abilities of\u0000foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across\u0000carefully selected conditions. Our results show foundation models tend to agree\u0000with human intuitions, matching or exceeding interparticipant agreement. In\u0000some conditions, models are ``superhuman'' -- they better predict modal human\u0000judgements than the average human. All models benefit from chain-of-thought\u0000reasoning. This suggests foundation models have acquired a human-like\u0000understanding of emotions and their influence on beliefs and behavior.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan
Current large language models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All code and data are available at https://github.com/XinyuanLu00/TART.
{"title":"TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning","authors":"Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan","doi":"arxiv-2409.11724","DOIUrl":"https://doi.org/arxiv-2409.11724","url":null,"abstract":"Current Large Language Models (LLMs) exhibit limited ability to understand\u0000table structures and to apply precise numerical reasoning, which is crucial for\u0000tasks such as table question answering (TQA) and table-based fact verification\u0000(TFV). To address these challenges, we introduce our Tool-Augmented Reasoning\u0000framework for Tables (TART), which integrates LLMs with specialized tools. TART\u0000contains three key components: a table formatter to ensure accurate data\u0000representation, a tool maker to develop specific computational tools, and an\u0000explanation generator to maintain explainability. We also present the TOOLTAB\u0000dataset, a new benchmark designed specifically for training LLMs in table-tool\u0000integration. Our experiments indicate that TART achieves substantial\u0000improvements over existing methods (e.g., Chain-of-Thought) by improving both\u0000the precision of data processing and the clarity of the reasoning process.\u0000Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the\u0000closed-sourced LLM GPT-3.5-turbo, highlighting its robustness in diverse\u0000real-world scenarios. All the code and data are available at\u0000https://github.com/XinyuanLu00/TART.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}