We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
{"title":"我们的数学你的大型多模态模型能实现类似人类的数学推理吗?","authors":"Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang","doi":"arxiv-2407.01284","DOIUrl":null,"url":null,"abstract":"Visual mathematical reasoning, as a fundamental visual reasoning ability, has\nreceived widespread attention from the Large Multimodal Models (LMMs)\ncommunity. Existing benchmarks, such as MathVista and MathVerse, focus more on\nthe result-oriented performance but neglect the underlying principles in\nknowledge acquisition and generalization. Inspired by human-like mathematical\nreasoning, we introduce WE-MATH, the first benchmark specifically designed to\nexplore the problem-solving principles beyond end-to-end performance. We\nmeticulously collect and categorize 6.5K visual math problems, spanning 67\nhierarchical knowledge concepts and five layers of knowledge granularity. We\ndecompose composite problems into sub-problems according to the required\nknowledge concepts and introduce a novel four-dimensional metric, namely\nInsufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery\n(CM), and Rote Memorization (RM), to hierarchically assess inherent issues in\nLMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of\nexisting LMMs in visual mathematical reasoning and reveal a negative\ncorrelation between solving steps and problem-specific performance. We confirm\nthe IK issue of LMMs can be effectively improved via knowledge augmentation\nstrategies. More notably, the primary challenge of GPT-4o has significantly\ntransitioned from IK to IG, establishing it as the first LMM advancing towards\nthe knowledge generalization stage. In contrast, other LMMs exhibit a marked\ninclination towards Rote Memorization - they correctly solve composite problems\ninvolving multiple knowledge concepts yet fail to answer sub-problems. We\nanticipate that WE-MATH will open new pathways for advancements in visual\nmathematical reasoning for LMMs. The WE-MATH data and evaluation code are\navailable at https://github.com/We-Math/We-Math.","PeriodicalId":501033,"journal":{"name":"arXiv - CS - Symbolic Computation","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?\",\"authors\":\"Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang\",\"doi\":\"arxiv-2407.01284\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual mathematical reasoning, as a fundamental visual reasoning ability, has\\nreceived widespread attention from the Large Multimodal Models (LMMs)\\ncommunity. Existing benchmarks, such as MathVista and MathVerse, focus more on\\nthe result-oriented performance but neglect the underlying principles in\\nknowledge acquisition and generalization. Inspired by human-like mathematical\\nreasoning, we introduce WE-MATH, the first benchmark specifically designed to\\nexplore the problem-solving principles beyond end-to-end performance. 
We\\nmeticulously collect and categorize 6.5K visual math problems, spanning 67\\nhierarchical knowledge concepts and five layers of knowledge granularity. We\\ndecompose composite problems into sub-problems according to the required\\nknowledge concepts and introduce a novel four-dimensional metric, namely\\nInsufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery\\n(CM), and Rote Memorization (RM), to hierarchically assess inherent issues in\\nLMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of\\nexisting LMMs in visual mathematical reasoning and reveal a negative\\ncorrelation between solving steps and problem-specific performance. We confirm\\nthe IK issue of LMMs can be effectively improved via knowledge augmentation\\nstrategies. More notably, the primary challenge of GPT-4o has significantly\\ntransitioned from IK to IG, establishing it as the first LMM advancing towards\\nthe knowledge generalization stage. In contrast, other LMMs exhibit a marked\\ninclination towards Rote Memorization - they correctly solve composite problems\\ninvolving multiple knowledge concepts yet fail to answer sub-problems. We\\nanticipate that WE-MATH will open new pathways for advancements in visual\\nmathematical reasoning for LMMs. The WE-MATH data and evaluation code are\\navailable at https://github.com/We-Math/We-Math.\",\"PeriodicalId\":501033,\"journal\":{\"name\":\"arXiv - CS - Symbolic Computation\",\"volume\":\"24 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Symbolic Computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.01284\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Symbolic Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.01284","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts yet fail to answer the sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
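
To make the four-dimensional metric concrete, below is a minimal sketch of how a single composite problem's outcome could be mapped onto IK/IG/CM/RM from the pass/fail results of its decomposed sub-problems. The decision rules are inferred from the definitions in the abstract, and the names (`ProblemResult`, `classify`) are illustrative; they are not taken from the We-Math evaluation code.

```python
from dataclasses import dataclass

@dataclass
class ProblemResult:
    """Outcome of one composite problem and its decomposed sub-problems."""
    composite_correct: bool
    sub_correct: list[bool]  # one entry per required knowledge concept

def classify(result: ProblemResult) -> str:
    """Map one result onto the four-dimensional metric (a plausible
    reading of the abstract, not the paper's exact scoring code):
      - CM: composite and all sub-problems pass (Complete Mastery).
      - RM: composite passes while a sub-problem fails, i.e. the
        headline answer is right without the underlying knowledge.
      - IG: all sub-problems pass but the composite fails, i.e. the
        knowledge is present but multi-step generalization is not.
      - IK: both the composite and some sub-problem fail.
    """
    all_subs = all(result.sub_correct)
    if result.composite_correct:
        return "CM" if all_subs else "RM"
    return "IG" if all_subs else "IK"

# Example: the model solves the composite problem but misses one of its
# two sub-problems -- flagged as Rote Memorization.
print(classify(ProblemResult(composite_correct=True,
                             sub_correct=[True, False])))  # -> "RM"
```

Aggregating these per-problem labels over the benchmark would then yield the population-level picture the abstract describes, e.g. a high RM share for models that answer composite problems without mastering their sub-problems.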