We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

arXiv - CS - Symbolic Computation. Pub Date: 2024-07-01. DOI: arxiv-2407.01284

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang


Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
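The four-dimensional metric can be made concrete with a small sketch. The abstract spells out only Rote Memorization (a composite problem solved correctly while at least one of its sub-problems is answered incorrectly). The other three rules below are an assumption about how the categories partition the remaining outcomes, and the names `classify` and `summarize` are hypothetical; this is not the evaluation code released in the WE-MATH repository.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def classify(sub_correct: Sequence[bool], composite_correct: bool) -> str:
    """Assign one composite problem to IK, IG, CM, or RM.

    Only the RM rule is stated in the abstract; the IK, IG, and CM
    rules are assumed so that the four labels cover every outcome.
    """
    all_subs_correct = all(sub_correct)
    if composite_correct:
        # Composite solved: mastery if every sub-problem is also solved,
        # otherwise the answer looks memorized rather than understood.
        return "CM" if all_subs_correct else "RM"
    # Composite failed: the knowledge is present but does not generalize
    # (IG), or some required knowledge concept is missing (IK).
    return "IG" if all_subs_correct else "IK"


def summarize(results: Iterable[Tuple[Sequence[bool], bool]]) -> Dict[str, float]:
    """Return the fraction of composite problems falling in each category."""
    labels = [classify(subs, comp) for subs, comp in results]
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {label: 0.0 for label in ("IK", "IG", "CM", "RM")}
    return {label: counts[label] / total for label in ("IK", "IG", "CM", "RM")}


if __name__ == "__main__":
    # Hypothetical model outputs: (sub-problem correctness, composite correctness)
    demo = [
        ([True, True], True),    # all sub-problems and composite solved
        ([True, False], True),   # composite solved despite a failed sub-problem
        ([True, True], False),   # sub-problems solved, composite failed
        ([False, True], False),  # a sub-problem and the composite both failed
    ]
    print(summarize(demo))
```

Under these assumed rules, a model whose failures concentrate in IG already answers the individual sub-problems correctly and fails to combine them, which is the pattern the abstract reports for GPT-4o.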
