Hong-Yu Guo, Chuang Wang, Fei Yin, Xiao-Hui Li, Cheng-Lin Liu
Title: Vision–language pre-training for graph-based handwritten mathematical expression recognition
DOI: 10.1016/j.patcog.2025.111346
Journal: Pattern Recognition, Volume 162, Article 111346 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 7.5)
Publication date: 2025-01-10 (Journal Article)
Article URL: https://www.sciencedirect.com/science/article/pii/S0031320325000068
Code: https://github.com/guohy17/VLPG
Citations: 0
Abstract
Vision–language pre-training models have shown promise in improving various downstream tasks. However, handwritten mathematical expression recognition (HMER), a typical structured learning problem, can hardly benefit from existing pre-training methods due to the presence of multiple symbols and complicated structural relationships, as well as the scarcity of paired data. To overcome these problems, we propose a Vision-Language Pre-training paradigm for Graph-based HMER (VLPG), which utilizes unpaired mathematical expression images and LaTeX labels. Our HMER model is built upon a graph parsing method with superior explainability, enhanced by the proposed graph-structure-aware transformer decoder. Based on this framework, a symbol localization pretext task and a language modeling task are employed for vision–language pre-training. First, we use unlabeled mathematical symbol images to pre-train the visual feature extractor through the localization pretext task, improving its symbol localization and discrimination ability. Second, the structure understanding module is pre-trained on LaTeX corpora through a language modeling task, which promotes the model's context comprehension ability. The pre-trained model is then fine-tuned and aligned on the downstream HMER task using benchmark datasets. Experiments on public datasets demonstrate that the pre-training paradigm significantly improves mathematical expression recognition performance. VLPG achieves state-of-the-art performance on the standard CROHME datasets and comparable performance on the HME100K dataset, highlighting the effectiveness and superiority of the proposed model. Our code is released at https://github.com/guohy17/VLPG.
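To make the language-modeling pre-training concrete, the sketch below illustrates one common way such a pretext task can be set up on LaTeX token sequences: randomly masking tokens and keeping the originals as prediction targets. This is a minimal, hypothetical illustration; the function name, mask token, and mask ratio are assumptions for exposition and are not taken from the paper, whose exact objective may differ.

```python
import random

# Hypothetical sketch of a masked-language-modeling pretext task on LaTeX
# tokens; the [MASK] token and 15% default ratio are assumptions, not
# details from the VLPG paper.
MASK = "[MASK]"

def mask_latex_tokens(tokens, mask_ratio=0.15, rng=None):
    """Replace a random fraction of LaTeX tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked
    position to its original token (the prediction targets).
    """
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            masked.append(MASK)
            targets[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

# Example: a tokenized LaTeX expression for (a + b) / 2.
tokens = r"\frac { a + b } { 2 }".split()
masked, targets = mask_latex_tokens(tokens, mask_ratio=0.3)
```

A structure-understanding module trained this way would be asked to predict the original token at each masked position from the surrounding context, which is one plausible instantiation of the context-comprehension objective the abstract describes.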
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.