Yi Ren, Chenglong Yu, Weibin Li, Wei Li, Zixuan Zhu, TianYi Zhang, ChenHao Qin, WenBo Ji, Jianjun Zhang
TableGPT: a novel table understanding method based on table recognition and large language model collaborative enhancement
Applied Intelligence 55(4), published 2025-01-14. DOI: 10.1007/s10489-024-05937-6
https://link.springer.com/article/10.1007/s10489-024-05937-6
Citations: 0
Abstract
Table images play a crucial role in storing structured information, which makes table image recognition an essential component in many fields. However, accurately recognizing the structure and text content of complex table images remains a challenge. Large language models (LLMs) have recently demonstrated exceptional capabilities across natural language processing tasks, so applying them to correct the structure and text content produced by table image recognition offers a novel solution. This paper introduces TableGPT, a method that combines table recognition with LLMs and develops a specialized multimodal agent to improve table image recognition. The approach has four stages. In the first stage, TableGPT_agent checks whether the input is a table image and, if it is, uses algorithms such as the transformer for preliminary recognition. In the second stage, the agent converts the recognition results into HTML and autonomously assesses whether corrections are needed; if they are, the data are passed to a trained LLM for more accurate table recognition and optimization. In the third stage, the agent evaluates user satisfaction through feedback and applies superresolution algorithms to low-quality images, since low image quality is often the main reason for user dissatisfaction. In the fourth stage, the agent feeds both the enhanced and the original images into the trained model and integrates the information to obtain the optimal table text representation. Our research shows that trained LLMs can effectively interpret table images, improving the Tree Edit Distance Similarity (TEDS) score by an average of 4% on both public and private datasets, even when built on the best current table recognition methods. They also perform better at correcting structural and textual errors.
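For reference, TEDS is usually defined in the table recognition literature as shown here. The abstract does not say which variant the authors compute, or whether the reported percentages are absolute or relative gains, so this is background rather than the paper's exact formula.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Here $T_a$ and $T_b$ are the HTML trees of the predicted and ground-truth tables, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in tree $T$. A score of 1 means the two tables match exactly in both structure and cell content.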
We also explore the impact of image superresolution on low-quality table images: combined with the LLMs, it increased our TEDS score by 54%, greatly improving recognition performance. Finally, by leveraging agent technology, our multimodal model improved table recognition performance, with the TEDS score of TableGPT_agent surpassing that of GPT-4 by 34%.
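The four-stage control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `TableAgent`, every field name, and the stub functions in the demo are hypothetical, and each stage is injected as a plain callable because the abstract does not specify the underlying models or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for an image; the paper's actual data types are not given.
Image = bytes


@dataclass
class TableAgent:
    """Illustrative four-stage flow; all component names are assumptions."""
    is_table: Callable[[Image], bool]        # stage 1: is the input a table image?
    recognize: Callable[[Image], str]        # stages 1-2: preliminary recognition, returned as HTML
    needs_correction: Callable[[str], bool]  # stage 2: agent decides whether to correct
    llm_correct: Callable[[str], str]        # stage 2: trained LLM corrects the HTML
    super_resolve: Callable[[Image], Image]  # stage 3: enhance a low-quality image
    fuse: Callable[[Image, Image], str]      # stage 4: model reads original + enhanced images

    def run(self, image: Image, user_satisfied: Callable[[str], bool]) -> Optional[str]:
        # Stage 1: reject inputs that are not table images.
        if not self.is_table(image):
            return None
        html = self.recognize(image)
        # Stage 2: correct structure and text only when the agent judges it necessary.
        if self.needs_correction(html):
            html = self.llm_correct(html)
        # Stage 3: user feedback decides whether the image needs enhancement.
        if user_satisfied(html):
            return html
        enhanced = self.super_resolve(image)
        # Stage 4: integrate information from both images.
        return self.fuse(image, enhanced)


# Demo with trivial stubs in place of real models (all behavior here is made up).
agent = TableAgent(
    is_table=lambda img: img.startswith(b"TABLE"),
    recognize=lambda img: "<table><tr><td>1O</td></tr></table>",
    needs_correction=lambda html: "1O" in html,
    llm_correct=lambda html: html.replace("1O", "10"),
    super_resolve=lambda img: img + b"-x2",
    fuse=lambda orig, enh: "<table><tr><td>10</td><td>20</td></tr></table>",
)

print(agent.run(b"TABLE-scan", user_satisfied=lambda html: True))
```

Passing each stage in as a callable keeps the sketch independent of any particular recognizer, LLM, or superresolution model, and makes the branching (non-table input, optional correction, feedback-driven enhancement) easy to exercise in isolation.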
About the journal:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.