TableGPT: a novel table understanding method based on table recognition and large language model collaborative enhancement

IF 3.4 | Zone 2 (Computer Science) | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Applied Intelligence | Pub Date: 2025-01-14 | DOI: 10.1007/s10489-024-05937-6

Yi Ren, Chenglong Yu, Weibin Li, Wei Li, Zixuan Zhu, TianYi Zhang, ChenHao Qin, WenBo Ji, Jianjun Zhang


Citations: 0

Abstract



In today's information age, table images play a crucial role in storing structured information, making table image recognition technology an essential component in many fields. However, accurately recognizing the structure and text content of various complex table images has remained a challenge. Recently, large language models (LLMs) have demonstrated exceptional capabilities in various natural language processing tasks. Therefore, applying LLMs to the correction tasks of structure and text content after table image recognition presents a novel solution.

This paper introduces a new method, TableGPT, which combines table recognition with LLMs and develops a specialized multimodal agent to enhance the effectiveness of table image recognition. Our approach is divided into four stages. In the first stage, TableGPT_agent initially evaluates whether the input is a table image and, upon confirmation, uses algorithms such as the transformer for preliminary recognition. In the second stage, the agent converts the recognition results into HTML format and autonomously assesses whether corrections are needed. If corrections are needed, the data are input into a trained LLM to achieve more accurate table recognition and optimization. In the third stage, the agent evaluates user satisfaction through feedback and applies superresolution algorithms to low-quality images, as this is often the main reason for user dissatisfaction. Finally, the agent inputs both the enhanced and original images into the trained model, integrating the information to obtain the optimal table text representation.

Our research shows that trained LLMs can effectively interpret table images, improving the Tree Edit Distance Similarity (TEDS) score by an average of 4% even when based on the best current table recognition methods, across both public and private datasets. They also demonstrate better performance in correcting structural and textual errors. We also explore the impact of image superresolution technology on low-quality table images. Combined with the LLMs, our TEDS score significantly increased by 54%, greatly enhancing the recognition performance. Finally, by leveraging agent technology, our multimodal model improved table recognition performance, with the TEDS score of TableGPT_agent surpassing that of GPT-4 by 34%.
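The four-stage flow described in the abstract can be pictured as a small piece of control logic. The sketch below is only an illustration of that flow, not the authors' implementation: the `Tools` container, every callable in it, and the `run_pipeline` function are hypothetical names introduced here, and the stand-in lambdas simply fake recognition, correction, and fusion so the branching can be followed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tools:
    """Hypothetical stand-ins for the components named in the abstract."""
    is_table: Callable[[bytes], bool]          # stage 1: is the input a table image?
    recognize: Callable[[bytes], str]          # stage 1: preliminary recognition -> HTML
    needs_correction: Callable[[str], bool]    # stage 2: agent's self-assessment
    llm_correct: Callable[[str], str]          # stage 2: trained LLM fixes the HTML
    user_satisfied: Callable[[str], bool]      # stage 3: user feedback
    super_resolve: Callable[[bytes], bytes]    # stage 3: enhance a low-quality image
    fuse: Callable[[bytes, bytes], str]        # stage 4: (original, enhanced) -> HTML


def run_pipeline(image: bytes, tools: Tools) -> Tuple[Optional[str], List[str]]:
    """Return the final HTML (or None) plus a trace of the steps taken."""
    # Stage 1: reject non-table inputs, otherwise run preliminary recognition.
    if not tools.is_table(image):
        return None, ["rejected: not a table image"]
    html = tools.recognize(image)
    trace = ["recognized"]

    # Stage 2: correct the HTML with the LLM only when the agent asks for it.
    if tools.needs_correction(html):
        html = tools.llm_correct(html)
        trace.append("corrected")

    # Stage 3: stop if the user is satisfied, otherwise enhance the image.
    if tools.user_satisfied(html):
        trace.append("accepted")
        return html, trace
    enhanced = tools.super_resolve(image)
    trace.append("super-resolved")

    # Stage 4: combine the original and enhanced images into a final result.
    html = tools.fuse(image, enhanced)
    trace.append("fused")
    return html, trace


# Toy stand-ins: the "recognizer" misreads 40 as 4O and the "LLM" repairs it.
tools = Tools(
    is_table=lambda img: img.startswith(b"TABLE"),
    recognize=lambda img: "<table><tr><td>4O</td></tr></table>",
    needs_correction=lambda html: "4O" in html,
    llm_correct=lambda html: html.replace("4O", "40"),
    user_satisfied=lambda html: True,
    super_resolve=lambda img: img,
    fuse=lambda orig, enh: "<table><tr><td>40</td></tr></table>",
)
html, trace = run_pipeline(b"TABLE-bytes", tools)
```

Passing the components in as callables keeps the control flow separate from any particular recognizer, LLM, or superresolution model, which is one plausible way to organize such an agent; the paper itself may structure it differently.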


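Source journal

Applied Intelligence (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 6.60
Self-citation rate: 20.80%
Articles published: 1361
Review time: 5.9 months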

About the journal: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.