End-to-end entity extraction from OCRed texts using summarization models

Neural Computing and Applications, Pub Date: 2024-09-19, DOI: 10.1007/s00521-024-10422-9

Pedro A. Villa-García, Raúl Alonso-Calvo, Miguel García-Remesal

Abstract

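We introduce a methodology for extracting entities from noisy scanned documents that uses end-to-end data and reformulates entity extraction as a text summarization problem. The approach offers two advantages over traditional entity extraction methods while maintaining comparable performance. First, it builds datasets from preexisting data, which eliminates labor-intensive annotation. Second, it employs multitask learning, so a model can be trained on a single dataset. To evaluate the approach against state-of-the-art methods, we adapted three commonly used datasets to the format our methodology requires: Conference on Natural Language Learning (CoNLL++), few-shot named entity recognition (Few-NERD), and WikiNEuRal domain adaptation (WikiNEuRal + DA). We then fine-tuned four sequence-to-sequence models: text-to-text transfer transformer (T5), fine-tuned language net T5 (FLAN-T5), bidirectional autoregressive transformer (BART), and pretraining with extracted gap sentences for abstractive summarization sequence-to-sequence models (PEGASUS). The results indicate that, in the absence of optical character recognition (OCR) noise, the BART model performs comparably to state-of-the-art methods. Furthermore, performance degradation was limited to 3.49–5.23% when 39–62% of the sentences contained OCR noise. This compares favorably with previous studies, which reported a 10–20% decrease in F1 score on texts with a 20% OCR error rate. Our experimental results demonstrate that a single model trained with our methodology can reliably extract entities from noisy OCRed texts, unlike existing state-of-the-art approaches, which require separate models for correcting OCR errors and for extracting entities.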
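To make the reformulation concrete, the sketch below shows one way a token-level, BIO-tagged sentence (the layout used by CoNLL-style corpora) could be turned into a source/target text pair for a sequence-to-sequence summarization model. The abstract does not specify the target format, so the `LABEL: text; LABEL: text` layout, the `"none"` placeholder, and the function name `bio_to_summary` are all assumptions made for illustration, not the authors' actual format.

```python
from typing import List, Tuple


def bio_to_summary(tokens: List[str], tags: List[str]) -> Tuple[str, str]:
    """Convert one BIO-tagged sentence into a (source, target) text pair.

    The target layout "LABEL: text; LABEL: text" is a hypothetical choice,
    not the format used in the paper.
    """
    entities = []          # collected (label, surface text) pairs
    current_tokens = []    # tokens of the entity being built
    current_label = None   # label of the entity being built

    def flush() -> None:
        # Close the entity in progress, if any, and reset the buffers.
        nonlocal current_tokens, current_label
        if current_tokens:
            entities.append((current_label, " ".join(current_tokens)))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tags, and "I-" tags that do not continue the open entity,
            # end the current entity and are otherwise ignored.
            flush()
    flush()

    source = " ".join(tokens)
    target = "; ".join(f"{label}: {text}" for label, text in entities) or "none"
    return source, target


source, target = bio_to_summary(
    ["Pedro", "works", "in", "Madrid", "."],
    ["B-PER", "O", "O", "B-LOC", "O"],
)
print(source)  # Pedro works in Madrid .
print(target)  # PER: Pedro; LOC: Madrid
```

Under this assumed setup, an OCR-noisy version of the sentence could serve as the source while the clean entity string stays the target, so a single model learns to read through the noise and list the entities in one step.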
