Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

ACM Multimedia Asia | Pub Date: 2021-11-24 | DOI: 10.1145/3469877.3490571

Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura


Citations: 1

Abstract

This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially with a Transformer-based encoder-decoder model. Training a highly accurate end-to-end model requires a large image-to-text paired dataset for the target language, but such data is difficult to collect, especially for resource-poor languages. To overcome this difficulty, the proposed method uses well-prepared large datasets in resource-rich languages, such as English, to train the encoder-decoder model for the resource-poor language. The key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in the resource-poor language alone. To this end, the method pre-trains the encoder on a multilingual dataset that combines the resource-poor and resource-rich language datasets, so that it learns language-invariant knowledge for scene text recognition. It also pre-trains the decoder on the resource-poor language’s dataset to make the decoder better suited to that language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
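
The data routing described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the names `Sample`, `Stage`, and `build_schedule` are hypothetical, and the final fine-tuning stage on the resource-poor data is an assumption, since the abstract states only that the encoder and decoder are pre-trained on different data.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sample type: (image path, transcript).
Sample = Tuple[str, str]


@dataclass
class Stage:
    name: str                 # label for the training stage
    keeps: Tuple[str, ...]    # model parts carried over from this stage
    data: List[Sample]        # paired data the stage trains on


def build_schedule(poor: List[Sample], rich: List[Sample]) -> List[Stage]:
    """Route datasets to stages as the abstract describes."""
    # Encoder pre-training sees both languages (language-invariant knowledge).
    multilingual = poor + rich
    return [
        Stage("pretrain_encoder", ("encoder",), multilingual),
        # Decoder pre-training sees only the resource-poor language.
        Stage("pretrain_decoder", ("decoder",), poor),
        # ASSUMPTION: a joint fine-tuning pass on the resource-poor data;
        # the abstract does not spell this step out.
        Stage("finetune", ("encoder", "decoder"), poor),
    ]
```

With a Japanese dataset as `poor` and an English dataset as `rich`, only the first stage would ever see the English samples, which is how the decoder stays specialized in the resource-poor language.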


