TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming

ArXiv. Pub Date: 2024-02-15. DOI: 10.48550/arXiv.2402.09811

Dhruv Kudale, Badri Vishal Kasuba, Venkatapathy Subramanian, P. Chaudhuri, Ganesh Ramakrishnan


Citations: 0

Abstract

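Several recent deep learning (DL) based techniques perform well on image-based multilingual text detection. However, their performance relies heavily on the availability and quality of training data. Page-level document images come in numerous types and carry information in several modalities, languages, fonts, and layouts. This makes text detection a challenging problem in computer vision (CV), especially for low-resource languages and handwritten text. Furthermore, word-level labeled data for text detection is scarce, especially for multilingual settings and for Indian scripts that incorporate both printed and handwritten text. Conventionally, Indian script text detection requires training a DL model on plenty of labeled data, but to the best of our knowledge, no relevant datasets are available. Manual annotation of such data requires a lot of time, effort, and expertise. To solve this problem, we propose TEXTRON, a Data Programming-based approach in which users can plug various text detection methods into a weak supervision-based learning framework. One can view this approach to multilingual text detection as an ensemble of different CV-based techniques and DL approaches. TEXTRON can leverage the predictions of DL models pre-trained on a significant amount of language data, in conjunction with CV-based methods, to improve text detection in other languages. We demonstrate that TEXTRON can improve detection performance for documents written in Indian languages, despite the absence of corresponding labeled data. Further, through extensive experimentation, we show the improvement our approach brings over current state-of-the-art (SOTA) models, especially for handwritten Devanagari text. The code and dataset are available at https://github.com/IITB-LEAP-OCR/TEXTRON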
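To make the idea of plugging several detectors into a weak supervision framework concrete, the following is a minimal sketch, not the TEXTRON implementation. It assumes each detector is wrapped as a labeling function that votes TEXT, NON_TEXT, or ABSTAIN for every pixel, and it combines the votes with a plain majority vote as a stand-in for the label model that data programming would normally learn. All function names, the threshold, and the toy image are hypothetical.

```python
# Illustrative only: a toy pixel-level ensemble of labeling functions.
TEXT, NON_TEXT, ABSTAIN = 1, 0, -1


def lf_dark_pixels(image, threshold=128):
    """Hypothetical CV-style labeling function: dark pixels vote TEXT."""
    return [[TEXT if px < threshold else NON_TEXT for px in row] for row in image]


def lf_pretrained_boxes(image, boxes):
    """Hypothetical DL-style labeling function: pixels inside boxes predicted
    by a pre-trained detector vote TEXT; all other pixels abstain."""
    votes = [[ABSTAIN for _ in row] for row in image]
    for top, left, bottom, right in boxes:  # inclusive coordinates
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                votes[r][c] = TEXT
    return votes


def lf_page_margin(image, margin=1):
    """Hypothetical layout prior: the page margin votes NON_TEXT."""
    h, w = len(image), len(image[0])
    return [
        [
            NON_TEXT
            if r < margin or c < margin or r >= h - margin or c >= w - margin
            else ABSTAIN
            for c in range(w)
        ]
        for r in range(h)
    ]


def majority_vote(vote_maps):
    """Combine per-pixel votes, ignoring abstentions; ties fall to NON_TEXT.
    A learned label model would replace this simple rule."""
    h, w = len(vote_maps[0]), len(vote_maps[0][0])
    labels = []
    for r in range(h):
        row = []
        for c in range(w):
            votes = [m[r][c] for m in vote_maps if m[r][c] != ABSTAIN]
            row.append(TEXT if votes.count(TEXT) > votes.count(NON_TEXT) else NON_TEXT)
        labels.append(row)
    return labels


# Toy 4x4 grayscale "page": three dark pixels near the centre.
image = [
    [255, 255, 255, 255],
    [255, 20, 30, 255],
    [255, 25, 200, 255],
    [255, 255, 255, 255],
]
boxes = [(1, 1, 2, 2)]  # assumed output of a pre-trained detector

labels = majority_vote(
    [lf_dark_pixels(image), lf_pretrained_boxes(image, boxes), lf_page_margin(image)]
)
```

The resulting `labels` mask could then serve as a weak training target for a detection model. In this toy case the bright pixel inside the predicted box receives one TEXT and one NON_TEXT vote, so the tie-breaking rule marks it as NON_TEXT.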
