Semantic-Aware Video Text Detection

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Pub Date: 2021-06-01. DOI: 10.1109/CVPR46437.2021.00174

Wei Feng, Fei Yin, Xu-Yao Zhang, Cheng-Lin Liu


Citations: 17

Abstract

Most existing video text detection methods track texts with appearance features, which are easily influenced by changes in perspective and illumination. Compared with appearance features, semantic features are more robust cues for matching text instances. In this paper, we propose an end-to-end trainable video text detector that tracks texts based on semantic features. First, we introduce a new character center segmentation branch to extract semantic features, which encode the category and position of characters. Then we propose a novel appearance-semantic-geometry descriptor to track text instances, in which semantic features can improve the robustness against appearance changes. To overcome the lack of character-level annotations, we propose a novel weakly-supervised character center detection module, which uses only word-level annotated real images to generate character-level labels. The proposed method achieves state-of-the-art performance on three video text benchmarks (ICDAR 2013 Video, Minetto, and RT-1K) and two Chinese scene text benchmarks (CASIA10K and MSRA-TD500).
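
The abstract does not say how the appearance-semantic-geometry descriptor is computed or how instances are matched, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each detected text instance carries a hypothetical appearance vector, a hypothetical semantic vector, and an axis-aligned box. It also assumes the three cues are combined with equal weights and matched greedily across two frames. The names `similarity` and `match`, the weights, and the threshold are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def similarity(p, q, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of appearance, semantic, and geometry similarity (weights are assumed)."""
    wa, ws, wg = weights
    return (wa * cosine(p["appearance"], q["appearance"])
            + ws * cosine(p["semantic"], q["semantic"])
            + wg * iou(p["box"], q["box"]))


def match(prev, curr, threshold=0.5):
    """Greedy one-to-one matching between frames; returns sorted (prev_idx, curr_idx) pairs."""
    scored = sorted(
        ((similarity(p, q), i, j) for i, p in enumerate(prev) for j, q in enumerate(curr)),
        reverse=True,
    )
    used_prev, used_curr, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates are even less similar
        if i in used_prev or j in used_curr:
            continue
        used_prev.add(i)
        used_curr.add(j)
        pairs.append((i, j))
    return sorted(pairs)


# Toy data: appearance drifts between frames, semantic vectors stay stable.
prev = [
    {"appearance": [1.0, 0.0], "semantic": [1.0, 0.0, 0.0], "box": (0, 0, 10, 10)},
    {"appearance": [0.0, 1.0], "semantic": [0.0, 1.0, 0.0], "box": (20, 0, 30, 10)},
]
curr = [
    {"appearance": [0.6, 0.8], "semantic": [0.0, 1.0, 0.0], "box": (21, 0, 31, 10)},
    {"appearance": [0.8, 0.6], "semantic": [1.0, 0.0, 0.0], "box": (1, 0, 11, 10)},
]
matches = match(prev, curr)
print(matches)  # [(0, 1), (1, 0)]
```

In this toy example the appearance vectors have drifted between frames, while the semantic vectors and box overlap still tie each instance to its counterpart. This is the kind of robustness the abstract attributes to semantic features.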
