{"title":"ESRNet:用于任意形状场景文本检测的探索样本关系网络","authors":"Huageng Fan, Tongwei Lu","doi":"10.1007/s10489-024-05773-8","DOIUrl":null,"url":null,"abstract":"<div><p>Recently transformer-based scene text detection methods have been gradually investigated. However, these methods usually use attention to model visual content relationships in single sample, ignoring the relationships between samples. Exploring sample relationships enables feature propagation between samples, which facilitates detector to detect scene text images with more complex features. Aware of the challenges above, this paper proposes exploring sample relationships network (ESRNet) for detecting arbitrary-shaped texts. In detail, we construct the exploring sample relationships module (ESRM) to model sample relationships in the encoder, capturing interactions between all samples in each batch and propagating features across samples. Because of the inconsistency in batch sizes for training and testing leads to differences in exploring sample relationships between these two phases, so two-stream encoder method is used to solve the problem. Moreover, we propose location-aware factorized self-attention (LAFSA), which incorporates the sequential information of text polygon control points into the modeling and effectively improves the accuracy of label reading order in terms of visual features. Experimental results on multiple datasets demonstrate that ESRNet exhibits superior performance compared to other methods. Notably, ESRNet achieves F-measure of 88.9<span>\\(\\%\\)</span>, 88.4<span>\\(\\%\\)</span>, and 77.4<span>\\(\\%\\)</span> on the Total-Text, CTW1500, and ArT datasets, respectively.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"54 22","pages":"11995 - 12008"},"PeriodicalIF":3.4000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ESRNet: an exploring sample relationships network for arbitrary-shaped scene text detection\",\"authors\":\"Huageng Fan, Tongwei Lu\",\"doi\":\"10.1007/s10489-024-05773-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Recently transformer-based scene text detection methods have been gradually investigated. However, these methods usually use attention to model visual content relationships in single sample, ignoring the relationships between samples. Exploring sample relationships enables feature propagation between samples, which facilitates detector to detect scene text images with more complex features. Aware of the challenges above, this paper proposes exploring sample relationships network (ESRNet) for detecting arbitrary-shaped texts. In detail, we construct the exploring sample relationships module (ESRM) to model sample relationships in the encoder, capturing interactions between all samples in each batch and propagating features across samples. Because of the inconsistency in batch sizes for training and testing leads to differences in exploring sample relationships between these two phases, so two-stream encoder method is used to solve the problem. Moreover, we propose location-aware factorized self-attention (LAFSA), which incorporates the sequential information of text polygon control points into the modeling and effectively improves the accuracy of label reading order in terms of visual features. Experimental results on multiple datasets demonstrate that ESRNet exhibits superior performance compared to other methods. 
Notably, ESRNet achieves F-measure of 88.9<span>\\\\(\\\\%\\\\)</span>, 88.4<span>\\\\(\\\\%\\\\)</span>, and 77.4<span>\\\\(\\\\%\\\\)</span> on the Total-Text, CTW1500, and ArT datasets, respectively.</p></div>\",\"PeriodicalId\":8041,\"journal\":{\"name\":\"Applied Intelligence\",\"volume\":\"54 22\",\"pages\":\"11995 - 12008\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10489-024-05773-8\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-024-05773-8","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
ESRNet: an exploring sample relationships network for arbitrary-shaped scene text detection
Transformer-based scene text detection methods have recently received growing attention. However, these methods usually use attention to model visual content relationships within a single sample, ignoring the relationships between samples. Exploring sample relationships enables feature propagation between samples, which helps the detector handle scene text images with more complex features. Aware of these challenges, this paper proposes the exploring sample relationships network (ESRNet) for detecting arbitrary-shaped text. Specifically, we construct the exploring sample relationships module (ESRM) to model sample relationships in the encoder, capturing interactions among all samples in each batch and propagating features across samples. Because the batch sizes used for training and testing differ, the exploration of sample relationships behaves differently in the two phases; a two-stream encoder is therefore used to resolve this inconsistency. Moreover, we propose location-aware factorized self-attention (LAFSA), which incorporates the sequential information of text polygon control points into the modeling and effectively improves the accuracy of the label reading order in terms of visual features. Experimental results on multiple datasets demonstrate that ESRNet outperforms other methods. Notably, ESRNet achieves F-measures of 88.9%, 88.4%, and 77.4% on the Total-Text, CTW1500, and ArT datasets, respectively.
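To illustrate the core idea only (this is not the authors' implementation), below is a minimal PyTorch sketch of batch-level attention in the spirit of ESRM: tokens from all samples in a batch are concatenated so that self-attention can propagate features across samples as well as within them. The class name CrossSampleAttention, the tensor shapes, and the use of nn.MultiheadAttention are assumptions made for illustration.

```python
# Hypothetical sketch of cross-sample ("sample relationship") attention,
# loosely following the ESRM idea described in the abstract.
# Names, shapes, and layer choices are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossSampleAttention(nn.Module):
    """Lets every sample in a batch attend to every other sample's features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) visual tokens per sample.
        b, n, c = x.shape
        # Flatten the batch into one long token sequence so attention can
        # propagate features across samples, not only within one image.
        tokens = x.reshape(1, b * n, c)
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.reshape(b, n, c)
        # Residual connection plus normalization, as is common in encoders.
        return self.norm(x + out)


# Usage: a batch of 4 samples, 100 tokens each, feature dimension 256.
feats = CrossSampleAttention(dim=256)(torch.randn(4, 100, 256))
```

Note that such a module depends on the batch composition, which is consistent with the abstract's remark that differing train/test batch sizes motivate a two-stream encoder.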
Journal introduction:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.