Saliency information and mosaic based data augmentation method for densely occluded object recognition
Pub Date: 2024-03-29. DOI: 10.1007/s10044-024-01258-z
Ying Tong, Xiangfeng Luo, Liyan Ma, Shaorong Xie, Wenbin Yang, Yinsai Guo
Data augmentation methods are crucial for improving the accuracy of densely occluded object recognition in scenes where the quantity and diversity of training images are insufficient. However, current methods based on regional dropping and mixing strategies suffer from missing foreground objects and redundant background features, which degrades the recognition of densely occluded objects in classification and detection tasks. Herein, a saliency information and mosaic based data augmentation method for densely occluded object recognition is proposed, which uses saliency information as prior knowledge to supervise the mosaic process applied to training images containing densely occluded objects. The method further applies fogging and class label mixing to construct new augmented images, increasing the quantity and diversity of training images and thereby improving the accuracy of image classification and object recognition. Extensive experiments on different classification datasets with various CNN architectures demonstrate the effectiveness of the method.
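The abstract gives the pipeline only at a high level, so the following is a minimal sketch of how saliency-supervised mosaicking with fogging and label mixing might look; the spectral-residual saliency model (from opencv-contrib), the white-blend fogging, the 2x2 layout, and the equal 1/4 label weights are assumptions, not the authors' exact formulation.

```python
import numpy as np
import cv2  # the cv2.saliency module below requires the opencv-contrib build


def saliency_map(img):
    # Spectral-residual saliency as a stand-in for the paper's saliency prior.
    sal = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, smap = sal.computeSaliency(img)
    return smap.astype(np.float32) if ok else np.ones(img.shape[:2], np.float32)


def salient_crop(img, out_h, out_w, stride=16):
    # Choose the crop window with the largest summed saliency, so the salient
    # foreground objects are preserved instead of being dropped.
    smap = saliency_map(img)
    integ = cv2.integral(smap)  # (h+1, w+1) integral image
    h, w = smap.shape
    best, best_yx = -1.0, (0, 0)
    for y in range(0, h - out_h + 1, stride):
        for x in range(0, w - out_w + 1, stride):
            s = (integ[y + out_h, x + out_w] - integ[y, x + out_w]
                 - integ[y + out_h, x] + integ[y, x])
            if s > best:
                best, best_yx = s, (y, x)
    y, x = best_yx
    return img[y:y + out_h, x:x + out_w]


def fog(img, strength=0.3):
    # Simple fogging: blend the tile toward white (an assumed stand-in).
    return cv2.addWeighted(img, 1.0 - strength, np.full_like(img, 255), strength, 0)


def saliency_mosaic(images, labels, num_classes, size=448):
    # 2x2 mosaic of saliency-preserving, fogged crops; class labels are mixed
    # in proportion to tile area (equal quarters here).
    half = size // 2
    tiles = [fog(salient_crop(cv2.resize(im, (size, size)), half, half))
             for im in images[:4]]
    mosaic = np.vstack([np.hstack(tiles[:2]), np.hstack(tiles[2:])])
    mixed = np.zeros(num_classes, np.float32)
    for lbl in labels[:4]:
        mixed[lbl] += 0.25
    return mosaic, mixed
```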
{"title":"Saliency information and mosaic based data augmentation method for densely occluded object recognition","authors":"Ying Tong, Xiangfeng Luo, Liyan Ma, Shaorong Xie, Wenbin Yang, Yinsai Guo","doi":"10.1007/s10044-024-01258-z","DOIUrl":"https://doi.org/10.1007/s10044-024-01258-z","url":null,"abstract":"<p>Data augmentation methods are crucial to improve the accuracy of densely occluded object recognition in the scene where the quantity and diversity of training images are insufficient. However, the current methods that use regional dropping and mixing strategies suffer from the problem of missing foreground objects and redundant background features, which can lead to densely occluded object recognition issues in classification or detection tasks. Herein, saliency information and mosaic based data augmentation method for densely occluded object recognition is proposed, which utilizes saliency information as prior knowledge to supervise the mosaic process of training images containing densely occluded objects. And the method uses fogging processing and class label mixing to construct new augmented images, in order to improve the accuracy of image classification and object recognition tasks by augmenting the quantity and diversity of training images. Extensive experiments on different classification datasets with various CNN architectures prove the effectiveness of our method.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"43 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scene text detection using structured information and an end-to-end trainable generative adversarial networks
Pub Date: 2024-03-19. DOI: 10.1007/s10044-024-01259-y
Palanichamy Naveen, Mahmoud Hassaballah
Scene text detection poses a considerable challenge due to the diverse nature of text appearance, backgrounds, and orientations. Enhancing robustness, accuracy, and efficiency in this context is vital for several applications, such as optical character recognition, image understanding, and autonomous vehicles. This paper explores the integration of a generative adversarial network (GAN) and a variational autoencoder (VAE) to create a robust and potent text detection network. The proposed architecture comprises three interconnected modules: the VAE module, the GAN module, and the text detection module. In this framework, the VAE module plays a pivotal role in generating diverse and variable text regions. Subsequently, the GAN module refines and enhances these regions, ensuring heightened realism and accuracy. The text detection module then identifies text regions in the input image by assigning confidence scores to each region. The entire network is trained by minimizing a joint loss function that encompasses the VAE loss, the GAN loss, and the text detection loss: the VAE loss ensures diversity in the generated text regions, the GAN loss guarantees realism and accuracy, and the text detection loss ensures high-precision identification of text regions. The proposed method employs an encoder-decoder structure within the VAE module and a generator-discriminator structure in the GAN module. Rigorous testing on diverse datasets, including Total-Text, CTW1500, ICDAR 2015, ICDAR 2017, ReCTS, TD500, COCO-Text, SynthText, Street View Text, and KAIST Scene Text, demonstrates the superior performance of the proposed method compared to existing approaches.
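As a reading aid, here is a minimal PyTorch-style sketch of the joint objective described above; the concrete loss choices (MSE plus KL divergence for the VAE term, binary cross-entropy for the adversarial and detection terms) and the unit weights are assumptions, since the paper's exact formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F


def vae_loss(recon, target, mu, logvar):
    # Standard VAE objective: reconstruction error plus KL divergence.
    rec = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld


def gan_loss(d_real, d_fake):
    # Binary cross-entropy adversarial loss on discriminator logits (assumed form).
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake


def detection_loss(pred_scores, gt_labels):
    # Confidence-score supervision for candidate text regions (text vs. non-text).
    return F.binary_cross_entropy_with_logits(pred_scores, gt_labels)


def joint_loss(recon, target, mu, logvar, d_real, d_fake, pred_scores, gt_labels,
               w_vae=1.0, w_gan=1.0, w_det=1.0):
    # Weighted sum of the three terms; the weights are placeholders, not the paper's values.
    return (w_vae * vae_loss(recon, target, mu, logvar)
            + w_gan * gan_loss(d_real, d_fake)
            + w_det * detection_loss(pred_scores, gt_labels))
```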
{"title":"Scene text detection using structured information and an end-to-end trainable generative adversarial networks","authors":"Palanichamy Naveen, Mahmoud Hassaballah","doi":"10.1007/s10044-024-01259-y","DOIUrl":"https://doi.org/10.1007/s10044-024-01259-y","url":null,"abstract":"<p>Scene text detection poses a considerable challenge due to the diverse nature of text appearance, backgrounds, and orientations. Enhancing robustness, accuracy, and efficiency in this context is vital for several applications, such as optical character recognition, image understanding, and autonomous vehicles. This paper explores the integration of generative adversarial network (GAN) and network variational autoencoder (VAE) to create a robust and potent text detection network. The proposed architecture comprises three interconnected modules: the VAE module, the GAN module, and the text detection module. In this framework, the VAE module plays a pivotal role in generating diverse and variable text regions. Subsequently, the GAN module refines and enhances these regions, ensuring heightened realism and accuracy. Then, the text detection module takes charge of identifying text regions in the input image via assigning confidence scores to each region. The comprehensive training of the entire network involves minimizing a joint loss function that encompasses the VAE loss, the GAN loss, and the text detection loss. The VAE loss ensures diversity in generated text regions and the GAN loss guarantees realism and accuracy, while the text detection loss ensures high-precision identification of text regions. The proposed method employs an encoder-decoder structure within the VAE module and a generator-discriminator structure in the GAN module. Rigorous testing on diverse datasets including Total-Text, CTW1500, ICDAR 2015, ICDAR 2017, ReCTS, TD500, COCO-Text, SynthText, Street View Text, and KIAST Scene Text demonstrates the superior performance of the proposed method compared to existing approaches.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"1 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140168470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DA-ResNet: dual-stream ResNet with attention mechanism for classroom video summary
Pub Date: 2024-03-14. DOI: 10.1007/s10044-024-01256-1
Yuxiang Wu, Xiaoyan Wang, Tianpan Chen, Yan Dou
It is important to generate video summaries that are both diverse and representative for massive video collections. In this paper, a convolutional neural network based on a dual-stream attention mechanism (DA-ResNet) is designed to obtain candidate summary sequences for classroom scenes. DA-ResNet constructs a dual-stream input of an image frame sequence and an optical flow frame sequence to enhance its representational ability, and embeds the attention mechanism into ResNet. The final video summary is then obtained by removing redundant frames with an improved hash clustering algorithm: preprocessing is performed first to reduce computational complexity, and hash clustering then retains the frame with the highest entropy value in each cluster while removing the other, similar frames. To verify its effectiveness in classroom scenes, we also created ClassVideo, a real dataset consisting of 45 videos from the normal teaching environment of our school. Experimental results show the competitiveness of the proposed method: DA-ResNet outperforms existing methods by about 8% in terms of the F-measure. The visual results also demonstrate its ability to produce classroom video summaries that are close to human preferences.
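A minimal sketch of the redundancy-removal step as described (hash clustering that keeps the highest-entropy frame per cluster) might look like the following; the average hash, the Hamming-distance threshold, and the greedy cluster assignment are assumptions rather than the authors' exact algorithm.

```python
import numpy as np
import cv2


def average_hash(frame, hash_size=8):
    # Average hash: downsample to an 8x8 grayscale patch and threshold at its mean,
    # giving a 64-bit boolean signature.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (hash_size, hash_size), interpolation=cv2.INTER_AREA)
    return (small > small.mean()).flatten()


def frame_entropy(frame):
    # Shannon entropy of the grayscale histogram, used to rank frames in a cluster.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = np.bincount(gray.flatten(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def hash_cluster_summary(frames, threshold=10):
    # Greedy clustering: a frame joins the first cluster whose hash is within
    # `threshold` Hamming distance; otherwise it starts a new cluster. Only the
    # highest-entropy frame of each cluster is kept in the summary.
    clusters = []  # each entry: {"hash": ..., "best_frame": ..., "best_entropy": ...}
    for f in frames:
        h, e = average_hash(f), frame_entropy(f)
        for c in clusters:
            if np.count_nonzero(c["hash"] != h) <= threshold:
                if e > c["best_entropy"]:
                    c["best_frame"], c["best_entropy"] = f, e
                break
        else:
            clusters.append({"hash": h, "best_frame": f, "best_entropy": e})
    return [c["best_frame"] for c in clusters]
```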
{"title":"DA-ResNet: dual-stream ResNet with attention mechanism for classroom video summary","authors":"Yuxiang Wu, Xiaoyan Wang, Tianpan Chen, Yan Dou","doi":"10.1007/s10044-024-01256-1","DOIUrl":"https://doi.org/10.1007/s10044-024-01256-1","url":null,"abstract":"<p>It is important to generate both diverse and representative video summary for massive videos. In this paper, a convolution neural network based on dual-stream attention mechanism(DA-ResNet) is designed to obtain candidate summary sequences for classroom scenes. DA-ResNet constructs a dual stream input of image frame sequence and optical flow frame sequence to enhance the expression ability. The network also embeds the attention mechanism into ResNet. On the other hand, the final video summary is obtained by removing redundant frames with the improved hash clustering algorithm. In this process, preprocessing is performed first to reduce computational complexity. And then hash clustering is used to retain the frame with the highest entropy value in each class, removing other similar frames. To verify its effectiveness in classroom scenes, we also created ClassVideo, a real dataset consisting of 45 videos from the normal teaching environment of our school. The results of the experiments show the competitiveness of the proposed method DA-ResNet outperforms the existing methods by about 8% in terms of the F-measure. Besides, the visual results also demonstrate its ability to produce classroom video summaries that are very close to the human preferences.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"20 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The article presents a novel methodology comprising an end-to-end workflow for processing Venus' visible images. The visible raw image is denoised using a Tri-State median filter with background dark subtraction and then enhanced using Contrast Limited Adaptive Histogram Equalization. A multi-modal image registration technique is developed using a Segmented Affine Scale Invariant Feature Transform and Motion Smoothness Constraint outlier removal for co-registration of the Venus visible and radar images. A novel image fusion algorithm using a guided filter is developed to merge the multi-modal visible-radar Venus image pair into a fused image. Venus visible image quality is assessed at each processing step, and the results are quantified and visualized. In addition, a fuzzy color-coded segmentation map is generated to retrieve crucial information about Venus' surface feature characteristics. The fused Venus image clearly demarcates planetary morphological features and is validated against the publicly available Venus radar nomenclature map.
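A rough sketch of such a processing chain using standard OpenCV operations is given below; the plain median filter, ORB-plus-RANSAC registration, and Laplacian-contrast fusion weights are simplified stand-ins for the paper's Tri-State median filter, Segmented Affine SIFT registration with Motion Smoothness Constraint outlier removal, and guided-filter fusion rule, and cv2.ximgproc requires the opencv-contrib build. Inputs are assumed to be single-channel 8-bit images.

```python
import numpy as np
import cv2  # cv2.ximgproc.guidedFilter needs the opencv-contrib build


def denoise(visible, dark_frame, ksize=3):
    # Background dark subtraction followed by a plain median filter
    # (a stand-in for the Tri-State median filter).
    return cv2.medianBlur(cv2.subtract(visible, dark_frame), ksize)


def enhance(img):
    # Contrast Limited Adaptive Histogram Equalization.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)


def register(moving, fixed):
    # ORB features + RANSAC homography as a simplified stand-in for the paper's
    # Segmented Affine SIFT registration with outlier removal.
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(moving, None)
    k2, d2 = orb.detectAndCompute(fixed, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return cv2.warpPerspective(moving, H, (fixed.shape[1], fixed.shape[0]))


def fuse(visible, radar, radius=8, eps=1e-3):
    # Guided-filter fusion: per-pixel weights come from guided-filtered local
    # contrast maps (one plausible guided-filter fusion scheme).
    def contrast(img):
        g = img.astype(np.float32)
        lap = np.abs(cv2.Laplacian(g, cv2.CV_32F))
        return cv2.ximgproc.guidedFilter(g, lap, radius, eps)

    wv, wr = contrast(visible), contrast(radar)
    w = wv / (wv + wr + 1e-6)
    return (w * visible + (1.0 - w) * radar).astype(np.uint8)
```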