Progressive Scene Text Erasing with Self-Supervision
Pub Date: 2022-07-23 | DOI: 10.48550/arXiv.2207.11469
Xiangcheng Du, Zhao Zhou, Yingbin Zheng, Xingjiao Wu, Tianlong Ma, Cheng Jin
Scene text erasing seeks to remove text content from scene images, and current state-of-the-art text erasing models are trained on large-scale synthetic data. Although synthetic data engines can provide vast amounts of annotated training samples, there are differences between synthetic and real-world data. In this paper, we employ self-supervision to learn feature representations from unlabeled real-world scene text images. A novel pretext task is designed to keep the text stroke masks of image variants consistent. We also design a Progressive Erasing Network to remove residual text: the scene text is erased progressively, with the intermediate results providing the foundation for subsequent, higher-quality results. Experiments show that our method significantly improves the generalization of the text erasing task and achieves state-of-the-art performance on public benchmarks.
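A minimal sketch of the kind of consistency pretext objective described in the abstract, assuming a PyTorch setup; StrokeMaskNet and the photometric perturbation are illustrative stand-ins, not the authors' architecture. Two variants of the same unlabeled image are encoded and their predicted stroke masks are pushed to agree.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StrokeMaskNet(nn.Module):
    """Hypothetical stroke-mask predictor standing in for the paper's encoder."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.backbone(x))  # soft text-stroke mask in [0, 1]

def consistency_loss(model, view_a, view_b):
    """Encourage the stroke masks of two variants of the same image to agree."""
    return F.l1_loss(model(view_a), model(view_b))

# Usage: view_a / view_b are two photometric variants of the same unlabeled image
# (spatial augmentations would additionally require warping the masks back).
model = StrokeMaskNet()
view_a = torch.rand(4, 3, 64, 256)
view_b = view_a + 0.05 * torch.randn_like(view_a)  # stand-in for color jitter
loss = consistency_loss(model, view_a, view_b)
loss.backward()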
{"title":"Progressive Scene Text Erasing with Self-Supervision","authors":"Xiangcheng Du, Zhao Zhou, Yingbin Zheng, Xingjiao Wu, Tianlong Ma, Cheng Jin","doi":"10.48550/arXiv.2207.11469","DOIUrl":"https://doi.org/10.48550/arXiv.2207.11469","url":null,"abstract":"Scene text erasing seeks to erase text contents from scene images and current state-of-the-art text erasing models are trained on large-scale synthetic data. Although data synthetic engines can provide vast amounts of annotated training samples, there are differences between synthetic and real-world data. In this paper, we employ self-supervision for feature representation on unlabeled real-world scene text images. A novel pretext task is designed to keep consistent among text stroke masks of image variants. We design the Progressive Erasing Network in order to remove residual texts. The scene text is erased progressively by leveraging the intermediate generated results which provide the foundation for subsequent higher quality results. Experiments show that our method significantly improves the generalization of the text erasing task and achieves state-of-the-art performance on public benchmarks.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74485378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Framework for Few-shot Skeleton-based Temporal Action Segmentation
Pub Date: 2022-07-20 | DOI: 10.48550/arXiv.2207.09925
Leiyang Xu, Qianqian Wang, Xiaotian Lin, Lin Yuan
Temporal action segmentation (TAS) aims to classify and locate actions in long untrimmed action sequences. With the success of deep learning, many deep models for action segmentation have emerged; however, few-shot TAS remains a challenging problem. This study proposes an efficient framework for few-shot skeleton-based TAS, comprising a data augmentation method and an improved model. The data augmentation approach, based on motion interpolation, addresses the problem of insufficient data and can significantly increase the number of samples by synthesizing action sequences. In addition, we concatenate a Connectionist Temporal Classification (CTC) layer with a network designed for skeleton-based TAS to obtain an optimized model. Leveraging CTC enhances the temporal alignment between predictions and ground truth and further improves the segment-wise metrics of the segmentation results. Extensive experiments on both public and self-constructed datasets, including two small-scale datasets and one large-scale dataset, show the effectiveness of the two proposed methods in improving the performance of few-shot skeleton-based TAS.
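A rough sketch of the motion-interpolation idea, under the assumption that skeleton sequences are arrays of shape (frames, joints, 3); the function name and interpolation factor are illustrative, not taken from the paper. New poses are synthesized by linearly interpolating between consecutive frames, lengthening a sequence without new recordings.

import numpy as np

def interpolate_motion(sequence: np.ndarray, factor: int = 2) -> np.ndarray:
    """Linearly insert poses between consecutive skeleton frames.

    sequence: array of shape (T, J, 3) holding T frames of J 3-D joints.
    Returns an array of shape ((T - 1) * factor + 1, J, 3).
    """
    frames = [sequence[0]]
    for prev, nxt in zip(sequence[:-1], sequence[1:]):
        for k in range(1, factor + 1):
            alpha = k / factor
            frames.append((1.0 - alpha) * prev + alpha * nxt)
    return np.stack(frames)

# Usage: double the temporal resolution of a 100-frame, 25-joint sequence.
seq = np.random.randn(100, 25, 3)
augmented = interpolate_motion(seq, factor=2)
print(augmented.shape)  # (199, 25, 3)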
{"title":"An Efficient Framework for Few-shot Skeleton-based Temporal Action Segmentation","authors":"Leiyang Xu, Qianqian Wang, Xiaotian Lin, Lin Yuan","doi":"10.48550/arXiv.2207.09925","DOIUrl":"https://doi.org/10.48550/arXiv.2207.09925","url":null,"abstract":"Temporal action segmentation (TAS) aims to classify and locate actions in the long untrimmed action sequence. With the success of deep learning, many deep models for action segmentation have emerged. However, few-shot TAS is still a challenging problem. This study proposes an efficient framework for the few-shot skeleton-based TAS, including a data augmentation method and an improved model. The data augmentation approach based on motion interpolation is presented here to solve the problem of insufficient data, and can increase the number of samples significantly by synthesizing action sequences. Besides, we concatenate a Connectionist Temporal Classification (CTC) layer with a network designed for skeleton-based TAS to obtain an optimized model. Leveraging CTC can enhance the temporal alignment between prediction and ground truth and further improve the segment-wise metrics of segmentation results. Extensive experiments on both public and self-constructed datasets, including two small-scale datasets and one large-scale dataset, show the effectiveness of two proposed methods in improving the performance of the few-shot skeleton-based TAS task.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90475265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection
Pub Date: 2022-07-16 | DOI: 10.48550/arXiv.2207.08003
Antonio Bărbălău, Radu Tudor Ionescu, Mariana-Iuliana Georgescu, J. Dueholm, B. Ramachandra, Kamal Nasrollahi, F. Khan, T. Moeslund, M. Shah
A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in the literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework and propose several updates to the original method. First, we study various detection methods, e.g. detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal: objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers; accordingly, we introduce both 2D and 3D convolutional vision transformer (CvT) blocks as alternatives. Third, in an attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.
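As a hedged illustration (not the authors' code), the following shows the kind of multi-head self-attention block that can be interleaved with a 3D convolutional backbone: the spatio-temporal feature map is flattened into a token sequence, passed through nn.MultiheadAttention with a residual connection, and reshaped back. The class name and layer sizes are assumptions for the sketch.

import torch
import torch.nn as nn

class SelfAttention3D(nn.Module):
    """Illustrative multi-head self-attention over a 3D (T, H, W) feature map."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (B, C, T, H, W) -> token sequence (B, T*H*W, C)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)              # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)

# Usage: drop the block between two 3D convolutions of a video backbone.
block = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1),
    SelfAttention3D(32),
    nn.Conv3d(32, 32, kernel_size=3, padding=1),
)
out = block(torch.randn(2, 3, 8, 16, 16))
print(out.shape)  # torch.Size([2, 32, 8, 16, 16])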
{"title":"SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection","authors":"Antonio Bărbălău, Radu Tudor Ionescu, Mariana-Iuliana Georgescu, J. Dueholm, B. Ramachandra, Kamal Nasrollahi, F. Khan, T. Moeslund, M. Shah","doi":"10.48550/arXiv.2207.08003","DOIUrl":"https://doi.org/10.48550/arXiv.2207.08003","url":null,"abstract":"A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g. based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g. objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79033053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification
Pub Date: 2022-07-09 | DOI: 10.48550/arXiv.2207.04224
Xin Jia, Changlei Dongye, Yan-Tsung Peng
RGB-D salient object detection (SOD) uses depth information to handle challenging scenes and obtain high-quality saliency maps. Existing state-of-the-art RGB-D saliency detection methods overwhelmingly rely on directly fusing depth information. Although these methods improve the accuracy of saliency prediction through various cross-modality fusion strategies, misleading information from poor-quality depth images can degrade the predictions. To address this issue, we propose a novel RGB-D salient object detection model, SiaTrans, which is trained on depth image quality classification at the same time as SOD. Exploiting the information that RGB and depth images share about salient objects, SiaTrans uses a Siamese transformer network with shared weights as the encoder and extracts RGB and depth features concatenated along the batch dimension, saving space without compromising performance. SiaTrans uses the class token of the backbone network (T2T-ViT) to classify the quality of depth images without interrupting the token sequence used for the saliency detection task. A transformer-based cross-modality fusion module (CMF) effectively fuses RGB and depth information, and at test time CMF can either fuse the cross-modality information or enhance the RGB information, depending on the quality classification of the depth image. The greatest benefit of the proposed CMF and decoder is that they keep RGB and RGB-D decoding consistent: during testing, SiaTrans decodes RGB-D or RGB information under the same model parameters, according to the classification signal. Comprehensive experiments on nine RGB-D SOD benchmark datasets show that SiaTrans achieves the best overall performance with the least computation compared with recent state-of-the-art methods.
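A hedged sketch of the weight-sharing and test-time switching described above; the encoder and fusion modules below are stand-ins, not the SiaTrans code. RGB and depth images are stacked along the batch dimension so a single encoder processes both, and a depth-quality flag decides whether to fuse the two streams or fall back to the RGB features alone.

import torch
import torch.nn as nn

encoder = nn.Sequential(            # stand-in for the shared T2T-ViT encoder
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
)
fusion = nn.Conv2d(128, 64, 1)      # stand-in for the cross-modality fusion (CMF)

def forward(rgb, depth, depth_is_reliable: bool):
    # Share one set of encoder weights by concatenating on the batch dimension.
    depth_3ch = depth.repeat(1, 3, 1, 1)             # replicate 1-channel depth
    feats = encoder(torch.cat([rgb, depth_3ch], dim=0))
    rgb_feat, depth_feat = feats.chunk(2, dim=0)
    if depth_is_reliable:                            # quality signal (e.g. from a class token)
        return fusion(torch.cat([rgb_feat, depth_feat], dim=1))
    return rgb_feat                                  # decode RGB-only features instead

out = forward(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224), depth_is_reliable=True)
print(out.shape)  # torch.Size([2, 64, 224, 224])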
{"title":"SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification","authors":"Xin Jia, Changlei Dongye, Yan-Tsung Peng","doi":"10.48550/arXiv.2207.04224","DOIUrl":"https://doi.org/10.48550/arXiv.2207.04224","url":null,"abstract":"RGB-D SOD uses depth information to handle challenging scenes and obtain high-quality saliency maps. Existing state-of-the-art RGB-D saliency detection methods overwhelmingly rely on the strategy of directly fusing depth information. Although these methods improve the accuracy of saliency prediction through various cross-modality fusion strategies, misinformation provided by some poor-quality depth images can affect the saliency prediction result. To address this issue, a novel RGB-D salient object detection model (SiaTrans) is proposed in this paper, which allows training on depth image quality classification at the same time as training on SOD. In light of the common information between RGB and depth images on salient objects, SiaTrans uses a Siamese transformer network with shared weight parameters as the encoder and extracts RGB and depth features concatenated on the batch dimension, saving space resources without compromising performance. SiaTrans uses the Class token in the backbone network (T2T-ViT) to classify the quality of depth images without preventing the token sequence from going on with the saliency detection task. Transformer-based cross-modality fusion module (CMF) can effectively fuse RGB and depth information. And in the testing process, CMF can choose to fuse cross-modality information or enhance RGB information according to the quality classification signal of the depth image. The greatest benefit of our designed CMF and decoder is that they maintain the consistency of RGB and RGB-D information decoding: SiaTrans decodes RGB-D or RGB information under the same model parameters according to the classification signal during testing. Comprehensive experiments on nine RGB-D SOD benchmark datasets show that SiaTrans has the best overall performance and the least computation compared with recent state-of-the-art methods.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76098488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PoseGU: 3D Human Pose Estimation with Novel Human Pose Generator and Unbiased Learning
Pub Date: 2022-07-07 | DOI: 10.48550/arXiv.2207.03618
S. Guan, Haiyan Lu, Linchao Zhu, Gengfa Fang
3D pose estimation has recently gained substantial interest in the computer vision community. Existing 3D pose estimation methods rely heavily on large, well-annotated 3D pose datasets, and they suffer from poor generalization to unseen poses due to the limited diversity of 3D poses in the training sets. In this work, we propose PoseGU, a novel human pose generator that produces diverse poses with access to only a small set of seed samples, while employing Counterfactual Risk Minimization to pursue an unbiased evaluation objective. Extensive experiments demonstrate that PoseGU outperforms almost all the state-of-the-art 3D human pose methods under consideration on three popular benchmark datasets. Empirical analysis also shows that PoseGU generates 3D poses with improved data diversity and better generalization ability.
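For context, a sketch of the generic Counterfactual Risk Minimization objective (clipped importance-weighted risk plus a variance penalty); how PoseGU couples this with generated poses is not specified in the abstract, so the function below is only an assumed, generic form with illustrative names.

import torch

def crm_objective(losses, new_probs, logging_probs, clip=10.0, var_weight=0.1):
    """Clipped importance-weighted risk with a variance penalty (CRM-style).

    losses:        per-sample loss under the current model, shape (N,)
    new_probs:     probability of each sample under the current (target) policy
    logging_probs: probability under the distribution that generated the samples
    """
    weights = torch.clamp(new_probs / logging_probs, max=clip)   # clipped IPS weights
    weighted = weights * losses
    risk = weighted.mean()
    penalty = (weighted.var() / len(losses)).sqrt()              # empirical-variance regularizer
    return risk + var_weight * penalty

# Usage with toy values.
losses = torch.rand(32)
new_p, log_p = torch.rand(32) + 0.1, torch.rand(32) + 0.1
print(crm_objective(losses, new_p, log_p))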
{"title":"PoseGU: 3D Human Pose Estimation with Novel Human Pose Generator and Unbiased Learning","authors":"S. Guan, Haiyan Lu, Linchao Zhu, Gengfa Fang","doi":"10.48550/arXiv.2207.03618","DOIUrl":"https://doi.org/10.48550/arXiv.2207.03618","url":null,"abstract":"3D pose estimation has recently gained substantial interests in computer vision domain. Existing 3D pose estimation methods have a strong reliance on large size well-annotated 3D pose datasets, and they suffer poor model generalization on unseen poses due to limited diversity of 3D poses in training sets. In this work, we propose PoseGU, a novel human pose generator that generates diverse poses with access only to a small size of seed samples, while equipping the Counterfactual Risk Minimization to pursue an unbiased evaluation objective. Extensive experiments demonstrate PoseGU outforms almost all the state-of-the-art 3D human pose methods under consideration over three popular benchmark datasets. Empirical analysis also proves PoseGU generates 3D poses with improved data diversity and better generalization ability.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85834408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Physically-admissible polarimetric data augmentation for road-scene analysis
Pub Date: 2022-06-01 | DOI: 10.48550/arXiv.2206.07431
Cyprien Ruffino, Rachel Blin, Samia Ainouz, G. Gasso, Romain Hérault, F. Mériaudeau, S. Canu
Polarimetric imaging, combined with deep learning, has shown improved performance on various tasks, including scene analysis. However, its robustness may be questioned because of the small size of the available training datasets. Although the issue could be addressed by data augmentation, polarization modalities are subject to physical feasibility constraints that classical data augmentation techniques do not address. To tackle this, we propose to use CycleGAN, an image translation technique based on deep generative models that relies solely on unpaired data, to transfer large labeled road scene datasets to the polarimetric domain. We design several auxiliary loss terms that, alongside the CycleGAN losses, handle the physical constraints of polarimetric images. The efficiency of this solution is demonstrated on road scene object detection tasks, where the generated realistic polarimetric images improve car and pedestrian detection performance by up to 9%. The resulting constrained CycleGAN is publicly released, allowing anyone to generate their own polarimetric images.
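To make the notion of a physical admissibility constraint concrete, here is a sketch of one such auxiliary term, assuming the generated image holds the four linear-polarization intensities (0, 45, 90, 135 degrees) as channels; for those measurements, I0 + I90 and I45 + I135 must both equal the total intensity. The channel layout and the choice of this particular constraint are assumptions for illustration; the paper's auxiliary losses may differ.

import torch

def polarimetric_admissibility_loss(fake):
    """Penalize violation of the intensity consistency I0 + I90 = I45 + I135.

    fake: generated polarimetric image of shape (B, 4, H, W), whose channels are
    assumed here to be the intensities at 0, 45, 90 and 135 degrees.
    """
    i0, i45, i90, i135 = fake[:, 0], fake[:, 1], fake[:, 2], fake[:, 3]
    return torch.mean(torch.abs((i0 + i90) - (i45 + i135)))

# Usage: add the term, with some weight, to the CycleGAN generator loss.
fake = torch.rand(2, 4, 256, 256)
print(polarimetric_admissibility_loss(fake))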
{"title":"Physically-admissible polarimetric data augmentation for road-scene analysis","authors":"Cyprien Ruffino, Rachel Blin, Samia Ainouz, G. Gasso, Romain H'erault, F. Mériaudeau, S. Canu","doi":"10.48550/arXiv.2206.07431","DOIUrl":"https://doi.org/10.48550/arXiv.2206.07431","url":null,"abstract":"Polarimetric imaging, along with deep learning, has shown improved performances on different tasks including scene analysis. However, its robustness may be questioned because of the small size of the training datasets. Though the issue could be solved by data augmentation, polarization modalities are subject to physical feasibility constraints unaddressed by classical data augmentation techniques. To address this issue, we propose to use CycleGAN, an image translation technique based on deep generative models that solely relies on unpaired data, to transfer large labeled road scene datasets to the polarimetric domain. We design several auxiliary loss terms that, alongside the CycleGAN losses, deal with the physical constraints of polarimetric images. The efficiency of this solution is demonstrated on road scene object detection tasks where generated realistic polarimetric images allow to improve performances on cars and pedestrian detection up to 9%. The resulting constrained CycleGAN is publicly released, allowing anyone to generate their own polarimetric images.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85407885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}