Reinforced Pedestrian Attribute Recognition with Group Optimization Reward
Pub Date: 2022-05-21  DOI: 10.48550/arXiv.2205.14042
Zhong Ji, Zhenfei Hu, Yaodong Wang, Shengjia Li
Pedestrian Attribute Recognition (PAR) is a challenging task in intelligent video surveillance. Two key challenges in PAR are the complex alignment relations between images and attributes and the imbalanced data distribution. Existing approaches usually formulate PAR as a recognition task. In contrast, this paper addresses it as a decision-making task via a reinforcement learning framework. Specifically, PAR is formulated as a Markov decision process (MDP) by carefully designing the states, action space, reward function, and state transitions. To alleviate the inter-attribute imbalance problem, we apply an Attribute Grouping Strategy (AGS) that divides all attributes into subgroups according to their region and category information. We then employ an agent, trained with the Deep Q-learning algorithm, to recognize each group of attributes. We also propose a Group Optimization Reward (GOR) function to alleviate the intra-attribute imbalance problem. Experimental results on three benchmark datasets, PETA, RAP, and PA100K, demonstrate the effectiveness and competitiveness of the proposed approach and show that applying reinforcement learning to PAR is a promising research direction.
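To make the formulation above concrete, the following is a minimal sketch, not the authors' implementation: an agent conditioned on an image feature and a group index picks a joint binary prediction for that group's attributes, and a frequency-weighted reward favours correct decisions on rare attribute values. The group names, network size, and exact reward weighting are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of framing grouped attribute
# recognition as an MDP solved with Q-learning. Group names, the reward
# weighting, and the tiny Q-network below are illustrative assumptions.
import torch
import torch.nn as nn

ATTRIBUTE_GROUPS = {            # hypothetical region/category grouping
    "head":  ["hat", "glasses"],
    "upper": ["long_sleeve", "logo"],
    "lower": ["trousers", "skirt"],
}

class GroupQNet(nn.Module):
    """Q-network: state = image feature + one-hot group id,
    action = one binary prediction pattern for that group's attributes."""
    def __init__(self, feat_dim, n_groups, n_actions):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + n_groups, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )
    def forward(self, feat, group_onehot):
        return self.mlp(torch.cat([feat, group_onehot], dim=-1))

def group_reward(pred, target, pos_freq):
    # Reward correct decisions, weighting rare (low-frequency) attribute values
    # more heavily to counter intra-attribute imbalance (an assumption about
    # how a group-level optimization reward could look).
    weight = torch.where(target == 1, 1.0 / pos_freq, 1.0 / (1 - pos_freq))
    correct = (pred == target).float()
    return (weight * (2 * correct - 1)).mean()

# toy usage: one decision step for the "upper" group
feat = torch.randn(1, 512)                               # image feature (state)
group_onehot = torch.tensor([[0.0, 1.0, 0.0]])           # group id for "upper"
n_actions = 2 ** len(ATTRIBUTE_GROUPS["upper"])          # all binary patterns
qnet = GroupQNet(512, 3, n_actions)
action = qnet(feat, group_onehot).argmax(dim=-1)         # greedy action
pred = torch.tensor([[(action.item() >> i) & 1 for i in range(2)]]).float()
target = torch.tensor([[1.0, 0.0]])
pos_freq = torch.tensor([[0.2, 0.4]])                    # attribute label frequencies
print(group_reward(pred, target, pos_freq))
```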
{"title":"Reinforced Pedestrian Attribute Recognition with Group Optimization Reward","authors":"Zhong Ji, Zhenfei Hu, Yaodong Wang, Shengjia Li","doi":"10.48550/arXiv.2205.14042","DOIUrl":"https://doi.org/10.48550/arXiv.2205.14042","url":null,"abstract":"Pedestrian Attribute Recognition (PAR) is a challenging task in intelligent video surveillance. Two key challenges in PAR include complex alignment relations between images and attributes, and imbalanced data distribution. Existing approaches usually formulate PAR as a recognition task. Different from them, this paper addresses it as a decision-making task via a reinforcement learning framework. Specifically, PAR is formulated as a Markov decision process (MDP) by designing ingenious states, action space, reward function and state transition. To alleviate the inter-attribute imbalance problem, we apply an Attribute Grouping Strategy (AGS) by dividing all attributes into subgroups according to their region and category information. Then we employ an agent to recognize each group of attributes, which is trained with Deep Q-learning algorithm. We also propose a Group Optimization Reward (GOR) function to alleviate the intra-attribute imbalance problem. Experimental results on the three benchmark datasets of PETA, RAP and PA100K illustrate the effectiveness and competitiveness of the proposed approach and demonstrate that the application of reinforcement learning to PAR is a valuable research direction.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82958034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast, accurate and robust registration of multiple depth sensors without need for RGB and IR images
Pub Date: 2022-05-17  DOI: 10.1007/s00371-022-02505-2
Andre Mühlenbrock, Roland Fischer, Christoph Schröder‐Dering, René Weller, G. Zachmann
{"title":"Fast, accurate and robust registration of multiple depth sensors without need for RGB and IR images","authors":"Andre Mühlenbrock, Roland Fischer, Christoph Schröder‐Dering, René Weller, G. Zachmann","doi":"10.1007/s00371-022-02505-2","DOIUrl":"https://doi.org/10.1007/s00371-022-02505-2","url":null,"abstract":"","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72984378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continual learning on 3D point clouds with random compressed rehearsal
Pub Date: 2022-05-16  DOI: 10.48550/arXiv.2205.08013
M. Zamorski, Michal Stypulkowski, Konrad Karanowski, Tomasz Trzciński, Maciej Zięba
Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are an important data type for precisely modeling three-dimensional environments, but processing this type of data effectively remains challenging. In a world of large, heavily parameterized network architectures and continuously streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. In 3D computer vision, however, there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We exploit point cloud structure properties to preserve a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods during learning, our approach achieves a significant decrease in catastrophic forgetting compared to existing solutions on several of the most popular point cloud datasets, under two continual learning settings: one in which the task is known beforehand, and the more challenging scenario in which task information is unknown to the model.
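As an illustration of the rehearsal idea described above, here is a minimal sketch under the assumption that past point clouds are compressed by random point subsampling and replayed alongside new-task batches; the buffer size, subset size, and eviction policy are arbitrary choices, not the paper's.

```python
# Minimal sketch (an assumption, not the paper's code) of rehearsal with
# randomly compressed point clouds: past clouds are stored as small random
# subsets of their points and replayed alongside new-task batches.
import numpy as np

class CompressedRehearsalBuffer:
    def __init__(self, keep_points=128, capacity=200):
        self.keep_points = keep_points
        self.capacity = capacity
        self.clouds, self.labels = [], []

    def add(self, cloud, label):
        # cloud: (N, 3) array; keep a random subset of points as the
        # compressed memory of this sample.
        idx = np.random.choice(len(cloud), self.keep_points, replace=False)
        if len(self.clouds) >= self.capacity:            # overwrite a random slot when full
            slot = np.random.randint(self.capacity)
            self.clouds[slot], self.labels[slot] = cloud[idx], label
        else:
            self.clouds.append(cloud[idx]); self.labels.append(label)

    def sample(self, batch_size):
        idx = np.random.randint(len(self.clouds), size=batch_size)
        return [self.clouds[i] for i in idx], [self.labels[i] for i in idx]

# toy usage: store compressed versions of old-task clouds, then replay them
buffer = CompressedRehearsalBuffer()
for _ in range(10):
    buffer.add(np.random.randn(2048, 3), label=0)        # old-task sample
replay_clouds, replay_labels = buffer.sample(4)           # mix into a new-task batch
print(replay_clouds[0].shape, replay_labels[:2])
```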
{"title":"Continual learning on 3D point clouds with random compressed rehearsal","authors":"M. Zamorski, Michal Stypulkowski, Konrad Karanowski, Tomasz Trzci'nski, Maciej Ziȩba","doi":"10.48550/arXiv.2205.08013","DOIUrl":"https://doi.org/10.48550/arXiv.2205.08013","url":null,"abstract":"Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are important datatype for precise modeling of three-dimensional environments, but effective processing of this type of data proves to be challenging. In the world of large, heavily-parameterized network architectures and continuously-streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. However, in 3D computer vision, there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We utilize point cloud structure properties for preserving a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods of the learning process, our approach achieves a significant decrease of catastrophic forgetting compared to the existing solutions on several most popular point cloud datasets considering two continual learning settings: when a task is known beforehand, and in the challenging scenario of when task information is unknown to the model.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76712660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection
Pub Date: 2022-05-05  DOI: 10.48550/arXiv.2205.02717
Mingdong Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Limin Wang
Temporal action detection (TAD) is extensively studied in the video understanding community, generally following the object detection pipeline used for images. However, complex designs are common in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet essential baseline, given the current state of complex designs and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate existing techniques for each component of this baseline and, more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of the design. As a result, the simple BasicTAD yields an astounding, real-time, RGB-only baseline that comes very close to state-of-the-art methods with two-stream inputs. In addition, we further improve BasicTAD by preserving more temporal and spatial information in the network representation (termed PlusTAD). Empirical results demonstrate that PlusTAD is very efficient and significantly outperforms previous methods on the THUMOS14 and FineAction datasets. We also perform in-depth visualization and error analysis of the proposed method to provide further insights into the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.
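The pipeline decomposition can be pictured with a toy end-to-end model; the sketch below is an illustrative assumption, not the BasicTAD code: sampled RGB frames pass through a backbone, a temporal neck, and per-timestep classification and boundary-regression heads.

```python
# Minimal sketch (an illustrative assumption, not the BasicTAD code) of the
# decomposition described above: frame sampling -> backbone -> temporal neck
# -> detection head, trained end to end on RGB frames only.
import torch
import torch.nn as nn

class TinyTADPipeline(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.backbone = nn.Conv3d(3, 64, kernel_size=(1, 7, 7),
                                  stride=(1, 4, 4), padding=(0, 3, 3))
        self.neck = nn.Conv1d(64, 128, kernel_size=3, padding=1)    # temporal context
        self.cls_head = nn.Conv1d(128, num_classes, kernel_size=1)  # per-snippet class scores
        self.reg_head = nn.Conv1d(128, 2, kernel_size=1)            # start/end offsets

    def forward(self, clip):                  # clip: (B, 3, T, H, W) sampled RGB frames
        feat = self.backbone(clip).mean(dim=(-1, -2))               # (B, 64, T) after spatial pooling
        feat = torch.relu(self.neck(feat))                          # (B, 128, T)
        return self.cls_head(feat), self.reg_head(feat)             # per-timestep predictions

clip = torch.randn(2, 3, 32, 112, 112)        # data sampling: 32 RGB frames per clip
cls_logits, boundaries = TinyTADPipeline()(clip)
print(cls_logits.shape, boundaries.shape)     # (2, 20, 32), (2, 2, 32)
```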
{"title":"BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection","authors":"Mingdong Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Limin Wang","doi":"10.48550/arXiv.2205.02717","DOIUrl":"https://doi.org/10.48550/arXiv.2205.02717","url":null,"abstract":"Temporal action detection (TAD) is extensively studied in the video understanding community by generally following the object detection pipeline in images. However, complex designs are not uncommon in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet must-known baseline given the current status of complex design and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of design. As a result, this simple BasicTAD yields an astounding and real-time RGB-Only baseline very close to the state-of-the-art methods with two-stream inputs. In addition, we further improve the BasicTAD by preserving more temporal and spatial information in network representation (termed as PlusTAD). Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction. Meanwhile, we also perform in-depth visualization and error analysis on our proposed method and try to provide more insights on the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74452018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantically Accurate Super-Resolution Generative Adversarial Networks
Pub Date: 2022-05-01  DOI: 10.48550/arXiv.2205.08659
Tristan Frizza, D. Dansereau, Nagita Mehr Seresht, M. Bewley
This work addresses the problems of semantic segmentation and image super-resolution by jointly considering the performance of both when training a Generative Adversarial Network (GAN). We propose a novel architecture and domain-specific feature loss, allowing super-resolution to operate as a pre-processing step that increases the performance of downstream computer vision tasks, specifically semantic segmentation. We demonstrate this approach using Nearmap’s aerial imagery dataset, which covers hundreds of urban areas at 5-7 cm per pixel resolution. We show that the proposed approach improves perceived image quality as well as quantitative segmentation accuracy across all prediction classes, yielding average accuracy improvements of 11.8% and 108% at 4× and 32× super-resolution, respectively, compared with state-of-the-art single-network methods. This work demonstrates that jointly considering image-based and task-specific losses can improve the performance of both, and advances the state of the art in semantic-aware super-resolution of aerial imagery.
[Table 1 caption: A comparison of three potential generator model architectures for 4× super-resolution; RRDN was chosen for all subsequent experiments due to its superior overall performance on pixel-wise loss.]
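The joint-training recipe can be sketched as follows; this is an assumption about the general pattern, not the paper's architecture or loss, and the adversarial term of the GAN is omitted for brevity: the super-resolved output is fed to a segmentation network, and both the image-based loss and the task-specific segmentation loss backpropagate into the generator.

```python
# Minimal sketch (an assumption about the general recipe, not the paper's
# implementation) of coupling a super-resolution generator with a downstream
# segmentation loss so both objectives shape the generator's weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

upscale = 4
generator = nn.Sequential(                    # stand-in SR generator
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3 * upscale ** 2, 3, padding=1), nn.PixelShuffle(upscale),
)
segmenter = nn.Conv2d(3, 5, 1)                # stand-in segmentation network (5 classes)

def joint_loss(lr, hr, seg_target, w_seg=0.1):
    sr = generator(lr)
    pixel = F.l1_loss(sr, hr)                                     # image-based loss
    seg = F.cross_entropy(segmenter(sr), seg_target)              # task-specific loss
    return pixel + w_seg * seg

lr = torch.randn(2, 3, 32, 32)
hr = torch.randn(2, 3, 128, 128)
seg_target = torch.randint(0, 5, (2, 128, 128))
loss = joint_loss(lr, hr, seg_target)
loss.backward()                                # gradients reach the generator from both losses
print(float(loss))
```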
{"title":"Semantically Accurate Super-Resolution Generative Adversarial Networks","authors":"Tristan Frizza, D. Dansereau, Nagita Mehr Seresht, M. Bewley","doi":"10.48550/arXiv.2205.08659","DOIUrl":"https://doi.org/10.48550/arXiv.2205.08659","url":null,"abstract":"This work addresses the problems of semantic segmentation and image super-resolution by jointly considering the performance of both in training a Generative Adversarial Network (GAN). We propose a novel architecture and domain-specific feature loss, allowing super-resolution to operate as a pre-processing step to increase the performance of downstream computer vision tasks, specifically semantic segmentation. We demonstrate this approach using Nearmap’s aerial imagery dataset which covers hundreds of urban areas at 5-7 cm per pixel resolution. We show the proposed approach improves perceived image quality as well as quantitative segmentation accuracy across all prediction classes, yielding an average accuracy improvement of 11.8% and 108% at 4 × and 32 × super-resolution, compared with state-of-the art single-network methods. This work demonstrates that jointly considering image-based and task-specific losses can improve the performance of both, and advances the state-of-the-art in semantic-aware super-resolution of aerial imagery. 1: A comparison of of three potential generator model architec- tures for 4 × super-resolution. We chose RRDN for all subsequent ex-periments due to its superior overall performance on pixel-wise loss","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74313737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end weakly-supervised single-stage multiple 3D hand mesh reconstruction from a single RGB image
Pub Date: 2022-04-18  DOI: 10.2139/ssrn.4199294
Jinwei Ren, Jianke Zhu, Jialiang Zhang
In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage way; the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, we propose, for the first time, a concise but efficient single-stage pipeline for multi-hand reconstruction. Specifically, we design a multi-head auto-encoder structure in which each head network shares the same feature map and outputs the hand center, pose, and texture, respectively. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotations. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated from publicly available single-hand datasets. To further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single- and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M, and RHD demonstrate that our method outperforms state-of-the-art model-based methods in both the weakly-supervised and fully-supervised settings. The code and models are available at https://github.com/zijinxuxu/SMHR.
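A minimal sketch of the single-stage, multi-head idea described above (an assumption, not the released SMHR code): several lightweight heads share one backbone feature map and predict a hand-center heatmap plus per-location pose and texture parameters, so all hands are recovered in one pass without per-hand crops.

```python
# Minimal sketch (an assumption, not the released SMHR code) of a shared
# feature map feeding several heads that predict hand centers, pose, and
# texture parameters in a single stage.
import torch
import torch.nn as nn

class MultiHeadHandDecoder(nn.Module):
    def __init__(self, feat_ch=256, n_pose=48, n_tex=10):
        super().__init__()
        self.center_head = nn.Conv2d(feat_ch, 1, 1)         # per-pixel hand-center heatmap
        self.pose_head = nn.Conv2d(feat_ch, n_pose, 1)      # pose parameters at each location
        self.tex_head = nn.Conv2d(feat_ch, n_tex, 1)        # texture parameters at each location

    def forward(self, feat):                 # feat: (B, C, H, W) shared backbone feature map
        return (torch.sigmoid(self.center_head(feat)),
                self.pose_head(feat),
                self.tex_head(feat))

feat = torch.randn(1, 256, 32, 32)
centers, pose, tex = MultiHeadHandDecoder()(feat)
# Each detected center peak indexes the pose/texture predictions at the same
# location, so every hand in the image is decoded from one shared feature map.
print(centers.shape, pose.shape, tex.shape)   # (1,1,32,32) (1,48,32,32) (1,10,32,32)
```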
{"title":"End-to-end weakly-supervised single-stage multiple 3D hand mesh reconstruction from a single RGB image","authors":"Jinwei Ren, Jianke Zhu, Jialiang Zhang","doi":"10.2139/ssrn.4199294","DOIUrl":"https://doi.org/10.2139/ssrn.4199294","url":null,"abstract":"In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single hand reconstruction or solve this problem in a multi-stage way. Moreover, the conventional two-stage pipeline firstly detects hand areas, and then estimates 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, for the first time, we propose a concise but efficient single-stage pipeline for multi-hand reconstruction. Specifically, we design a multi-head auto-encoder structure, where each head network shares the same feature map and outputs the hand center, pose and texture, respectively. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotations. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated based on the publicly available single hand datasets. In order to further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single and multiple hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD demonstrate that our method outperforms the state-of-the-art model-based methods in both weakly-supervised and fully-supervised manners. The code and models are available at {https://github.com/zijinxuxu/SMHR}.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73698711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video Captioning: a comparative review of where we are and which could be the route
Pub Date: 2022-04-12  DOI: 10.48550/arXiv.2204.05976
Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez
Video captioning is the process of describing the content of a sequence of images while capturing its semantic relationships and meanings. This task is arduous even for a single image, let alone for a video (or image sequence). The applications of video captioning are numerous and relevant, ranging from handling the large volume of recordings produced by video surveillance to assisting people with visual impairments, to mention a few. To analyze where our community's efforts to solve the video captioning task stand, as well as which route could be better to follow, this manuscript presents an extensive review of more than 105 papers from 2016 to 2021. As a result, the most-used datasets and metrics are identified, along with the main approaches and the best-performing ones. We compute a set of rankings based on several performance metrics to determine which method achieves the best results on the video captioning task. Finally, we draw some insights about the next steps and opportunity areas for improving work on this complex task.
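The ranking computation mentioned above can be illustrated with a small sketch; the method names and scores are hypothetical, and the aggregation rule (average rank across metrics) is an assumption about how such a comparison might be done.

```python
# Minimal sketch (illustrative assumption) of aggregating per-metric rankings
# into an overall ranking of captioning methods, as a survey might do.
methods = {                      # hypothetical scores: higher is better
    "MethodA": {"BLEU-4": 0.42, "METEOR": 0.29, "CIDEr": 0.51},
    "MethodB": {"BLEU-4": 0.45, "METEOR": 0.27, "CIDEr": 0.49},
    "MethodC": {"BLEU-4": 0.40, "METEOR": 0.30, "CIDEr": 0.55},
}

def average_rank(methods):
    metrics = next(iter(methods.values())).keys()
    ranks = {name: [] for name in methods}
    for metric in metrics:
        ordered = sorted(methods, key=lambda m: methods[m][metric], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            ranks[name].append(rank)
    # lower average rank = better overall standing across all metrics
    return sorted((sum(r) / len(r), name) for name, r in ranks.items())

for avg, name in average_rank(methods):
    print(f"{name}: average rank {avg:.2f}")
```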
{"title":"Video Captioning: a comparative review of where we are and which could be the route","authors":"Daniela Moctezuma, Tania Ram'irez-delReal, Guillermo Ruiz, Oth'on Gonz'alez-Ch'avez","doi":"10.48550/arXiv.2204.05976","DOIUrl":"https://doi.org/10.48550/arXiv.2204.05976","url":null,"abstract":"Video captioning is the process of describing the content of a sequence of images capturing its semantic relationships and meanings. Dealing with this task with a single image is arduous, not to mention how difficult it is for a video (or images sequence). The amount and relevance of the applications of video captioning are vast, mainly to deal with a significant amount of video recordings in video surveillance, or assisting people visually impaired, to mention a few. To analyze where the efforts of our community to solve the video captioning task are, as well as what route could be better to follow, this manuscript presents an extensive review of more than 105 papers for the period of 2016 to 2021. As a result, the most-used datasets and metrics are identified. Also, the main approaches used and the best ones. We compute a set of rankings based on several performance metrics to obtain, according to its performance, the best method with the best result on the video captioning task. Finally, some insights are concluded about which could be the next steps or opportunity areas to improve dealing with this complex task.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86286089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}