Skeleton-based action recognition has received considerable attention and achieved remarkable progress in the field of human action recognition. For action prediction over time series of different scales, existing methods mainly rely on attention mechanisms to enhance modelling capability in the spatial dimension. However, this approach depends strongly on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors design a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales while avoiding the neglect of cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn topological graph features for time series of different lengths. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU RGB+D 60 and 120, as well as UAV-Human, demonstrate that TRMGCN achieves advanced performance. Furthermore, experiments on the smaller dataset NW-UCLA indicate that the authors' model possesses strong generalisation ability.
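As an illustration only, the following minimal PyTorch sketch shows a generic multi-scale temporal convolution with a channel gate, the broad pattern behind fusing temporal information at several scales in skeleton data; the class name, kernel sizes and gating design are assumptions, not the authors' TCFG module.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class MultiScaleTemporalFusion(nn.Module):
        """Fuses temporal features at several kernel scales, then re-weights channels."""
        def __init__(self, channels, scales=(3, 5, 7)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(channels, channels, kernel_size=(k, 1),
                          padding=(k // 2, 0), groups=channels)
                for k in scales
            ])
            # Squeeze-and-excitation style gate re-weighting channels of the fused map.
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
            )

        def forward(self, x):            # x: (N, C, T, V) = batch, channels, frames, joints
            fused = sum(b(x) for b in self.branches) / len(self.branches)
            return fused * self.gate(fused)

    x = torch.randn(2, 64, 30, 25)       # 25 skeleton joints, 30 frames
    print(MultiScaleTemporalFusion(64)(x).shape)   # torch.Size([2, 64, 30, 25])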
Instance segmentation remains challenging when target objects are overlapping, dense and numerous, making it difficult to distinguish individual instances correctly. To address this, the authors recast the instance segmentation problem as an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm, CotuNet. Firstly, the algorithm combines convolutional neural networks (CNNs), Outlooker and Transformer to design a new hybrid encoder (COT) for richer feature extraction: a CNN extracts low-level image features, which are passed through the Outlooker to obtain more refined local representations, and global contextual information is then generated by aggregating these local representations with the Transformer. Finally, a combination of cascaded upsampling and skip-connection modules is used as the decoder (C-UP), blending high-resolution information at multiple scales to generate accurate masks. Validation on the CVPPP 2017 dataset and comparison with previous state-of-the-art methods show that CotuNet achieves superior competitiveness and segmentation performance.
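For illustration, a minimal sketch of one cascaded-upsampling decoder stage with a skip connection is given below; the block name, channel widths and use of a transposed convolution are assumptions about a generic decoder of this kind, not the authors' C-UP design.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        def __init__(self, in_ch, skip_ch, out_ch):
            super().__init__()
            self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
            self.fuse = nn.Sequential(
                nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            )

        def forward(self, x, skip):
            x = self.up(x)                        # double the spatial resolution
            x = torch.cat([x, skip], dim=1)       # blend higher-resolution encoder features
            return self.fuse(x)

    low = torch.randn(1, 256, 16, 16)             # deep encoder features
    skip = torch.randn(1, 128, 32, 32)            # earlier, higher-resolution features
    print(UpBlock(256, 128, 128)(low, skip).shape)   # torch.Size([1, 128, 32, 32])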
Person re-identification aims to search for specific target pedestrians across non-intersecting cameras. However, in real complex scenes, pedestrians are easily occluded, which makes the target pedestrian search task time-consuming and challenging. To address the susceptibility of pedestrians to occlusion, a person re-identification method via a deep compound eye network (CEN) and a pose repair module is proposed, which includes: (1) a deep CEN based on multi-camera logical topology, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out global pedestrian matching through a Siamese network; (2) an integrated spatial-temporal information aggregation network designed to facilitate pose repair, in which the target pedestrian features under the multi-level logical topology cameras are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; (3) a joint optimisation mechanism of the CEN and the pose repair network, in which multi-camera logical topology inference provides auxiliary information and the retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the authors' method achieves strong performance across these datasets. Specifically, on the CUHK-SYSU dataset, the model achieves a top-1 accuracy of 89.1% and a mean Average Precision (mAP) of 83.1% in the recognition of occluded individuals.
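The sketch below illustrates, under assumed names and feature sizes, the generic pattern of summarising pedestrian features along a camera-topology path with a GRU and then matching by cosine similarity, Siamese-style; it is not the authors' CEN architecture.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopologySequenceEncoder(nn.Module):
        def __init__(self, feat_dim=512, hidden=256):
            super().__init__()
            self.gru = nn.GRU(feat_dim, hidden, batch_first=True)

        def forward(self, cam_feats):              # (N, num_cameras, feat_dim)
            _, h = self.gru(cam_feats)             # last hidden state summarises the walk
            return F.normalize(h.squeeze(0), dim=-1)

    enc = TopologySequenceEncoder()
    query = enc(torch.randn(4, 5, 512))            # 4 queries observed by 5 cameras
    gallery = enc(torch.randn(10, 5, 512))
    scores = query @ gallery.t()                   # cosine similarity for global matching
    print(scores.shape)                            # torch.Size([4, 10])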
Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a Local three-scale encoding (LTSE) structure. In particular, the authors introduce an LTSE structure with two-level attention cascades, a design tailored to efficiently capture details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCLs) and residual operations, designing a recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named the separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain three-scale decoding features from the decoder and warp them into a set of three-scale pre-warped maps, which are fused in the synthesis network to generate high-quality interpolated frames. Experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.
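As a rough illustration of the recurrent-plus-residual idea, the sketch below applies the same convolution for a few steps, re-injecting the input each step, then adds a residual connection; the step count, class name and channel sizes are assumptions, not the authors' exact unit.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class RecurrentResidualConv(nn.Module):
        def __init__(self, channels, steps=2):
            super().__init__()
            self.steps = steps
            self.conv = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            )

        def forward(self, x):
            h = self.conv(x)
            for _ in range(self.steps - 1):
                h = self.conv(x + h)     # recurrent refinement of the same feature map
            return x + h                 # residual connection

    x = torch.randn(1, 32, 64, 64)
    print(RecurrentResidualConv(32)(x).shape)   # torch.Size([1, 32, 64, 64])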
The authors present global-interval and local-continuous feature extraction networks for gait recognition. Unlike conventional gait recognition methods that focus on the full gait cycle, the authors introduce a novel global-continuous-dilated temporal feature extraction (TFE) module to extract continuous and interval motion features globally from the silhouette frames. Simultaneously, an inter-frame motion excitation (IME) module is proposed to enhance the unique motion expression of an individual, which remains unchanged regardless of clothing variations. The spatio-temporal features extracted from the TFE and IME modules are then weighted and concatenated by an adaptive aggregator network for recognition. In experiments on the CASIA-B and mini-OUMVLP datasets, the proposed method shows performance comparable to other state-of-the-art approaches (98%, 95%, and 84.9% in the normal walking, bag- or backpack-carrying, and coat- or jacket-wearing categories of CASIA-B, and 89% on mini-OUMVLP).
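The sketch below shows a common motion-excitation pattern in which frame-to-frame differences drive a sigmoid gate over the original features; it is only indicative of the general idea, and the class name, gating form and tensor layout are assumptions rather than the authors' IME module.

    # Minimal sketch (assumed names and layout), not the authors' implementation.
    import torch
    import torch.nn as nn

    class MotionExcitation(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.reduce = nn.Conv2d(channels, channels, 1)

        def forward(self, x):                      # x: (N, C, T, spatial)
            diff = x[:, :, 1:] - x[:, :, :-1]      # frame-to-frame motion
            diff = torch.cat([diff, diff[:, :, -1:]], dim=2)   # keep temporal length
            gate = torch.sigmoid(self.reduce(diff))
            return x + x * gate                    # excite motion-salient responses

    x = torch.randn(2, 64, 30, 16)
    print(MotionExcitation(64)(x).shape)           # torch.Size([2, 64, 30, 16])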
The authors propose a compression strategy for a transformer-based 3D human pose estimation model, which yields high accuracy but at the cost of a large model size. The approach involves a pruning-guided determination of the search range to achieve lightweight pose estimation under limited training time and to identify the optimal model size. In addition, the authors propose a transformer-based feature distillation (TFD) method, which efficiently exploits the pose estimation model in terms of both model size and accuracy by leveraging the characteristics of the transformer architecture. Pruning-guided TFD is the first compression approach for 3D human pose estimation that exploits the transformer architecture. The proposed approach was tested on various large-scale datasets, and the results show that it can reduce the model size by 30% compared with the state of the art while maintaining high accuracy.
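For orientation only, the sketch below shows a generic feature-distillation loss between a large teacher and a smaller (for example, pruned) student, with a linear projection aligning feature widths and an MSE objective; the projection and loss choice are assumptions, not the authors' TFD formulation.

    # Minimal sketch (assumed projection and loss), not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureDistillLoss(nn.Module):
        def __init__(self, student_dim, teacher_dim):
            super().__init__()
            self.proj = nn.Linear(student_dim, teacher_dim)   # align feature widths

        def forward(self, student_feat, teacher_feat):
            # student_feat: (B, tokens, student_dim), teacher_feat: (B, tokens, teacher_dim)
            return F.mse_loss(self.proj(student_feat), teacher_feat.detach())

    loss_fn = FeatureDistillLoss(student_dim=128, teacher_dim=256)
    loss = loss_fn(torch.randn(2, 17, 128), torch.randn(2, 17, 256))
    print(loss.item())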
Human–object interaction (HOI) detection, which localises and recognises interactions between humans and objects, requires high-level image and scene understanding. Recent methods for HOI detection typically utilise transformer-based architectures to build unified feature representations. However, these methods use randomly initialised queries to predict interactive human–object pairs, leading to a lack of prior knowledge. Furthermore, most methods forecast interactions from unified features using conventional decoder structures, and so lack the ability to build efficient multi-task representations. To address these problems, the authors propose a novel two-stage HOI detector called PGCD, mainly consisting of a prompt guidance query and cascaded constraint decoders. Firstly, the authors propose a novel prompt guidance query generation module (PGQ) to introduce guidance-semantic features; in PGQ, the authors build a visual-semantic transfer to obtain fuller semantic representations. In addition, a cascaded constraint decoder architecture (CD) with random masks is designed to build fine-grained interaction features and improve the model's generalisation performance. Experimental results demonstrate that the authors' proposed approach achieves significant performance on the two widely used benchmarks, HICO-DET and V-COCO.
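The sketch below illustrates the general contrast with randomly initialised queries: decoder queries are derived from semantic (prompt) embeddings and refined by a standard transformer decoder over image tokens; the embedding source, layer counts and names are assumptions, not the authors' PGQ or CD modules.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class PromptGuidedDecoder(nn.Module):
        def __init__(self, d_model=256, num_layers=3):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.to_query = nn.Linear(d_model, d_model)   # map prompt features to queries

        def forward(self, prompt_emb, image_tokens):
            # prompt_emb: (B, num_queries, d_model) semantic prior
            # image_tokens: (B, HW, d_model) encoded image features
            queries = self.to_query(prompt_emb)
            return self.decoder(queries, image_tokens)

    dec = PromptGuidedDecoder()
    out = dec(torch.randn(1, 64, 256), torch.randn(1, 400, 256))
    print(out.shape)                                      # torch.Size([1, 64, 256])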
Although existing object detectors achieve encouraging detection and localisation performance under ideal conditions, their performance in adverse weather, such as snow, is far poorer and insufficient for detection tasks under such conditions. Existing methods neither handle well the effect of snow on the identity of object features nor exploit, and often even discard, potential information that could help improve detection performance. To this end, the authors propose a novel, improved end-to-end object detection network coupled with image restoration. Specifically, to address the degradation of object identity caused by snow, an ingenious restoration-detection dual-branch network structure combined with a Multi-Integrated Attention module is proposed, which mitigates the effect of snow on the identity of object features and thus improves the detector's performance. To make more effective use of features that benefit the detection task, a Self-Adaptive Feature Fusion module is introduced; it helps the network better learn potential features beneficial to detection and, through a dedicated feature fusion, eliminates the effect of heavy or large local snow in the object area, thereby improving detection capability in snowy conditions. In addition, the authors construct a large-scale, multi-size snowy dataset called the Synthetic and Real Snowy Dataset (SRSD), which is a necessary complement to existing snow-related tasks. Extensive experiments on a public snowy dataset (Snowy-weather Datasets) and SRSD indicate that the authors' method outperforms existing state-of-the-art object detectors.
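As a generic illustration of fusing features from a restoration branch and a detection branch, the sketch below uses a learned per-pixel gate to weight the two contributions; the gating form, names and sizes are assumptions, not the authors' Self-Adaptive Feature Fusion module.

    # Minimal sketch (assumed gating form), not the authors' implementation.
    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid()
            )

        def forward(self, det_feat, restore_feat):
            g = self.gate(torch.cat([det_feat, restore_feat], dim=1))
            return g * det_feat + (1 - g) * restore_feat   # per-pixel weighted blend

    f1, f2 = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
    print(AdaptiveFusion(64)(f1, f2).shape)                # torch.Size([1, 64, 40, 40])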
Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods ignore the problem that simple semantic tags struggle to express the detailed semantics of different image contents. Therefore, the authors propose a tag-inferring and tag-guided Transformer for image captioning to generate fine-grained captions. First, a tag-inferring encoder is proposed, which uses the tags extracted by a scene graph model to infer tags with deeper semantic information. Then, with the obtained deep tag information, a tag-guided decoder is proposed that includes short-term attention, to improve the features of the words in the sentence, and gated cross-modal attention, to combine image features, tag features and language features into informative semantic features. Finally, the word probability distribution at every position in the sequence is calculated to generate descriptions of the image. Experiments demonstrate that the authors' method can combine tags to obtain precise captions, achieving competitive performance with a 40.6% BLEU-4 score and a 135.3% CIDEr score on the MSCOCO dataset.
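The sketch below shows one common way to realise gated cross-modal attention: word features attend separately to image features and tag features, and a learned gate mixes the two attended contexts; head counts, dimensions and the gating form are assumptions, not necessarily the authors' design.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class GatedCrossModalAttention(nn.Module):
        def __init__(self, d_model=512, nhead=8):
            super().__init__()
            self.img_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.tag_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

        def forward(self, words, img_feats, tag_feats):
            img_ctx, _ = self.img_attn(words, img_feats, img_feats)
            tag_ctx, _ = self.tag_attn(words, tag_feats, tag_feats)
            g = self.gate(torch.cat([img_ctx, tag_ctx], dim=-1))
            return g * img_ctx + (1 - g) * tag_ctx         # gated mix of the two contexts

    attn = GatedCrossModalAttention()
    out = attn(torch.randn(1, 12, 512), torch.randn(1, 49, 512), torch.randn(1, 5, 512))
    print(out.shape)                                        # torch.Size([1, 12, 512])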