Junjie Li, Shengli Du, Jianfeng Liu, Weibiao Chen, Manfu Tang, Lei Zheng, Lianfa Wang, Chunle Ji, Xiao Yu, Wanli Yu
In recent years, contrastive language-image pre-training (CLIP) has gained popularity for processing 2D data. However, the application of cross-modal transfer learning to 3D data remains relatively unexplored. In addition, high-quality, labelled point cloud data for Mechanical, Electrical, and Plumbing (MEP) scenarios are in short supply. To address these issues, the authors introduce a novel object detection system that takes 3D point clouds, 2D camera images, and text descriptions as input, using image-text matching knowledge to guide dense detection models for 3D point clouds in MEP environments. Specifically, the authors propose a language-guided point cloud modelling (PCM) module that leverages the shared image weights of the CLIP backbone to generate category information relevant to the target, thereby improving 3D point cloud object detection. Extensive experiments show that the proposed point cloud detection system with the PCM module performs comparably to current state-of-the-art networks, with improvements of 5.64% and 2.9% on KITTI and SUN-RGBD, respectively. Similarly strong detection results are obtained on the authors' proposed MEP scene dataset.
{"title":"Language guided 3D object detection in point clouds for MEP scenes","authors":"Junjie Li, Shengli Du, Jianfeng Liu, Weibiao Chen, Manfu Tang, Lei Zheng, Lianfa Wang, Chunle Ji, Xiao Yu, Wanli Yu","doi":"10.1049/cvi2.12261","DOIUrl":"10.1049/cvi2.12261","url":null,"abstract":"<p>In recent years, contrastive language-image pre-training (CLIP) has gained popularity for processing 2D data. However, the application of cross-modal transferable learning to 3D data remains a relatively unexplored area. In addition, high-quality, labelled point cloud data for Mechanical, Electrical, and Plumbing (MEP) scenarios are in short supply. To address this issue, the authors introduce a novel object detection system that employs 3D point clouds and 2D camera images, as well as text descriptions as input, using image-text matching knowledge to guide dense detection models for 3D point clouds in MEP environments. Specifically, the authors put forth the proposition of a language-guided point cloud modelling (PCM) module, which leverages the shared image weights inherent in the CLIP backbone. This is done with the aim of generating pertinent category information for the target, thereby augmenting the efficacy of 3D point cloud target detection. After sufficient experiments, the proposed point cloud detection system with the PCM module is proven to have a comparable performance with current state-of-the-art networks. The approach has 5.64% and 2.9% improvement in KITTI and SUN-RGBD, respectively. In addition, the same good detection results are obtained in their proposed MEP scene dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"526-539"},"PeriodicalIF":1.7,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12261","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139007927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The authors design a novel convolutional network architecture, a deep network with double reuses and convolutional shortcuts, built from new compressed reuse units. Each compressed reuse unit combines the reused features from its first 3 × 3 convolutional layer with the features from its last 3 × 3 convolutional layer to produce new feature maps, simultaneously reuses the feature maps from all previous compressed reuse units to generate a shortcut via a 1 × 1 convolution, and then concatenates these new maps with this shortcut as the input to the next compressed reuse unit. The network uses the concatenation of reused features from all compressed reuse units as the final features for classification. The inner- and outer-unit feature reuses, together with the convolutional shortcut compressed from the previous outer-unit feature reuses, alleviate the vanishing-gradient problem by strengthening forward feature propagation inside and outside the units, improve the effectiveness of features, and reduce computational cost. Experimental results on the CIFAR-10, CIFAR-100, ImageNet ILSVRC 2012, Pascal VOC2007 and MS COCO benchmarks demonstrate the effectiveness of the authors' architecture for object recognition and detection compared with the state-of-the-art.
{"title":"Deep network with double reuses and convolutional shortcuts","authors":"Qian Liu, Cunbao Wang","doi":"10.1049/cvi2.12260","DOIUrl":"10.1049/cvi2.12260","url":null,"abstract":"<p>The authors design a novel convolutional network architecture, that is, deep network with double reuses and convolutional shortcuts, in which new compressed reuse units are presented. Compressed reuse units combine the reused features from the first 3 × 3 convolutional layer and the features from the last 3 × 3 convolutional layer to produce new feature maps in the current compressed reuse unit, simultaneously reuse the feature maps from all previous compressed reuse units to generate a shortcut by an 1 × 1 convolution, and then concatenate these new maps and this shortcut as the input to next compressed reuse unit. Deep network with double reuses and convolutional shortcuts uses the feature reuse concatenation from all compressed reuse units as the final features for classification. In deep network with double reuses and convolutional shortcuts, the inner- and outer-unit feature reuses and the convolutional shortcut compressed from the previous outer-unit feature reuses can alleviate the vanishing-gradient problem by strengthening the forward feature propagation inside and outside the units, improve the effectiveness of features and reduce calculation cost. Experimental results on CIFAR-10, CIFAR-100, ImageNet ILSVRC 2012, Pascal VOC2007 and MS COCO benchmark databases demonstrate the effectiveness of authors’ architecture for object recognition and detection, as compared with the state-of-the-art.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"512-525"},"PeriodicalIF":1.7,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12260","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138585472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In industrial manufacturing, accurately classifying defective products and localising their defects has always been a concern. Previous studies mainly measured similarity based on single-scale features extracted from samples. However, features at a single scale can hardly represent anomalies of different sizes and types. Therefore, the authors propose a set of memory banks of multi-scale features (MBMF) to enrich feature representation and to detect and locate various anomalies. To extract features at different scales, different aggregation functions are designed to produce feature maps at different granularities. The MBMF are constructed from the multi-scale features of normal samples. Meanwhile, to better adapt to the feature distribution of the training samples, the authors propose a new iterative updating method for the memory banks. Tested on the widely used and challenging MVTec AD dataset, the proposed MBMF achieves competitive image-level anomaly detection performance (image-level Area Under the Receiver Operating Characteristic curve, AUROC) and pixel-level anomaly segmentation performance (pixel-level AUROC). To further evaluate the generalisation of the proposed method, the authors also perform anomaly detection on the BeanTech AD dataset, commonly used in anomaly detection, and the Fashion-MNIST dataset, widely used in image classification. The experimental results further verify the effectiveness of the proposed method.
{"title":"MBMF: Constructing memory banks of multi-scale features for anomaly detection","authors":"Yanfeng Sun, Haitao Wang, Yongli Hu, Huajie Jiang, Baocai Yin","doi":"10.1049/cvi2.12258","DOIUrl":"10.1049/cvi2.12258","url":null,"abstract":"<p>In industrial manufacturing, how to accurately classify defective products and locate the location of defects has always been a concern. Previous studies mainly measured similarity based on extracting single-scale features of samples. However, only using the features of a single scale is hard to represent different sizes and types of anomalies. Therefore, the authors propose a set of memory banks of multi-scale features (MBMF) to enrich feature representation and detect and locate various anomalies. To extract features of different scales, different aggregation functions are designed to produce the feature maps at different granularity. Based on the multi-scale features of normal samples, the MBMF are constructed. Meanwhile, to better adapt to the feature distribution of the training samples, the authors proposed a new iterative updating method for the memory banks. Testing on the widely used and challenging dataset of MVTec AD, the proposed MBMF achieves competitive image-level anomaly detection performance (Image-level Area Under the Receiver Operator Curve (AUROC)) and pixel-level anomaly segmentation performance (Pixel-level AUROC). To further evaluate the generalisation of the proposed method, we also implement anomaly detection on the BeanTech AD dataset, a commonly used dataset in the field of anomaly detection, and the Fashion-MNIST dataset, a widely used dataset in the field of image classification. The experimental results also verify the effectiveness of the proposed method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"355-369"},"PeriodicalIF":1.7,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12258","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138612082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junjie Wen, Jie Ma, Yuehua Zhao, Tong Nie, Mengxuan Sun, Ziming Fan
Semantic segmentation of three-dimensional point clouds is vital in autonomous driving, computer vision, and augmented reality. However, current semantic segmentation methods do not effectively use the point cloud's local geometric features and contextual information, which are essential for improving segmentation accuracy. A semantic segmentation network that uses local feature fusion and a multilayer attention mechanism is proposed to address these challenges. Specifically, the authors designed a local feature fusion module that encodes geometric and feature information separately, fully leveraging the point cloud's feature perception and geometric structure representation. Furthermore, the authors designed a multilayer attention pooling module consisting of local attention pooling and cascade attention pooling to extract contextual information: local attention pooling learns local neighbourhood information, while cascade attention pooling captures contextual information from deeper local neighbourhoods. Finally, an enhanced feature representation of important information is obtained by aggregating the features from the two attention pooling methods. Extensive experiments on the large-scale point cloud datasets Stanford Large-Scale 3D Indoor Spaces (S3DIS) and SemanticKITTI indicate that the authors' network offers clear advantages over existing representative methods in local geometric feature description and global contextual modelling.
{"title":"Point cloud semantic segmentation based on local feature fusion and multilayer attention network","authors":"Junjie Wen, Jie Ma, Yuehua Zhao, Tong Nie, Mengxuan Sun, Ziming Fan","doi":"10.1049/cvi2.12255","DOIUrl":"10.1049/cvi2.12255","url":null,"abstract":"<p>Semantic segmentation from a three-dimensional point cloud is vital in autonomous driving, computer vision, and augmented reality. However, current semantic segmentation does not effectively use the point cloud's local geometric features and contextual information, essential for improving segmentation accuracy. A semantic segmentation network that uses local feature fusion and a multilayer attention mechanism is proposed to address these challenges. Specifically, the authors designed a local feature fusion module to encode the geometric and feature information separately, which fully leverages the point cloud's feature perception and geometric structure representation. Furthermore, the authors designed a multilayer attention pooling module consisting of local attention pooling and cascade attention pooling to extract contextual information. Local attention pooling is used to learn local neighbourhood information, and cascade attention pooling captures contextual information from deeper local neighbourhoods. Finally, an enhanced feature representation of important information is obtained by aggregating the features from the two deep attention pooling methods. Extensive experiments on large-scale point-cloud datasets Stanford 3D large-scale indoor spaces and SemanticKITTI indicate that authors network shows excellent advantages over existing representative methods regarding local geometric feature description and global contextual relationships.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"381-392"},"PeriodicalIF":1.7,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12255","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139233156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chunyun Meng, Ernest Domanaanmwi Ganaa, Bin Wu, Zhen Tan, Li Luan
In real-world scenarios, pedestrian images often suffer from occlusion, where certain body features become invisible, making it challenging for existing methods to accurately identify pedestrians with the same ID. Traditional approaches typically focus on matching only the visible body parts, which can lead to misalignment when the occlusion patterns vary. To address this issue and alleviate misalignment in occluded pedestrian images, the authors propose a novel framework called body topology information generation and matching, which consists of two main modules: a body topology information generation module and a body topology information matching module. The generation module employs an adaptive detection mechanism and a capsule generative adversarial network to restore a holistic pedestrian image while preserving the body topology information. The matching module leverages the restored holistic image to overcome spatial misalignment and uses cosine distance as the similarity measure for matching. By combining the two modules, the authors achieve consistency of the body topology information features of pedestrian images from restoration to retrieval. Extensive experiments are conducted on both holistic person re-identification datasets (Market-1501, DukeMTMC-ReID) and occluded person re-identification datasets (Occluded-DukeMTMC, Occluded-ReID). The results demonstrate the superior performance of the authors' proposed model, and visualisations of the generation and matching modules illustrate their effectiveness. Furthermore, an ablation study validates the contributions of the proposed framework.
{"title":"Anti-occlusion person re-identification via body topology information restoration and similarity evaluation","authors":"Chunyun Meng, Ernest Domanaanmwi Ganaa, Bin Wu, Zhen Tan, Li Luan","doi":"10.1049/cvi2.12256","DOIUrl":"10.1049/cvi2.12256","url":null,"abstract":"<p>In real-world scenarios, pedestrian images often suffer from occlusion, where certain body features become invisible, making it challenging for existing methods to accurately identify pedestrians with the same ID. Traditional approaches typically focus on matching only the visible body parts, which can lead to misalignment when the occlusion patterns vary. To address this issue and alleviate misalignment in occluded pedestrian images, the authors propose a novel framework called body topology information generation and matching. The framework consists of two main modules: the body topology information generation module and the body topology information matching module. The body topology information generation module employs an adaptive detection mechanism and capsule generative adversarial network to restore a holistic pedestrian image while preserving the body topology information. The body topology information matching module leverages the restored holistic image from body topology information generation to overcome spatial misalignment and utilises cosine distance as the similarity measure for matching. By combining the body topology information generation and body topology information matching modules, the authors achieve consistency in the body topology information features of pedestrian images, ranging from restoration to retrieval. Extensive experiments are conducted on both holistic person re-identification datasets (Market-1501, DukeMTMC-ReID) and occluded person re-identification datasets (Occluded-DukeMTMC, Occluded-ReID). The results demonstrate the superior performance of the authors proposed model, and visualisations of the generation and matching modules are provided to illustrate their effectiveness. Furthermore, an ablation study is conducted to validate the contributions of the proposed framework.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"393-404"},"PeriodicalIF":1.7,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12256","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139232904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent methods for learning latent representations in Domain Adaptation (DA) often entangle feature learning and latent-space exploration in a unified process. However, these methods can cause a false alignment problem and do not generalise well to aligning distributions with large discrepancy. In this study, the authors propose to explicitly explore a robust subspace for Semi-Supervised Domain Adaptation (SSDA). Concretely, to disentangle the intricate relationship between feature learning and subspace exploration, the authors iterate and optimise them in two steps: in the first step, they learn well-clustered latent representations by aggregating target features around the estimated class-wise prototypes; in the second step, they adaptively explore a subspace of an autoencoder for robust SSDA. In particular, a novel denoising strategy via class-agnostic disturbance is adopted to improve the discriminative ability of the subspace. Extensive experiments on publicly available datasets verify the promising and competitive performance of the authors' approach against state-of-the-art methods.
{"title":"Semi-supervised domain adaptation via subspace exploration","authors":"Zheng Han, Xiaobin Zhu, Chun Yang, Zhiyu Fang, Jingyan Qin, Xucheng Yin","doi":"10.1049/cvi2.12254","DOIUrl":"10.1049/cvi2.12254","url":null,"abstract":"<p>Recent methods of learning latent representations in Domain Adaptation (DA) often entangle the learning of features and exploration of latent space into a unified process. However, these methods can cause a false alignment problem and do not generalise well to the alignment of distributions with large discrepancy. In this study, the authors propose to explore a robust subspace for Semi-Supervised Domain Adaptation (SSDA) explicitly. To be concrete, for disentangling the intricate relationship between feature learning and subspace exploration, the authors iterate and optimise them in two steps: in the first step, the authors aim to learn well-clustered latent representations by aggregating the target feature around the estimated class-wise prototypes; in the second step, the authors adaptively explore a subspace of an autoencoder for robust SSDA. Specially, a novel denoising strategy via class-agnostic disturbance to improve the discriminative ability of subspace is adopted. Extensive experiments on publicly available datasets verify the promising and competitive performance of our approach against state-of-the-art methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"370-380"},"PeriodicalIF":1.7,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12254","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139229171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the robustness of skeleton data to human scale, illumination changes, dynamic camera views, and complex backgrounds, great progress has been made in skeleton-based video anomaly detection in recent years. The spatio-temporal graph convolutional network has proven effective in modelling the spatio-temporal dependencies of non-Euclidean data such as human skeleton graphs, and autoencoders built from this basic unit are widely used to model sequence features. However, owing to the limitations of the convolution kernel, such models cannot capture correlations between non-adjacent joints and struggle with long-term sequences, resulting in an insufficient understanding of behaviour. To address this issue, this paper applies the Transformer to the human skeleton and proposes the Spatio-Temporal Enhanced Graph-Transformer AutoEncoder (STEGT-AE) to improve modelling capability. In addition, a multi-memory model with skip connections is employed to provide different levels of coding features, thereby enhancing the model's ability to distinguish similar heterogeneous behaviours. Furthermore, STEGT-AE has a single-encoder, double-decoder architecture, which improves detection performance by combining reconstruction and prediction errors. The experimental results show that STEGT-AE performs significantly better than other advanced algorithms on four baseline datasets.
{"title":"A Spatio-Temporal Enhanced Graph-Transformer AutoEncoder embedded pose for anomaly detection","authors":"Honglei Zhu, Pengjuan Wei, Zhigang Xu","doi":"10.1049/cvi2.12257","DOIUrl":"10.1049/cvi2.12257","url":null,"abstract":"<p>Due to the robustness of skeleton data to human scale, illumination changes, dynamic camera views, and complex backgrounds, great progress has been made in skeleton-based video anomaly detection in recent years. The spatio-temporal graph convolutional network has been proven to be effective in modelling the spatio-temporal dependencies of non-Euclidean data such as human skeleton graphs, and the autoencoder based on this basic unit is widely used to model sequence features. However, due to the limitations of the convolution kernel, the model cannot capture the correlation between non-adjacent joints, and it is difficult to deal with long-term sequences, resulting in an insufficient understanding of behaviour. To address this issue, this paper applies the Transformer to the human skeleton and proposes the Spatio-Temporal Enhanced Graph-Transformer AutoEncoder (STEGT-AE) to improve the capability of modelling. In addition, the multi-memory model with skip connections is employed to provide different levels of coding features, thereby enhancing the ability of the model to distinguish similar heterogeneous behaviours. Furthermore, the STEGT-AE has a single encoder-double decoder architecture, which can improve the detection performance by the combining reconstruction and prediction error. The experimental results show that performances of STEGT-AE is significantly better than other advanced algorithms on four baseline datasets.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"405-419"},"PeriodicalIF":1.7,"publicationDate":"2023-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12257","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139246264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in deep convolutional neural networks have shown improved performance in face super-resolution through joint training with other tasks such as face analysis and landmark prediction. However, these methods have certain limitations. One major limitation is the requirement for manual annotation of the dataset for multi-task joint learning, and this additional annotation process increases the computational cost of the network model. Additionally, since prior information is often estimated from low-quality faces, the obtained guidance information tends to be inaccurate. To address these challenges, a novel Decoder Structure Guided CNN-Transformer Network (DCTNet) is introduced, which utilises the newly proposed Global-Local Feature Extraction Unit (GLFEU) for effective embedding. Specifically, the GLFEU is composed of an attention branch and a Transformer branch to simultaneously restore global facial structure and local texture details. Additionally, a Multi-Stage Feature Fusion Module is incorporated to fuse features from different network stages, further improving the quality of the restored face images. Compared with previous methods, DCTNet improves the Peak Signal-to-Noise Ratio by 0.23 dB and 0.19 dB on the CelebA and Helen datasets, respectively. Experimental results demonstrate that DCTNet offers a simple yet powerful solution for recovering detailed facial structures from low-quality images.
{"title":"A Decoder Structure Guided CNN-Transformer Network for face super-resolution","authors":"Rui Dou, Jiawen Li, Xujie Wan, Heyou Chang, Hao Zheng, Guangwei Gao","doi":"10.1049/cvi2.12251","DOIUrl":"10.1049/cvi2.12251","url":null,"abstract":"<p>Recent advances in deep convolutional neural networks have shown improved performance in face super-resolution through joint training with other tasks such as face analysis and landmark prediction. However, these methods have certain limitations. One major limitation is the requirement for manual marking information on the dataset for multi-task joint learning. This additional marking process increases the computational cost of the network model. Additionally, since prior information is often estimated from low-quality faces, the obtained guidance information tends to be inaccurate. To address these challenges, a novel Decoder Structure Guided CNN-Transformer Network (DCTNet) is introduced, which utilises the newly proposed Global-Local Feature Extraction Unit (GLFEU) for effective embedding. Specifically, the proposed GLFEU is composed of an attention branch and a Transformer branch, to simultaneously restore global facial structure and local texture details. Additionally, a Multi-Stage Feature Fusion Module is incorporated to fuse features from different network stages, further improving the quality of the restored face images. Compared with previous methods, DCTNet improves Peak Signal-to-Noise Ratio by 0.23 and 0.19 dB on the CelebA and Helen datasets, respectively. Experimental results demonstrate that the designed DCTNet offers a simple yet powerful solution to recover detailed facial structures from low-quality images.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"473-484"},"PeriodicalIF":1.7,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12251","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139247701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Skeleton-based action recognition methods commonly employ graph neural networks to learn different aspects of skeleton topology information. However, these methods often struggle to capture contextual information beyond the skeleton topology. To address this issue, a Scene Context-aware Graph Convolutional Network (SCA-GCN) that leverages potential contextual information in the scene is proposed. Specifically, SCA-GCN learns the co-occurrence probabilities of actions in specific scenarios from a common knowledge base and fuses these probabilities into the original skeleton topology decoder, producing more robust results. To demonstrate the effectiveness of SCA-GCN, extensive experiments are conducted on four widely used datasets: SBU, N-UCLA, NTU RGB + D, and NTU RGB + D 120. The experimental results show that SCA-GCN surpasses existing methods, and its core idea can be extended to other methods with only a few concatenation operations that add little computational complexity.
{"title":"Scene context-aware graph convolutional network for skeleton-based action recognition","authors":"Wenxian Zhang","doi":"10.1049/cvi2.12253","DOIUrl":"10.1049/cvi2.12253","url":null,"abstract":"<p>Skeleton-based action recognition methods commonly employ graph neural networks to learn different aspects of skeleton topology information However, these methods often struggle to capture contextual information beyond the skeleton topology. To address this issue, a Scene Context-aware Graph Convolutional Network (SCA-GCN) that leverages potential contextual information in the scene is proposed. Specifically, SCA-GCN learns the co-occurrence probabilities of actions in specific scenarios from a common knowledge base and fuses these probabilities into the original skeleton topology decoder, producing more robust results. To demonstrate the effectiveness of SCA-GCN, extensive experiments on four widely used datasets, that is, SBU, N-UCLA, NTU RGB + D, and NTU RGB + D 120 are conducted. The experimental results show that SCA-GCN surpasses existing methods, and its core idea can be extended to other methods with only some concatenation operations that consume less computational complexity.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"343-354"},"PeriodicalIF":1.7,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12253","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139263769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grasping detection, which involves identifying and assessing the grasp ability of objects by robotic systems, has garnered significant attention in recent years due to its pivotal role in robotic systems and automated assembly processes. Despite notable advancements, current methods often grapple with practical and theoretical challenges that hinder their real-world applicability, including low detection accuracy, oversized model parameters, and the inherent complexity of real-world scenarios. In response, a novel lightweight grasping detection model is introduced that addresses both the technical aspects and the underlying theoretical complexities. The proposed model incorporates attention mechanisms and residual modules to handle the challenges posed by varying object shapes, sizes, materials, and environmental conditions. It employs a Convolutional Block Attention Module (CBAM) to extract features from the RGB and depth channels, recognising the multifaceted nature of object properties. A feature fusion module then combines these diverse features, addressing the challenge of information integration. The fused features pass through five residual blocks, followed by another CBAM attention module, culminating in three output images representing grasp quality, grasping angle, and grasping width, which collectively yield the final grasp detection results. Trained and evaluated on the Cornell Grasp dataset, the proposed model achieves detection accuracy rates of 98.44% on the image-wise split and 96.88% on the object-wise split. These results corroborate the model's strong performance: the residual modules enable rapid training, while the attention modules facilitate precise feature extraction, striking an effective balance between detection time and accuracy.
{"title":"CR-Net: Robot grasping detection method integrating convolutional block attention module and residual module","authors":"Song Yan, Lei Zhang","doi":"10.1049/cvi2.12252","DOIUrl":"10.1049/cvi2.12252","url":null,"abstract":"<p>Grasping detection, which involves identifying and assessing the grasp ability of objects by robotic systems, has garnered significant attention in recent years due to its pivotal role in the development of robotic systems and automated assembly processes. Despite notable advancements in this field, current methods often grapple with both practical and theoretical challenges that hinder their real-world applicability. These challenges encompass low detection accuracy, the burden of oversized model parameters, and the inherent complexity of real-world scenarios. In response to these multifaceted challenges, a novel lightweight grasping detection model that not only addresses the technical aspects but also delves into the underlying theoretical complexities is introduced. The proposed model incorporates attention mechanisms and residual modules to tackle the theoretical challenges posed by varying object shapes, sizes, materials, and environmental conditions. To enhance its performance in the face of these theoretical complexities, the proposed model employs a Convolutional Block Attention Module (CBAM) to extract features from RGB and depth channels, recognising the multifaceted nature of object properties. Subsequently, a feature fusion module effectively combines these diverse features, providing a solution to the theoretical challenge of information integration. The model then processes the fused features through five residual blocks, followed by another CBAM attention module, culminating in the generation of three distinct images representing capture quality, grasping angle, and grasping width. These images collectively yield the final grasp detection results, addressing the theoretical complexities inherent in this task. The proposed model's rigorous training and evaluation on the Cornell Grasp dataset demonstrate remarkable detection accuracy rates of 98.44% on the Image-wise split and 96.88% on the Object-wise split. The experimental results strongly corroborate the exceptional performance of the proposed model, underscoring its ability to overcome the theoretical challenges associated with grasping detection. The integration of the residual module ensures rapid training, while the attention module facilitates precise feature extraction, ultimately striking an effective balance between detection time and accuracy.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 3","pages":"420-433"},"PeriodicalIF":1.7,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12252","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135041680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}