In in-vehicle driving scenarios, composite action recognition is crucial for improving safety and understanding the driver's intention. Due to spatial constraints and occlusion, the driver's range of motion is limited, resulting in similar action patterns that are difficult to differentiate. Additionally, collecting skeleton data that characterise the full human posture is difficult, posing further challenges for action recognition. To address these problems, a novel Graph-Reinforcement Transformer (GR-Former) model is proposed. Using limited skeleton data as input, it introduces graph structure information to directionally reinforce the self-attention mechanism and dynamically learns and aggregates features between joints at multiple levels, constructing a richer feature space that enhances expressiveness and recognition accuracy. On the Drive & Act dataset for composite action recognition, the authors' work uses only human upper-body skeleton data yet achieves state-of-the-art performance compared with existing methods. With complete human skeleton data, the model also attains excellent recognition accuracy on the NTU RGB+D and NTU RGB+D 120 datasets, demonstrating the strong generalisability of the GR-Former. Overall, the authors' work provides a new and effective solution for driver action recognition in in-vehicle scenarios.
{"title":"GR-Former: Graph-reinforcement transformer for skeleton-based driver action recognition","authors":"Zhuoyan Xu, Jingke Xu","doi":"10.1049/cvi2.12298","DOIUrl":"10.1049/cvi2.12298","url":null,"abstract":"<p>In in-vehicle driving scenarios, composite action recognition is crucial for improving safety and understanding the driver's intention. Due to spatial constraints and occlusion factors, the driver's range of motion is limited, thus resulting in similar action patterns that are difficult to differentiate. Additionally, collecting skeleton data that characterise the full human posture is difficult, posing additional challenges for action recognition. To address the problems, a novel Graph-Reinforcement Transformer (GR-Former) model is proposed. Using limited skeleton data as inputs, by introducing graph structure information to directionally reinforce the effect of the self-attention mechanism, dynamically learning and aggregating features between joints at multiple levels, the authors’ model constructs a richer feature vector space, enhancing its expressiveness and recognition accuracy. Based on the Drive & Act dataset for composite action recognition, the authors’ work only applies human upper-body skeleton data to achieve state-of-the-art performance compared to existing methods. Using complete human skeleton data also has excellent recognition accuracy on the NTU RGB + D- and NTU RGB + D 120 dataset, demonstrating the great generalisability of the GR-Former. Generally, the authors’ work provides a new and effective solution for driver action recognition in in-vehicle scenarios.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"982-991"},"PeriodicalIF":1.5,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12298","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141659905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human action recognition based on graph convolutional networks (GCNs) is one of the hotspots in computer vision. However, previous methods generally rely on handcrafted graphs, which limits a model's ability to characterise connections between indirectly connected joints; this limitation weakens connections between joints separated by long distances. To address this issue, the authors propose a skeleton simplification method which reduces the number of joints and the distances between them by merging adjacent joints into simplified joints. A group convolutional block is devised to extract the internal features of the simplified joints. Additionally, the authors introduce multi-scale modelling, which maps inputs into sequences across various levels of simplification. Combined with spatial-temporal graph convolution, a multi-scale skeleton simplification GCN for skeleton-based action recognition (M3S-GCN) is proposed to fuse multi-scale skeleton sequences and model the connections between joints. Finally, M3S-GCN is evaluated on five benchmarks: NTU RGB+D 60 (C-Sub, C-View), NTU RGB+D 120 (X-Sub, X-Set) and NW-UCLA. Experimental results show that the authors' M3S-GCN achieves state-of-the-art performance with accuracies of 93.0%, 97.0% and 91.2% on the C-Sub, C-View and X-Set benchmarks, respectively, which validates the effectiveness of the method.
{"title":"Multi-scale skeleton simplification graph convolutional network for skeleton-based action recognition","authors":"Fan Zhang, Ding Chongyang, Kai Liu, Liu Hongjin","doi":"10.1049/cvi2.12300","DOIUrl":"10.1049/cvi2.12300","url":null,"abstract":"<p>Human action recognition based on graph convolutional networks (GCNs) is one of the hotspots in computer vision. However, previous methods generally rely on handcrafted graph, which limits the effectiveness of the model in characterising the connections between indirectly connected joints. The limitation leads to weakened connections when joints are separated by long distances. To address the above issue, the authors propose a skeleton simplification method which aims to reduce the number of joints and the distance between joints by merging adjacent joints into simplified joints. Group convolutional block is devised to extract the internal features of the simplified joints. Additionally, the authors enhance the method by introducing multi-scale modelling, which maps inputs into sequences across various levels of simplification. Combining with spatial temporal graph convolution, a multi-scale skeleton simplification GCN for skeleton-based action recognition (M3S-GCN) is proposed for fusing multi-scale skeleton sequences and modelling the connections between joints. Finally, M3S-GCN is evaluated on five benchmarks of NTU RGB+D 60 (C-Sub, C-View), NTU RGB+D 120 (X-Sub, X-Set) and NW-UCLA datasets. Experimental results show that the authors’ M3S-GCN achieves state-of-the-art performance with the accuracies of 93.0%, 97.0% and 91.2% on C-Sub, C-View and X-Set benchmarks, which validates the effectiveness of the method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"992-1003"},"PeriodicalIF":1.5,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12300","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141668289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most machine learning methods for animal recognition in camera trap images are limited to mammal identification and group all birds into a single class. Machine learning methods for visually discriminating birds, in turn, cannot discriminate between mammals and are not designed for camera trap images. The authors present deep neural network models that recognise both mammal and bird species in camera trap images. They train neural network models for species classification as well as for predicting the animal taxonomy, that is, genus, family, order, group, and class names. Different neural network architectures, including ResNet, EfficientNetV2, Vision Transformer, Swin Transformer, and ConvNeXt, are compared for these tasks. Furthermore, the authors investigate approaches to overcoming various challenges associated with camera trap image analysis. The authors' best species classification models achieve a mean average precision (mAP) of 97.91% on a validation data set and mAPs of 90.39% and 82.77% on test data sets recorded in forests in Germany and Poland, respectively. Their best taxonomic classification models reach a validation mAP of 97.18% and mAPs of 94.23% and 79.92% on the two test data sets, respectively.
{"title":"Recognition of European mammals and birds in camera trap images using deep neural networks","authors":"Daniel Schneider, Kim Lindner, Markus Vogelbacher, Hicham Bellafkir, Nina Farwig, Bernd Freisleben","doi":"10.1049/cvi2.12294","DOIUrl":"10.1049/cvi2.12294","url":null,"abstract":"<p>Most machine learning methods for animal recognition in camera trap images are limited to mammal identification and group birds into a single class. Machine learning methods for visually discriminating birds, in turn, cannot discriminate between mammals and are not designed for camera trap images. The authors present deep neural network models to recognise both mammals and bird species in camera trap images. They train neural network models for species classification as well as for predicting the animal taxonomy, that is, genus, family, order, group, and class names. Different neural network architectures, including ResNet, EfficientNetV2, Vision Transformer, Swin Transformer, and ConvNeXt, are compared for these tasks. Furthermore, the authors investigate approaches to overcome various challenges associated with camera trap image analysis. The authors’ best species classification models achieve a mean average precision (mAP) of 97.91% on a validation data set and mAPs of 90.39% and 82.77% on test data sets recorded in forests in Germany and Poland, respectively. Their best taxonomic classification models reach a validation mAP of 97.18% and mAPs of 94.23% and 79.92% on the two test data sets, respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1162-1192"},"PeriodicalIF":1.5,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12294","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141683177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, multi-view clustering (MVC) has had significant implications in cross-modal representation learning and data-driven decision-making. Its main objective is to cluster samples into distinct groups by leveraging consistent and complementary information among multiple views. Meanwhile, contrastive learning has evolved rapidly in computer vision, and self-supervised learning has made substantial research progress; consequently, self-supervised learning is progressively becoming dominant in MVC methods. It involves designing proxy tasks to extract supervisory information from image and video data, thereby guiding the clustering process. Despite the rapid development of self-supervised MVC, there is currently no comprehensive survey analysing and summarising the current state of research progress. Hence, the authors explore the emergence of self-supervised MVC by discussing the reasons and advantages behind it. Additionally, the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods are investigated. The authors not only introduce the mechanisms of each category of methods but also provide illustrative examples of their applications. Finally, some open problems are identified for further investigation and development.
{"title":"Self-supervised multi-view clustering in computer vision: A survey","authors":"Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng","doi":"10.1049/cvi2.12299","DOIUrl":"https://doi.org/10.1049/cvi2.12299","url":null,"abstract":"<p>In recent years, multi-view clustering (MVC) has had significant implications in the fields of cross-modal representation learning and data-driven decision-making. Its main objective is to cluster samples into distinct groups by leveraging consistency and complementary information among multiple views. However, the field of computer vision has witnessed the evolution of contrastive learning, and self-supervised learning has made substantial research progress. Consequently, self-supervised learning is progressively becoming dominant in MVC methods. It involves designing proxy tasks to extract supervisory information from image and video data, thereby guiding the clustering process. Despite the rapid development of self-supervised MVC, there is currently no comprehensive survey analysing and summarising the current state of research progress. Hence, the authors aim to explore the emergence of self-supervised MVC by discussing the reasons and advantages behind it. Additionally, the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods are investigated. The authors not only introduce the mechanisms for each category of methods, but also provide illustrative examples of their applications. Finally, some open problems are identified for further investigation and development.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"709-734"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12299","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, detecting anomalies in real-world surveillance videos from weakly supervised data has emerged as a challenge. Traditional methods, which apply multi-instance learning (MIL) to video snippets, struggle with background noise and tend to overlook subtle anomalies. To tackle this, the authors propose a novel approach that crops snippets to create multiple instances with less noise, evaluates them separately and then fuses these evaluations for more precise anomaly detection. This method, however, leads to higher computational demands, especially during inference. To address this, the authors' solution employs mutual learning to guide snippet feature training using these low-noise crops. The authors integrate MIL for the primary task, with snippets as inputs, and multiple-multiple instance learning (MMIL) for an auxiliary task with crops during training. The approach ensures consistent multi-instance results in both tasks and incorporates a temporal activation mutual learning module (TAML) for aligning temporal anomaly activations between snippets and crops, improving the overall quality of snippet representations. Additionally, a snippet feature discrimination enhancement module (SFDE) refines the snippet features further. Tested across various datasets, the authors' method shows remarkable performance, notably achieving a frame-level AUC of 85.78% on the UCF-Crime dataset while reducing computational costs.
{"title":"Fusing crops representation into snippet via mutual learning for weakly supervised surveillance anomaly detection","authors":"Bohua Zhang, Jianru Xue","doi":"10.1049/cvi2.12289","DOIUrl":"10.1049/cvi2.12289","url":null,"abstract":"<p>In recent years, the challenge of detecting anomalies in real-world surveillance videos using weakly supervised data has emerged. Traditional methods, utilising multi-instance learning (MIL) with video snippets, struggle with background noise and tend to overlook subtle anomalies. To tackle this, the authors propose a novel approach that crops snippets to create multiple instances with less noise, separately evaluates them and then fuses these evaluations for more precise anomaly detection. This method, however, leads to higher computational demands, especially during inference. Addressing this, our solution employs mutual learning to guide snippet feature training using these low-noise crops. The authors integrate multiple instance learning (MIL) for the primary task with snippets as inputs and multiple-multiple instance learning (MMIL) for an auxiliary task with crops during training. The authors’ approach ensures consistent multi-instance results in both tasks and incorporates a temporal activation mutual learning module (TAML) for aligning temporal anomaly activations between snippets and crops, improving the overall quality of snippet representations. Additionally, a snippet feature discrimination enhancement module (SFDE) refines the snippet features further. Tested across various datasets, the authors’ method shows remarkable performance, notably achieving a frame-level AUC of 85.78% on the UCF-Crime dataset, while reducing computational costs.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1112-1126"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12289","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141684297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although many new methods have emerged for text-driven image manipulation, the large computational power required for model training makes their training slow. These methods also consume a considerable amount of video random access memory (VRAM) during training; when generating high-resolution images, VRAM is often insufficient, making high-resolution generation impossible. Meanwhile, recent advances in Vision Transformers (ViTs) have demonstrated strong image classification and recognition capabilities. Unlike traditional Convolutional Neural Network-based methods, ViTs have a Transformer-based architecture and leverage attention mechanisms to capture comprehensive global information; their inherent long-range dependencies enable an enhanced global understanding of images, extracting more robust features and achieving comparable results with a reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined and the FastFaceCLIP method was proposed by combining the image-text semantic alignment of the pre-trained CLIP model with the high-resolution image generation of the proposed FastFace. Additionally, the Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced using the Real-ESRGAN algorithm. Finally, extensive face-manipulation experiments on the CelebA-HQ dataset compare the proposed method with related schemes, demonstrating that FastFaceCLIP effectively generates semantically accurate, visually realistic, and clear images using fewer parameters and less time.
{"title":"FastFaceCLIP: A lightweight text-driven high-quality face image manipulation","authors":"Jiaqi Ren, Junping Qin, Qianli Ma, Yin Cao","doi":"10.1049/cvi2.12295","DOIUrl":"10.1049/cvi2.12295","url":null,"abstract":"<p>Although many new methods have emerged in text-driven images, the large computational power required for model training causes these methods to have a slow training process. Additionally, these methods consume a considerable amount of video random access memory (VRAM) resources during training. When generating high-resolution images, the VRAM resources are often insufficient, which results in the inability to generate high-resolution images. Nevertheless, recent Vision Transformers (ViTs) advancements have demonstrated their image classification and recognition capabilities. Unlike the traditional Convolutional Neural Networks based methods, ViTs have a Transformer-based architecture, leverage attention mechanisms to capture comprehensive global information, moreover enabling enhanced global understanding of images through inherent long-range dependencies, thus extracting more robust features and achieving comparable results with reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined and the FastFaceCLIP method was proposed by combining the image-text semantic alignment function of the pre-trained CLIP model with the high-resolution image generation function of the proposed FastFace. Additionally, the Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced using the Real-ESRGAN algorithm. Eventually, extensive face manipulation-related tests on the CelebA-HQ dataset challenge the proposed method and other related schemes, demonstrating that FastFaceCLIP effectively generates semantically accurate, visually realistic, and clear images using fewer parameters and less time.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"950-967"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12295","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141687557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomaly detection, also known as outlier detection, is critical in domains such as network security, intrusion detection, and fraud detection. One popular approach is to use autoencoders, which are trained to reconstruct the input by minimising the reconstruction error of the neural network. However, these methods usually suffer from a trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability, which degrades performance. The authors find that this trade-off can be better mitigated by imposing constraints on the latent space of images. To this end, the authors propose a new Dual Adversarial Network (DualAD) that consists of a Feature Constraint (FC) module and a reconstruction module. The method incorporates the FC module during reconstruction training to impose constraints on the latent space of images, thereby yielding feature representations more conducive to anomaly detection. Additionally, the authors employ dual adversarial learning to model the distribution of normal data. On the one hand, adversarial learning is applied during the reconstruction process to obtain higher-quality reconstructed samples, preventing blurred reconstructions from degrading model performance. On the other hand, the authors adversarially train the FC module and the reconstruction module to achieve superior feature representations, making anomalies more distinguishable at the feature level. During inference, anomaly detection is performed simultaneously in the pixel and latent spaces to identify abnormal patterns more comprehensively. Experiments on three data sets, CIFAR10, MNIST, and FashionMNIST, demonstrate the validity of the authors' work. The results show that constraints on the latent space and adversarial learning can improve detection performance.
{"title":"DualAD: Dual adversarial network for image anomaly detection⋆","authors":"Yonghao Wan, Aimin Feng","doi":"10.1049/cvi2.12297","DOIUrl":"https://doi.org/10.1049/cvi2.12297","url":null,"abstract":"<p>Anomaly Detection, also known as outlier detection, is critical in domains such as network security, intrusion detection, and fraud detection. One popular approach to anomaly detection is using autoencoders, which are trained to reconstruct input by minimising reconstruction error with the neural network. However, these methods usually suffer from the trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability, which damages the performance. The authors find that the above trade-off can be better mitigated by imposing constraints on the latent space of images. To this end, the authors propose a new Dual Adversarial Network (DualAD) that consists of a Feature Constraint (FC) module and a reconstruction module. The method incorporates the FC module during the reconstruction training process to impose constraints on the latent space of images, thereby yielding feature representations more conducive to anomaly detection. Additionally, the authors employ dual adversarial learning to model the distribution of normal data. On the one hand, adversarial learning was implemented during the reconstruction process to obtain higher-quality reconstruction samples, thereby preventing the effects of blurred image reconstructions on model performance. On the other hand, the authors utilise adversarial training of the FC module and the reconstruction module to achieve superior feature representation, making anomalies more distinguishable at the feature level. During the inference phase, the authors perform anomaly detection simultaneously in the pixel and latent spaces to identify abnormal patterns more comprehensively. Experiments on three data sets CIFAR10, MNIST, and FashionMNIST demonstrate the validity of the authors’ work. Results show that constraints on the latent space and adversarial learning can improve detection performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1138-1148"},"PeriodicalIF":1.5,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12297","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vehicle transportation of hazardous chemicals is one of the major mobile hazards in modern logistics, and its unsafe factors pose serious threats to people's lives, property and environmental safety. Although current object detection algorithms have been applied to the detection of hazardous chemical vehicles, the complexity of the transportation environment and the small size and low resolution of vehicle targets make detection against complex backgrounds difficult. To solve these problems, the authors propose an improved algorithm based on YOLOv5 that enhances the detection accuracy and efficiency for hazardous chemical vehicles. First, to better capture the details and semantic information of hazardous chemical vehicles, the algorithm addresses the mismatch between the detector's receptive field and the target object by introducing a receptive field expansion block into the backbone network, improving the model's ability to capture fine-grained vehicle information. Second, to strengthen the model's representation of hazardous chemical vehicles, the authors introduce a separable attention mechanism in the multi-scale object detection stage and enhance the model's prediction ability by coherently combining the detection head and the attention mechanism across the scale-aware feature layer, the spatial-aware location and the task-aware output channel. Experimental results show that the improved model significantly surpasses the baseline in accuracy and achieves more precise object detection, while also improving inference speed.
{"title":"SAM-Y: Attention-enhanced hazardous vehicle object detection algorithm","authors":"Shanshan Wang, Bushi Liu, Pengcheng Zhu, Xianchun Meng, Bolun Chen, Wei Shao, Liqing Chen","doi":"10.1049/cvi2.12293","DOIUrl":"https://doi.org/10.1049/cvi2.12293","url":null,"abstract":"<p>Vehicle transportation of hazardous chemicals is one of the important mobile hazards in modern logistics, and its unsafe factors bring serious threats to people's lives, property and environmental safety. Although the current object detection algorithm has certain applications in the detection of hazardous chemical vehicles, due to the complexity of the transportation environment, the small size and low resolution of the vehicle target etc., object detection becomes more difficult in the face of a complex background. In order to solve these problems, the authors propose an improved algorithm based on YOLOv5 to enhance the detection accuracy and efficiency of hazardous chemical vehicles. Firstly, in order to better capture the details and semantic information of hazardous chemical vehicles, the algorithm solves the problem of mismatch between the receptive field of the detector and the target object by introducing the receptive field expansion block into the backbone network, so as to improve the ability of the model to capture the detailed information of hazardous chemical vehicles. Secondly, in order to improve the ability of the model to express the characteristics of hazardous chemical vehicles, the authors introduce a separable attention mechanism in the multi-scale target detection stage, and enhances the prediction ability of the model by combining the object detection head and attention mechanism coherently in the feature layer of scale perception, the spatial location of spatial perception and the output channel of task perception. Experimental results show that the improved model significantly surpasses the baseline model in terms of accuracy and achieves more accurate object detection. At the same time, the model also has a certain improvement in inference speed and achieves faster inference ability.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1149-1161"},"PeriodicalIF":1.5,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12293","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the richness of natural image semantics, natural image colourisation is a challenging problem. Existing methods often suffer from semantic confusion due to insufficient semantic understanding, resulting in unreasonable colour assignments, especially at the edges of objects; this phenomenon is referred to as colour bleeding. The authors have found that using the self-attention mechanism benefits the model's understanding and recognition of object semantics. However, it introduces another problem in colourisation, namely dull colour. With this in mind, a Position-Spatial Attention Network (PSANet) is proposed to address both colour bleeding and dull colour. First, a novel attention module called the position-spatial attention module (PSAM) is introduced. Through the proposed PSAM module, the model enhances the semantic understanding of images while solving the dull-colour problem caused by self-attention. Then, to further prevent colour bleeding at object boundaries, a gradient-aware loss is proposed. Finally, the colour bleeding phenomenon is further reduced by the combined effect of the gradient-aware loss and an edge-aware loss. Experimental results show that the method largely reduces colour bleeding while maintaining good perceptual quality.
{"title":"PSANet: Automatic colourisation using position-spatial attention for natural images","authors":"Peng-Jie Zhu, Yuan-Yuan Pu, Qiuxia Yang, Siqi Li, Zheng-Peng Zhao, Hao Wu, Dan Xu","doi":"10.1049/cvi2.12291","DOIUrl":"https://doi.org/10.1049/cvi2.12291","url":null,"abstract":"<p>Due to the richness of natural image semantics, natural image colourisation is a challenging problem. Existing methods often suffer from semantic confusion due to insufficient semantic understanding, resulting in unreasonable colour assignments, especially at the edges of objects. This phenomenon is referred to as colour bleeding. The authors have found that using the self-attention mechanism benefits the model's understanding and recognition of object semantics. However, this leads to another problem in colourisation, namely dull colour. With this in mind, a Position-Spatial Attention Network(PSANet) is proposed to address the colour bleeding and the dull colour. Firstly, a novel new attention module called position-spatial attention module (PSAM) is introduced. Through the proposed PSAM module, the model enhances the semantic understanding of images while solving the dull colour problem caused by self-attention. Then, in order to further prevent colour bleeding on object boundaries, a gradient-aware loss is proposed. Lastly, the colour bleeding phenomenon is further improved by the combined effect of gradient-aware loss and edge-aware loss. Experimental results show that this method can reduce colour bleeding largely while maintaining good perceptual quality.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"922-934"},"PeriodicalIF":1.5,"publicationDate":"2024-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12291","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning-based face recognition models have demonstrated remarkable performance in benchmark tests, and knowledge distillation has frequently been used to obtain high-precision real-time face recognition models designed for mobile and embedded devices. However, recent knowledge distillation methods for face recognition, which mainly focus on feature or logit distillation, neglect the attention mechanism, which plays an important role in neural networks. An attention cosine similarity knowledge distillation method with an innovative cross-stage connection review path, uniting the attention mechanism with the review knowledge distillation method, is proposed. This method transfers the attention map obtained from the teacher network to the student through a cross-stage connection path. The efficacy and excellence of the proposed algorithm are demonstrated on popular benchmarks.
{"title":"Knowledge distillation of face recognition via attention cosine similarity review","authors":"Zhuo Wang, SuWen Zhao, WanYi Guo","doi":"10.1049/cvi2.12288","DOIUrl":"https://doi.org/10.1049/cvi2.12288","url":null,"abstract":"<p>Deep learning-based face recognition models have demonstrated remarkable performance in benchmark tests, and knowledge distillation technology has been frequently accustomed to obtain high-precision real-time face recognition models specifically designed for mobile and embedded devices. However, in recent years, the knowledge distillation methods for face recognition, which mainly focus on feature or logit knowledge distillation techniques, neglect the attention mechanism that play an important role in the domain of neural networks. An innovation cross-stage connection review path of the attention cosine similarity knowledge distillation method that unites the attention mechanism with review knowledge distillation method is proposed. This method transfers the attention map obtained from the teacher network to the student through a cross-stage connection path. The efficacy and excellence of the proposed algorithm are demonstrated in popular benchmark tests.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"875-887"},"PeriodicalIF":1.5,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12288","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}