Jihong Ouyang, Zhengjie Zhang, Qingyi Meng, Jinjin Chi
Active domain adaptation (active DA) selectively labels a limited number of target samples to significantly improve adaptation performance. However, existing active DA methods often struggle in real-world scenarios where, owing to data privacy concerns, only a pre-trained source model is available rather than the source samples. To address this issue, we propose the structure-based uncertainty estimation model (SUEM) for source-free active domain adaptation (SFADA). Specifically, we introduce an active sample selection strategy that combines uncertainty and diversity sampling to identify the most informative samples: we assess the uncertainty of target samples using structure-wise probabilities and apply a diversity selection method to minimise redundancy. For the selected samples, we apply not only a standard supervised loss but also interpolation consistency training to further explore the structural information of the target domain. Extensive experiments on four widely used datasets show that our method matches or outperforms current unsupervised domain adaptation (UDA) and active DA methods.
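The two-stage idea of selecting samples that are both uncertain and diverse can be illustrated with a minimal sketch. It is not the SUEM procedure: plain prediction entropy stands in for the structure-wise probabilities mentioned above, greedy farthest-point selection stands in for the diversity step, and the names `select_samples`, `budget` and `pool_factor` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_samples(probs, feats, budget, pool_factor=3):
    """Return up to `budget` sample indices that are uncertain and spread out.

    Assumption: entropy is a stand-in for the paper's structure-wise
    uncertainty; the diversity step is generic farthest-point selection.
    """
    if budget <= 0 or not probs:
        return []
    # Stage 1 (uncertainty): keep the most uncertain candidates.
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    pool = order[: budget * pool_factor]
    # Stage 2 (diversity): repeatedly add the candidate farthest from the chosen set.
    chosen = [pool[0]]
    while len(chosen) < min(budget, len(pool)):
        best = max(
            (i for i in pool if i not in chosen),
            key=lambda i: min(sq_dist(feats[i], feats[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


# Toy data: five target samples with 2-class predictions and 2-D features.
probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.55, 0.45], [1.0, 0.0]]
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 10.0], [3.0, 3.0]]
picked = select_samples(probs, feats, budget=2, pool_factor=2)
```

In this toy run, samples 0 and 1 are equally uncertain but nearly identical in feature space, so the diversity step skips sample 1 in favour of a distant candidate, which is the redundancy reduction the abstract describes.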
Structure-Based Uncertainty Estimation for Source-Free Active Domain Adaptation. Jihong Ouyang, Zhengjie Zhang, Qingyi Meng, Jinjin Chi. IET Computer Vision 19(1), published 2025-04-16. DOI: 10.1049/cvi2.70020. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70020
Skeleton-based action recognition using Graph Convolutional Networks (GCNs) has achieved remarkable performance, but recognising ambiguous actions, such as ‘waving’ and ‘saluting’, remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and Temporal Convolutional Networks (TCNs), in which spatial and temporal features are extracted independently. This leads to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasise local details and lose crucial global context, which makes ambiguous actions even harder to tell apart. To address these challenges, the authors propose a lightweight plug-and-play module called Synchronised and Fine-grained Head (SF-Head), inserted between GCN and TCN layers. SF-Head first conducts Synchronised Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between the two types of features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated features with their original spatial-temporal features. This aggregation step combines global context with local details, enhancing the model's ability to classify ambiguous actions. Experimental results on the NTU RGB + D 60, NTU RGB + D 120, NW-UCLA and PKU-MMD I datasets demonstrate significant improvements in distinguishing ambiguous actions. The code will be made available at https://github.com/HaoHuang2003/SFHead.
Synchronised and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition. Hao Huang, Yujie Lin, Siyu Chen, Haiyang Liu. IET Computer Vision 19(1), published 2025-04-15. DOI: 10.1049/cvi2.70016. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70016
The Diffusion Probabilistic Model (DM) has emerged as a powerful generative model in the field of image synthesis, capable of producing high-quality and realistic images. However, training a DM requires a large and diverse dataset, which can be challenging to obtain, and the model's generalisation and robustness weaken when training data is limited. To address this issue, EDG-CDM, an encoder-guided conditional diffusion model, is proposed for image synthesis with limited data. First, the authors pre-train the encoder by introducing noise to capture the distribution of image features and generate the condition vector through contrastive learning and KL divergence. Next, the encoder undergoes further training with classification to integrate image class information, providing more favourable and versatile conditions for the diffusion model. Subsequently, the encoder is connected to the diffusion model, which is trained using all available data with encoder-provided conditions. Finally, the authors evaluate EDG-CDM on various public datasets with limited data, conducting extensive experiments and comparing the results with state-of-the-art methods using metrics such as Fréchet Inception Distance (FID) and Inception Score (IS). The experiments demonstrate that EDG-CDM outperforms existing models by consistently achieving the lowest FID and the highest IS, highlighting its effectiveness in generating high-quality and diverse images with limited training data. These results underscore the significance of EDG-CDM in advancing image synthesis techniques under data-constrained scenarios.
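For readers unfamiliar with the two metrics, their standard definitions are given below as a reminder; the symbols are the conventional ones and are not taken from the paper. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features for real and generated images, $p_g$ is the distribution of generated images, $p(y \mid x)$ is the Inception classifier's prediction for image $x$, and $p(y)$ is that prediction averaged over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}
  \left[ D_{\mathrm{KL}}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
```

A lower FID means the generated feature distribution is closer to the real one, while a higher IS means individual predictions are confident and the overall class mix is varied, which is why the abstract reports the lowest FID together with the highest IS.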
EDG-CDM: A New Encoder-Guided Conditional Diffusion Model-Based Image Synthesis Method for Limited Data. Haopeng Lei, Hao Yin, Kaijun Liang, Mingwen Wang, Jinshan Zeng, Guoliang Luo. IET Computer Vision 19(1), published 2025-04-08. DOI: 10.1049/cvi2.70018. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70018
With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We focus on species recognition in Insecta, as insects are critical for biodiversity monitoring and sit at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time-consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that suits our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNNs), vision transformers (ViTs) and locality-based vision transformers (LBVTs) on four aspects: classification performance, embedding quality, computational cost and gradient activity. We offer insights not previously available in this domain, showing the extent to which these algorithms solve fine-grained tasks in Insecta. We found that ViTs perform best on inference speed and computational cost, whereas LBVTs outperform the others on classification performance and embedding quality; CNNs provide a trade-off among the metrics.
Performance of Computer Vision Algorithms for Fine-Grained Classification Using Crowdsourced Insect Images. Rita Pucci, Vincent J. Kalkman, Dan Stowell. IET Computer Vision 19(1), published 2025-04-04. DOI: 10.1049/cvi2.70006. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70006
Camouflaged object detection (COD) aims to identify and segment objects that closely resemble and are seamlessly integrated into their surrounding environments, making it a challenging task in computer vision. COD is constrained by the limited availability of training data and annotated samples, and most carefully designed COD models exhibit diminished performance under low-data conditions. In recent years, there has been increasing interest in leveraging foundation models, which have demonstrated robust general capabilities and superior generalisation performance, to address COD challenges. This work proposes a knowledge-guided domain adaptation (KGDA) approach to tackle the data scarcity problem in COD. The method uses knowledge descriptions of camouflaged images generated by multimodal large language models (MLLMs), aiming to enhance the model's comprehension of semantic objects and camouflaged scenes through highly abstract and generalised knowledge representations. To resolve ambiguities and errors in the generated text descriptions, a multi-level knowledge aggregation (MLKG) module is devised. This module consolidates consistent semantic knowledge and forms multi-level semantic knowledge features. To incorporate semantic knowledge into the visual foundation model, the authors introduce a knowledge-guided semantic enhancement adaptor (KSEA) that integrates the semantic knowledge of camouflaged objects while preserving the original knowledge of the foundation model. Extensive experiments demonstrate that the method surpasses 19 state-of-the-art approaches and exhibits strong generalisation capabilities even with limited annotated data.
Foundation Model Based Camouflaged Object Detection. Zefeng Chen, Zhijiang Li, Yunqi Xue, Li Zhang. IET Computer Vision 19(1), published 2025-04-01. DOI: 10.1049/cvi2.70009. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70009
Rosie Finnegan, Joseph Metcalfe, Sara Sharifzadeh, Fabio Caraffini, Xianghua Xie, Alberto Hornero, Nicholas W. Synes
This study presents a novel approach to crop mapping using remotely sensed satellite images. It addresses two significant classification modelling challenges: (1) the requirement for extensive labelled data and (2) the complex optimisation problem of selecting appropriate temporal windows in the absence of prior knowledge of cultivation calendars. We compare the lightweight Dynamic Time Warping (DTW) classification method with the heavily supervised Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) using high-resolution multispectral optical satellite imagery (3 m/pixel). Our approach integrates practical preprocessing steps, including data augmentation and a data-driven optimisation strategy for the temporal window, even in the presence of numerous crop classes. Our findings demonstrate that DTW, despite its lower data demands, can match the performance of CNN-LSTM through these preprocessing steps while significantly improving runtime. These results demonstrate that both CNN-LSTM and DTW can achieve deployment-level accuracy and underscore the potential of DTW as a viable alternative to more resource-intensive models. The results also show the effectiveness of temporal windowing for improving the runtime and accuracy of a crop classification study, even with no prior knowledge of planting timeframes.
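To make the DTW side of the comparison concrete, the following is a minimal sketch of the classic dynamic-programming DTW distance with a nearest-neighbour classifier on top. It is an illustration under stated assumptions, not the study's pipeline: the series are one-dimensional, the local cost is the absolute difference, no temporal window is applied, and the crop labels and values are invented toy data.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance (1-D series)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


def classify(series, references):
    """1-nearest-neighbour label under DTW; `references` is a list of (label, series)."""
    return min(references, key=lambda ref: dtw_distance(series, ref[1]))[0]


# Hypothetical per-class reference curves (e.g. a vegetation index over time).
references = [("wheat", [0, 1, 3, 1, 0]), ("maize", [0, 0, 1, 3, 3])]
label = classify([0, 1, 1, 3, 1, 0], references)
```

Because warping lets one time step align with several, the query above matches the hypothetical "wheat" curve exactly even though its growth phase is stretched by one step. This tolerance to shifted or stretched timing is what makes DTW attractive when cultivation calendars are unknown, and it needs only a few reference curves per class rather than a large labelled training set.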
Temporal Optimisation of Satellite Image-Based Crop Mapping: A Comparison of Deep Time Series and Semi-Supervised Time Warping Strategies. Rosie Finnegan, Joseph Metcalfe, Sara Sharifzadeh, Fabio Caraffini, Xianghua Xie, Alberto Hornero, Nicholas W. Synes. IET Computer Vision 19(1), published 2025-03-26. DOI: 10.1049/cvi2.70014. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70014
Haiyan Long, Hai Chen, Mengyao Xu, Chonghao Zhang, Fulan Qian
3D object detection, a currently popular research topic, perceives the surrounding environment through LiDAR and camera sensors to recognise the category and location of objects in the scene. Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples. Although some approaches have begun to investigate the robustness of 3D object detection models, they currently generate adversarial examples in a white-box setting, and there is a lack of research into generating transferable adversarial examples in a black-box setting. In this paper, a non-end-to-end attack algorithm is proposed for LiDAR pipelines that crafts transferable adversarial examples against 3D object detection. Specifically, the method generates adversarial examples by restraining features with high contribution to downstream tasks and amplifying features with low contribution to downstream tasks in the feature space. Extensive experiments validate that the method produces more transferable adversarial point clouds; for example, the method generates adversarial point clouds in the nuScenes dataset that are about 10