The Diffusion Probabilistic Model (DM) has emerged as a powerful generative model in the field of image synthesis, capable of producing high-quality and realistic images. However, training a DM requires a large and diverse dataset, which can be challenging to obtain, and this limitation weakens the model's generalisation and robustness when training data is limited. To address this issue, the authors propose EDG-CDM, an encoder-guided conditional diffusion model for image synthesis with limited data. First, the encoder is pre-trained by introducing noise to capture the distribution of image features and produce the condition vector through contrastive learning and KL divergence. Next, the encoder undergoes further training with a classification objective to integrate image class information, providing more favourable and versatile conditions for the diffusion model. Subsequently, the encoder is connected to the diffusion model, which is trained on all available data with encoder-provided conditions. Finally, EDG-CDM is evaluated on various public datasets with limited data, with extensive experiments comparing it against state-of-the-art methods using Fréchet Inception Distance (FID) and Inception Score (IS). The experiments demonstrate that EDG-CDM outperforms existing models by consistently achieving the lowest FID and the highest IS, highlighting its effectiveness in generating high-quality and diverse images from limited training data. These results underscore the significance of EDG-CDM in advancing image synthesis under data-constrained scenarios.
{"title":"EDG-CDM: A New Encoder-Guided Conditional Diffusion Model-Based Image Synthesis Method for Limited Data","authors":"Haopeng Lei, Hao Yin, Kaijun Liang, Mingwen Wang, Jinshan Zeng, Guoliang Luo","doi":"10.1049/cvi2.70018","DOIUrl":"10.1049/cvi2.70018","url":null,"abstract":"<p>The Diffusion Probabilistic Model (DM) has emerged as a powerful generative model in the field of image synthesis, capable of producing high-quality and realistic images. However, training DM requires a large and diverse dataset, which can be challenging to obtain. This limitation weakens the model's generalisation and robustness when training data is limited. To address this issue, EDG-CDM, an innovative encoder-guided conditional diffusion model was proposed for image synthesis with limited data. Firstly, the authors pre-train the encoder by introducing noise to capture the distribution of image features and generate the condition vector through contrastive learning and KL divergence. Next, the encoder undergoes further training with classification to integrate image class information, providing more favourable and versatile conditions for the diffusion model. Subsequently, the encoder is connected to the diffusion model, which is trained using all available data with encoder-provided conditions. Finally, the authors evaluate EDG-CDM on various public datasets with limited data, conducting extensive experiments and comparing our results with state-of-the-art methods using metrics such as Fréchet Inception Distance and Inception Score. Our experiments demonstrate that EDG-CDM outperforms existing models by consistently achieving the lowest FID scores and the highest IS scores, highlighting its effectiveness in generating high-quality and diverse images with limited training data. These results underscore the significance of EDG-CDM in advancing image synthesis techniques under data-constrained scenarios.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70018","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143801593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We focus on species recognition in Insecta, as insects are critical for biodiversity monitoring and sit at the base of many ecosystems. Through citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time consuming, which is where computer vision comes in. The field offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm best suited to our application? To answer this question, we provide a full and detailed evaluation of nine algorithms spanning deep convolutional networks (CNNs), vision transformers (ViTs) and locality-based vision transformers (LBVTs) on four aspects: classification performance, embedding quality, computational cost and gradient activity. We offer insights not previously available in this domain, showing to what extent these algorithms solve fine-grained tasks in Insecta. We found that ViTs perform best on inference speed and computational cost, whereas LBVTs outperform the others on classification performance and embedding quality; CNNs provide a trade-off among the metrics.
{"title":"Performance of Computer Vision Algorithms for Fine-Grained Classification Using Crowdsourced Insect Images","authors":"Rita Pucci, Vincent J. Kalkman, Dan Stowell","doi":"10.1049/cvi2.70006","DOIUrl":"10.1049/cvi2.70006","url":null,"abstract":"<p>With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We are focusing on species recognition in Insecta as they are critical for biodiversity monitoring and at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that is in line with our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNN), vision transformers (ViT) and locality-based vision transformers (LBVT) on 4 different aspects: classification performance, embedding quality, computational cost and gradient activity. We offer insights that we have not yet had in this domain proving to which extent these algorithms solve the fine-grained tasks in Insecta. We found that ViT performs the best on inference speed and computational cost, whereas LBVT outperforms the others on performance and embedding quality; the CNN provide a trade-off among the metrics.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70006","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143778248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camouflaged object detection (COD) aims to identify and segment objects that closely resemble and are seamlessly integrated into their surrounding environments, making it a challenging task in computer vision. COD is constrained by the limited availability of training data and annotated samples, and most carefully designed COD models exhibit diminished performance under low-data conditions. In recent years, there has been increasing interest in leveraging foundation models, which have demonstrated robust general capabilities and superior generalisation performance, to address COD challenges. This work proposes a knowledge-guided domain adaptation (KGDA) approach to tackle the data scarcity problem in COD. The method utilises knowledge descriptions generated by multimodal large language models (MLLMs) for camouflaged images, aiming to enhance the model's comprehension of semantic objects and camouflaged scenes through highly abstract and generalised knowledge representations. To resolve ambiguities and errors in the generated text descriptions, a multi-level knowledge aggregation (MLKG) module is devised; it consolidates consistent semantic knowledge and forms multi-level semantic knowledge features. To incorporate this semantic knowledge into the visual foundation model, the authors introduce a knowledge-guided semantic enhancement adaptor (KSEA) that integrates the semantic knowledge of camouflaged objects while preserving the original knowledge of the foundation model. Extensive experiments demonstrate that the proposed method surpasses 19 state-of-the-art approaches and exhibits strong generalisation even with limited annotated data.
{"title":"Foundation Model Based Camouflaged Object Detection","authors":"Zefeng Chen, Zhijiang Li, Yunqi Xue, Li Zhang","doi":"10.1049/cvi2.70009","DOIUrl":"10.1049/cvi2.70009","url":null,"abstract":"<p>Camouflaged object detection (COD) aims to identify and segment objects that closely resemble and are seamlessly integrated into their surrounding environments, making it a challenging task in computer vision. COD is constrained by the limited availability of training data and annotated samples, and most carefully designed COD models exhibit diminished performance under low-data conditions. In recent years, there has been increasing interest in leveraging foundation models, which have demonstrated robust general capabilities and superior generalisation performance, to address COD challenges. This work proposes a knowledge-guided domain adaptation (KGDA) approach to tackle the data scarcity problem in COD. The method utilises the knowledge descriptions generated by multimodal large language models (MLLMs) for camouflaged images, aiming to enhance the model's comprehension of semantic objects and camouflaged scenes through highly abstract and generalised knowledge representations. To resolve ambiguities and errors in the generated text descriptions, a multi-level knowledge aggregation (MLKG) module is devised. This module consolidates consistent semantic knowledge and forms multi-level semantic knowledge features. To incorporate semantic knowledge into the visual foundation model, the authors introduce a knowledge-guided semantic enhancement adaptor (KSEA) that integrates the semantic knowledge of camouflaged objects while preserving the original knowledge of the foundation model. Extensive experiments demonstrate that our method surpasses 19 state-of-the-art approaches and exhibits strong generalisation capabilities even with limited annotated data.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70009","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143749464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rosie Finnegan, Joseph Metcalfe, Sara Sharifzadeh, Fabio Caraffini, Xianghua Xie, Alberto Hornero, Nicholas W. Synes
This study presents a novel approach to crop mapping using remotely sensed satellite images. It addresses two significant classification modelling challenges: (1) the requirement for extensive labelled data and (2) the complex optimisation problem of selecting appropriate temporal windows in the absence of prior knowledge of cultivation calendars. We compare the lightweight Dynamic Time Warping (DTW) classification method with the heavily supervised Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) model using high-resolution multispectral optical satellite imagery (3 m/pixel). Our approach integrates effective practical preprocessing steps, including data augmentation and a data-driven optimisation strategy for the temporal window, even in the presence of numerous crop classes. Our findings demonstrate that DTW, despite its lower data demands, can match the performance of CNN-LSTM through these preprocessing steps while significantly improving runtime. These results show that both CNN-LSTM and DTW can achieve deployment-level accuracy and underscore the potential of DTW as a viable alternative to more resource-intensive models. The results also demonstrate the effectiveness of temporal windowing for improving the runtime and accuracy of a crop classification study, even with no prior knowledge of planting timeframes.
{"title":"Temporal Optimisation of Satellite Image-Based Crop Mapping: A Comparison of Deep Time Series and Semi-Supervised Time Warping Strategies","authors":"Rosie Finnegan, Joseph Metcalfe, Sara Sharifzadeh, Fabio Caraffini, Xianghua Xie, Alberto Hornero, Nicholas W. Synes","doi":"10.1049/cvi2.70014","DOIUrl":"10.1049/cvi2.70014","url":null,"abstract":"<p>This study presents a novel approach to crop mapping using remotely sensed satellite images. It addresses the significant classification modelling challenges, including (1) the requirements for extensive labelled data and (2) the complex optimisation problem for selection of appropriate temporal windows in the absence of prior knowledge of cultivation calendars. We compare the lightweight Dynamic Time Warping (DTW) classification method with the heavily supervised Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) using high-resolution multispectral optical satellite imagery (3 m/pixel). Our approach integrates effective practical preprocessing steps, including data augmentation and a data-driven optimisation strategy for the temporal window, even in the presence of numerous crop classes. Our findings demonstrate that DTW, despite its lower data demands, can match the performance of CNN-LSTM through our effective preprocessing steps while significantly improving runtime. These results demonstrate that both CNN-LSTM and DTW can achieve deployment-level accuracy and underscore the potential of DTW as a viable alternative to more resource-intensive models. The results also prove the effectiveness of temporal windowing for improving runtime and accuracy of a crop classification study, even with no prior knowledge of planting timeframes.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70014","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143707264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haiyan Long, Hai Chen, Mengyao Xu, Chonghao Zhang, Fulan Qian
3D object detection, which perceives the surrounding environment through LiDAR and camera sensors to recognise the category and location of objects in a scene, is currently a popular research topic. Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples. Although some approaches have begun to investigate the robustness of 3D object detection models, they currently generate adversarial examples in a white-box setting, and there is a lack of research into generating transferable adversarial examples in a black-box setting. In this paper, the authors propose a non-end-to-end attack algorithm for LiDAR pipelines that crafts transferable adversarial examples against 3D object detection. Specifically, the method generates adversarial examples by restraining features with a high contribution to downstream tasks and amplifying features with a low contribution to downstream tasks in the feature space. Extensive experiments validate that the method produces more transferable adversarial point clouds; for example, the method generates adversarial point clouds in the nuScenes dataset that are about 10
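To illustrate the feature-space idea described above (a minimal sketch, not the paper's algorithm), the code below estimates per-channel contribution from gradients of a surrogate detector's loss and builds an attack objective that suppresses high-contribution channels while amplifying low-contribution ones; the BEV feature shape (B, C, H, W), the gradient-based contribution estimate and the top-ratio split are all assumptions.

```python
import torch

def feature_importance(features, task_loss):
    """Estimate per-channel contribution of intermediate features to the downstream
    detection loss via gradient magnitude (one plausible choice, not the paper's exact one).
    `features` is assumed to be a (B, C, H, W) BEV feature map with requires_grad=True."""
    grads = torch.autograd.grad(task_loss, features, retain_graph=True)[0]
    return grads.abs().mean(dim=[0, 2, 3])          # (C,) channel-wise contribution scores

def feature_space_attack_loss(features, contribution, top_ratio=0.2):
    """Restrain high-contribution channels and amplify low-contribution ones."""
    c = features.size(1)
    k = max(1, int(top_ratio * c))
    high = contribution.topk(k).indices              # most useful channels for detection
    low = (-contribution).topk(k).indices            # least useful channels
    # minimise energy of the useful channels, maximise energy of the least useful ones
    return features[:, high].pow(2).mean() - features[:, low].pow(2).mean()
```

In a full attack, a perturbation on the input point cloud would be optimised to minimise this objective against a surrogate (white-box) detector, and the resulting adversarial point cloud then transferred to unseen black-box detectors.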