
Latest publications in IET Computer Vision

Efficient class-agnostic obstacle detection for UAV-assisted waterway inspection systems
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-25 | DOI: 10.1049/cvi2.12319
Pablo Alonso, Jon Ander Íñiguez de Gordoa, Juan Diego Ortega, Marcos Nieto

Ensuring the safety of water airport runways is essential for the correct operation of seaplane flights. Among other tasks, airport operators must identify and remove various objects that may have drifted into the runway area. In this paper, the authors propose a complete and embedded-friendly waterway obstacle detection pipeline that runs on a camera-equipped drone. The system uses a class-agnostic version of the YOLOv7 detector, which detects objects regardless of their class. Additionally, by combining the drone's GPS data with the camera parameters, the locations of the objects are pinpointed with a Distance Root Mean Square of 0.58 m. On the authors' own annotated dataset, the system generates alerts for detected objects with a recall of 0.833 and a precision of 1.0.
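The 0.58 m figure is a Distance Root Mean Square (DRMS) error over the estimated object positions. As a hedged illustration of how such a metric is computed (not the authors' code; the helper name and the sample coordinates below are invented), a minimal NumPy sketch:

```python
import numpy as np

def distance_rms(est_xy, gt_xy):
    """Distance Root Mean Square (DRMS) between estimated and ground-truth
    planar positions, in the same units as the inputs (e.g. metres)."""
    est_xy = np.asarray(est_xy, dtype=float)
    gt_xy = np.asarray(gt_xy, dtype=float)
    sq_err = np.sum((est_xy - gt_xy) ** 2, axis=1)   # dx^2 + dy^2 per object
    return float(np.sqrt(sq_err.mean()))

# Toy example: three detected objects, positions in metres (made-up values).
estimated = [(10.2, 4.9), (33.1, 7.6), (58.4, 12.3)]
ground_truth = [(10.0, 5.0), (32.8, 7.2), (58.9, 12.0)]
print(f"DRMS: {distance_rms(estimated, ground_truth):.2f} m")
```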

{"title":"Efficient class-agnostic obstacle detection for UAV-assisted waterway inspection systems","authors":"Pablo Alonso,&nbsp;Jon Ander Íñiguez de Gordoa,&nbsp;Juan Diego Ortega,&nbsp;Marcos Nieto","doi":"10.1049/cvi2.12319","DOIUrl":"https://doi.org/10.1049/cvi2.12319","url":null,"abstract":"<p>Ensuring the safety of water airport runways is essential for the correct operation of seaplane flights. Among other tasks, airport operators must identify and remove various objects that may have drifted into the runway area. In this paper, the authors propose a complete and embedded-friendly waterway obstacle detection pipeline that runs on a camera-equipped drone. This system uses a class-agnostic version of the YOLOv7 detector, which is capable of detecting objects regardless of its class. Additionally, through the usage of the GPS data of the drone and camera parameters, the location of the objects are pinpointed with 0.58 m Distance Root Mean Square. In our own annotated dataset, the system is capable of generating alerts for detected objects with a recall of 0.833 and a precision of 1.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1087-1096"},"PeriodicalIF":1.5,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12319","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
To crop or not to crop: Comparing whole-image and cropped classification on a large dataset of camera trap images
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-24 | DOI: 10.1049/cvi2.12318
Tomer Gadot, Ștefan Istrate, Hyungwon Kim, Dan Morris, Sara Beery, Tanya Birch, Jorge Ahumada

Camera traps facilitate non-invasive wildlife monitoring, but their widespread adoption has created a data processing bottleneck: a camera trap survey can create millions of images, and the labour required to review those images strains the resources of conservation organisations. AI is a promising approach for accelerating image review, but AI tools for camera trap data are imperfect; in particular, classifying small animals remains difficult, and accuracy falls off outside the ecosystems in which a model was trained. It has been proposed that incorporating an object detector into an image analysis pipeline may help address these challenges, but the benefit of object detection has not been systematically evaluated in the literature. In this work, the authors assess the hypothesis that classifying animals cropped from camera trap images using a species-agnostic detector yields better accuracy than classifying whole images. We find that incorporating an object detection stage into an image classification pipeline yields a macro-average F1 improvement of around 25% on a large, long-tailed dataset; this improvement is reproducible on a large public dataset and a smaller public benchmark dataset. The authors describe a classification architecture that performs well for both whole and detector-cropped images, and demonstrate that this architecture yields state-of-the-art benchmark accuracy.
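The comparison in the abstract boils down to where the classifier looks: the whole frame, or only the regions proposed by a species-agnostic detector. The sketch below outlines that two-stage crop-then-classify pipeline under stated assumptions: `detector` and `classifier` are hypothetical callables standing in for any animal detector and species classifier, and this is not the authors' implementation.

```python
from typing import Callable, List, Tuple
from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def classify_whole(image: Image.Image, classifier: Callable) -> List[str]:
    """Baseline: run the species classifier on the full camera-trap frame."""
    return [classifier(image)]

def classify_cropped(image: Image.Image,
                     detector: Callable[[Image.Image], List[Box]],
                     classifier: Callable) -> List[str]:
    """Detector-assisted variant: a species-agnostic detector proposes animal
    boxes, and the classifier only sees the cropped regions."""
    labels = []
    for box in detector(image):      # boxes from the class-agnostic detector
        crop = image.crop(box)       # classify each animal crop separately
        labels.append(classifier(crop))
    return labels
```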

{"title":"To crop or not to crop: Comparing whole-image and cropped classification on a large dataset of camera trap images","authors":"Tomer Gadot,&nbsp;Ștefan Istrate,&nbsp;Hyungwon Kim,&nbsp;Dan Morris,&nbsp;Sara Beery,&nbsp;Tanya Birch,&nbsp;Jorge Ahumada","doi":"10.1049/cvi2.12318","DOIUrl":"https://doi.org/10.1049/cvi2.12318","url":null,"abstract":"<p>Camera traps facilitate non-invasive wildlife monitoring, but their widespread adoption has created a data processing bottleneck: a camera trap survey can create millions of images, and the labour required to review those images strains the resources of conservation organisations. AI is a promising approach for accelerating image review, but AI tools for camera trap data are imperfect; in particular, classifying small animals remains difficult, and accuracy falls off outside the ecosystems in which a model was trained. It has been proposed that incorporating an object detector into an image analysis pipeline may help address these challenges, but the benefit of object detection has not been systematically evaluated in the literature. In this work, the authors assess the hypothesis that classifying animals cropped from camera trap images using a species-agnostic detector yields better accuracy than classifying whole images. We find that incorporating an object detection stage into an image classification pipeline yields a macro-average F1 improvement of around 25% on a large, long-tailed dataset; this improvement is reproducible on a large public dataset and a smaller public benchmark dataset. The authors describe a classification architecture that performs well for both whole and detector-cropped images, and demonstrate that this architecture yields state-of-the-art benchmark accuracy.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1193-1208"},"PeriodicalIF":1.5,"publicationDate":"2024-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12318","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A comprehensive research on light field imaging: Theory and application
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-22 | DOI: 10.1049/cvi2.12321
Fei Liu, Yunlong Wang, Qing Yang, Shubo Zhou, Kunbo Zhang

Computational photography combines novel optical designs and processing methods to capture high-dimensional visual information. As an emerging and promising technique, light field (LF) imaging measures the lighting, reflectance, focus, geometry and viewpoint in free space, and has been widely explored over the past decades for depth estimation, view synthesis, refocusing, rendering, 3D displays, microscopy and other applications in computer vision. In this paper, the authors present a comprehensive survey of LF imaging theory, technology and applications. Firstly, the LF imaging process based on a MicroLens Array (MLA) structure, termed MLA-LF, is derived. Subsequently, the innovations of LF imaging technology are presented in terms of imaging prototypes, consumer LF cameras and LF displays in Virtual Reality (VR) and Augmented Reality (AR). Finally, the applications and challenges of LF imaging combined with deep learning models in recent years are analysed, covering depth estimation, saliency detection, semantic segmentation, de-occlusion and defocus deblurring. It is believed that this paper will be a good reference for future research on LF imaging technology in the Artificial Intelligence era.
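For readers unfamiliar with LF notation, the block below recalls the standard two-plane parameterisation and the unnormalised digital refocusing integral commonly used in the LF literature; it is background material, not necessarily the exact MLA-LF derivation given in the paper.

```latex
% A ray is indexed by its intersections (u, v) with the aperture plane and
% (s, t) with the sensor plane; a microlens array samples this 4D function.
L = L(u, v, s, t)
% Digital refocusing to a virtual plane at relative depth \alpha integrates the
% sheared light field over the aperture (normalisation constants omitted):
E_{\alpha}(s, t) = \iint L\left(u,\, v,\, u + \frac{s - u}{\alpha},\, v + \frac{t - v}{\alpha}\right)\, \mathrm{d}u\, \mathrm{d}v
```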

{"title":"A comprehensive research on light field imaging: Theory and application","authors":"Fei Liu,&nbsp;Yunlong Wang,&nbsp;Qing Yang,&nbsp;Shubo Zhou,&nbsp;Kunbo Zhang","doi":"10.1049/cvi2.12321","DOIUrl":"https://doi.org/10.1049/cvi2.12321","url":null,"abstract":"<p>Computational photography is a combination of novel optical designs and processing methods to capture high-dimensional visual information. As an emerged promising technique, light field (LF) imaging measures the lighting, reflectance, focus, geometry and viewpoint in the free space, which has been widely explored for depth estimation, view synthesis, refocus, rendering, 3D displays, microscopy and other applications in computer vision in the past decades. In this paper, the authors present a comprehensive research survey on the LF imaging theory, technology and application. Firstly, the LF imaging process based on a MicroLens Array structure is derived, that is MLA-LF. Subsequently, the innovations of LF imaging technology are presented in terms of the imaging prototype, consumer LF camera and LF displays in Virtual Reality (VR) and Augmented Reality (AR). Finally the applications and challenges of LF imaging integrating with deep learning models are analysed, which consist of depth estimation, saliency detection, semantic segmentation, de-occlusion and defocus deblurring in recent years. It is believed that this paper will be a good reference for the future research on LF imaging technology in Artificial Intelligence era.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1269-1284"},"PeriodicalIF":1.5,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12321","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DEUFormer: High-precision semantic segmentation for urban remote sensing images
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-12 | DOI: 10.1049/cvi2.12313
Xinqi Jia, Xiaoyong Song, Lei Rao, Guangyu Fan, Songlin Cheng, Niansheng Chen

Urban remote sensing image semantic segmentation has a wide range of applications, such as urban planning, resource exploration and intelligent transportation. Although UNetFormer performs well by introducing the Transformer's self-attention mechanism, it still suffers from relatively low segmentation accuracy and significant edge segmentation errors. To this end, this paper proposes DEUFormer, which employs a special weighted-sum method to fuse the features of the encoder and the decoder, thus capturing both local details and global context information. Moreover, an Enhanced Feature Refinement Head is designed to finely re-weight features along the channel dimension and narrow the semantic gap between shallow and deep features, thereby enhancing multi-scale feature extraction. Additionally, an Edge-Guided Context Module is introduced to enhance edge areas through effective edge detection, which improves edge information extraction. Experimental results show that DEUFormer achieves a mean Intersection over Union (mIoU) of 53.8% on the LoveDA dataset and 69.1% on the UAVid dataset. Notably, the mIoU for buildings in the LoveDA dataset is 5.0% higher than that of UNetFormer. The proposed model outperforms methods such as UNetFormer on multiple datasets, demonstrating its effectiveness.
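As a hedged sketch of what "re-weighting features along the channel dimension" typically looks like, the snippet below implements a generic squeeze-and-excitation-style gate in PyTorch; the module name and layer sizes are assumptions for illustration, and the paper's Enhanced Feature Refinement Head is more elaborate than this.

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Generic channel re-weighting gate: pool global context, predict one
    weight per channel, and scale the feature map. Illustration only."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # re-weighted feature map

feat = torch.randn(2, 64, 32, 32)                      # e.g. a decoder feature map
print(ChannelReweight(64)(feat).shape)                 # torch.Size([2, 64, 32, 32])
```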

{"title":"DEUFormer: High-precision semantic segmentation for urban remote sensing images","authors":"Xinqi Jia,&nbsp;Xiaoyong Song,&nbsp;Lei Rao,&nbsp;Guangyu Fan,&nbsp;Songlin Cheng,&nbsp;Niansheng Chen","doi":"10.1049/cvi2.12313","DOIUrl":"https://doi.org/10.1049/cvi2.12313","url":null,"abstract":"<p>Urban remote sensing image semantic segmentation has a wide range of applications, such as urban planning, resource exploration, intelligent transportation, and other scenarios. Although UNetFormer performs well by introducing the self-attention mechanism of Transformer, it still faces challenges arising from relatively low segmentation accuracy and significant edge segmentation errors. To this end, this paper proposes DEUFormer by employing a special weighted sum method to fuse the features of the encoder and the decoder, thus capturing both local details and global context information. Moreover, an Enhanced Feature Refinement Head is designed to finely re-weight features on the channel dimension and narrow the semantic gap between shallow and deep features, thereby enhancing multi-scale feature extraction. Additionally, an Edge-Guided Context Module is introduced to enhance edge areas through effective edge detection, which can improve edge information extraction. Experimental results show that DEUFormer achieves an average Mean Intersection over Union (mIoU) of 53.8% on the LoveDA dataset and 69.1% on the UAVid dataset. Notably, the mIoU of buildings in the LoveDA dataset is 5.0% higher than that of UNetFormer. The proposed model outperforms methods such as UNetFormer on multiple datasets, which demonstrates its effectiveness.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1209-1222"},"PeriodicalIF":1.5,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12313","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient transformer tracking with adaptive attention
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-07 | DOI: 10.1049/cvi2.12315
Dingkun Xiao, Zhenzhong Wei, Guangjun Zhang

Recently, several trackers utilising the Transformer architecture have shown significant performance improvements. However, the high computational cost of multi-head attention, a core component of the Transformer, limits real-time running speed, which is crucial for tracking tasks. Additionally, the global mechanism of multi-head attention makes it susceptible to distractors with semantic information similar to the target. To address these issues, the authors propose a novel adaptive attention mechanism that enhances features through spatial sparse attention with less than 1/4 of the computational complexity of multi-head attention. The adaptive attention sets a perception range around each element in the feature map based on the target scale in the previous tracking result and adaptively searches for the information of interest. This allows the module to focus on the target region rather than background distractors. Based on adaptive attention, the authors build an efficient transformer tracking framework. It performs deep interaction between search and template features to activate target information and aggregates multi-level interaction features to enhance representation ability. Evaluation results on seven benchmarks show that the authors' tracker achieves outstanding performance at 43 fps, with significant advantages in challenging circumstances.
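To make the spatial-sparse idea concrete, the sketch below masks standard attention so that each query attends only to keys within a fixed window of itself; in the paper the window would follow the previous target scale, and the real module avoids building the full attention matrix, so this is only an illustration of the sparsity pattern (function name, window size and tensor shapes are assumptions), not an efficient or faithful re-implementation.

```python
import torch

def local_window_attention(q, k, v, window: int):
    """Each query position attends only to key positions within `window` pixels
    (Chebyshev distance) of itself, instead of to the whole feature map.
    q, k, v: (B, H*W, C) flattened feature maps of spatial size H x W."""
    B, N, C = q.shape
    H = W = int(N ** 0.5)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()      # (N, 2)
    dist = (pos[:, None, :] - pos[None, :, :]).abs().max(-1).values      # (N, N)
    mask = dist <= window                                                # allowed pairs
    attn = (q @ k.transpose(-2, -1)) / C ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return attn @ v

q = k = v = torch.randn(1, 16 * 16, 32)        # toy 16x16 feature map, 32 channels
out = local_window_attention(q, k, v, window=3)
print(out.shape)                                # torch.Size([1, 256, 32])
```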

{"title":"Efficient transformer tracking with adaptive attention","authors":"Dingkun Xiao,&nbsp;Zhenzhong Wei,&nbsp;Guangjun Zhang","doi":"10.1049/cvi2.12315","DOIUrl":"https://doi.org/10.1049/cvi2.12315","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <p>Recently, several trackers utilising Transformer architecture have shown significant performance improvement. However, the high computational cost of multi-head attention, a core component in the Transformer, has limited real-time running speed, which is crucial for tracking tasks. Additionally, the global mechanism of multi-head attention makes it susceptible to distractors with similar semantic information to the target. To address these issues, the authors propose a novel adaptive attention that enhances features through the spatial sparse attention mechanism with less than 1/4 of the computational complexity of multi-head attention. Our adaptive attention sets a perception range around each element in the feature map based on the target scale in the previous tracking result and adaptively searches for the information of interest. This allows the module to focus on the target region rather than background distractors. Based on adaptive attention, the authors build an efficient transformer tracking framework. It can perform deep interaction between search and template features to activate target information and aggregate multi-level interaction features to enhance the representation ability. The evaluation results on seven benchmarks show that the authors’ tracker achieves outstanding performance with a speed of 43 fps and significant advantages in hard circumstances.</p>\u0000 </section>\u0000 </div>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1338-1350"},"PeriodicalIF":1.5,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12315","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143248922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-scale feature extraction for energy-efficient object detection in remote sensing images
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-30 | DOI: 10.1049/cvi2.12317
Di Wu, Hongning Liu, Jiawei Xu, Fei Xie

Object detection in remote sensing images aims to interpret images to obtain the category and location of potential targets, which is of great importance in traffic detection, marine supervision and space reconnaissance. However, the complex backgrounds and large scale variations in remote sensing images present significant challenges. Traditional methods rely mainly on image filtering or feature descriptors to extract features and therefore underperform. Deep learning methods, especially one-stage detectors such as the Real-Time Object Detector (RTMDet), offer advanced solutions with efficient network architectures. Nevertheless, the difficulty of extracting features from complex backgrounds and of localising targets across large scale variations limits detection accuracy. In this paper, an improved detector based on RTMDet, called the Multi-Scale Feature Extraction-assist RTMDet (MRTMDet), is proposed, which addresses these limitations through enhanced feature extraction and fusion networks. At the core of MRTMDet are a new backbone network, MobileViT++, and a feature fusion network, SFC-FPN, which enhance the model's ability to capture global and multi-scale features through carefully designed hybrid CNN-Transformer feature processing units based on the vision transformer (ViT) and poly-scale convolution (PSConv), respectively. Experiments on DIOR-R demonstrate that MRTMDet achieves a competitive 62.2% mAP, balancing precision with a lightweight design.
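As a rough, hedged stand-in for the multi-scale idea behind PSConv, the snippet below runs parallel 3x3 convolutions with different dilation rates and sums them; the actual poly-scale convolution interleaves dilation rates inside a single convolution, so this is an approximation for illustration only (class name and dilation set are assumptions).

```python
import torch
import torch.nn as nn

class MultiDilationConv(nn.Module):
    """Parallel 3x3 branches with different dilation rates capture features at
    several receptive-field sizes, then are summed. Illustration only; not the
    exact PSConv formulation."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return sum(branch(x) for branch in self.branches)

x = torch.randn(1, 32, 64, 64)                 # a remote-sensing feature map
print(MultiDilationConv(32, 64)(x).shape)      # torch.Size([1, 64, 64, 64])
```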

{"title":"Multi-scale feature extraction for energy-efficient object detection in remote sensing images","authors":"Di Wu,&nbsp;Hongning Liu,&nbsp;Jiawei Xu,&nbsp;Fei Xie","doi":"10.1049/cvi2.12317","DOIUrl":"https://doi.org/10.1049/cvi2.12317","url":null,"abstract":"<p>Object detection in remote sensing images aims to interpret images to obtain information on the category and location of potential targets, which is of great importance in traffic detection, marine supervision, and space reconnaissance. However, the complex backgrounds and large scale variations in remote sensing images present significant challenges. Traditional methods relied mainly on image filtering or feature descriptor methods to extract features, resulting in underperformance. Deep learning methods, especially one-stage detectors, for example, the Real-Time Object Detector (RTMDet) offers advanced solutions with efficient network architectures. Nevertheless, difficulty in feature extraction from complex backgrounds and target localisation in scale variations images limits detection accuracy. In this paper, an improved detector based on RTMDet, called the Multi-Scale Feature Extraction-assist RTMDet (MRTMDet), is proposed which address limitations through enhancement feature extraction and fusion networks. At the core of MRTMDet is a new backbone network MobileViT++ and a feature fusion network SFC-FPN, which enhances the model's ability to capture global and multi-scale features by carefully designing a hybrid feature processing unit of CNN and a transformer based on vision transformer (ViT) and poly-scale convolution (PSConv), respectively. The experiment in DIOR-R demonstrated that MRTMDet achieves competitive performance of 62.2% mAP, balancing precision with a lightweight design.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1223-1234"},"PeriodicalIF":1.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12317","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A survey on person and vehicle re-identification
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-28 | DOI: 10.1049/cvi2.12316
Zhaofa Wang, Liyang Wang, Zhiping Shi, Miaomiao Zhang, Qichuan Geng, Na Jiang

Person/vehicle re-identification aims to use technologies such as cross-camera retrieval to associate the same person (or vehicle) across surveillance videos captured at different locations, at different times and by different cameras, so as to achieve cross-camera image matching, person retrieval and trajectory tracking. It plays an extremely important role in fields such as intelligent security and criminal investigation. In recent years, the rapid development of deep learning has significantly propelled the advancement of re-identification (Re-ID) technology, and an increasing number of methods have emerged that aim to enhance Re-ID performance. This paper summarises four popular research areas in the field of re-identification, focusing on current research hotspots: the multi-task learning domain, the generalisation learning domain, the cross-modality domain and the optimisation learning domain. Specifically, the paper analyses the challenges faced within these domains and elaborates on the deep learning frameworks and networks that address them. A comparative analysis of re-identification tasks from various classification perspectives is provided, introducing mainstream research directions and current achievements. Finally, insights into future development trends are presented.

{"title":"A survey on person and vehicle re-identification","authors":"Zhaofa Wang,&nbsp;Liyang Wang,&nbsp;Zhiping Shi,&nbsp;Miaomiao Zhang,&nbsp;Qichuan Geng,&nbsp;Na Jiang","doi":"10.1049/cvi2.12316","DOIUrl":"https://doi.org/10.1049/cvi2.12316","url":null,"abstract":"<p>Person/vehicle re-identification aims to use technologies such as cross-camera retrieval to associate the same person (same vehicle) in the surveillance videos at different locations, different times, and images captured by different cameras so as to achieve cross-surveillance image matching, person retrieval and trajectory tracking. It plays an extremely important role in the fields of intelligent security, criminal investigation etc. In recent years, the rapid development of deep learning technology has significantly propelled the advancement of re-identification (Re-ID) technology. An increasing number of technical methods have emerged, aiming to enhance Re-ID performance. This paper summarises four popular research areas in the current field of re-identification, focusing on the current research hotspots. These areas include the multi-task learning domain, the generalisation learning domain, the cross-modality domain, and the optimisation learning domain. Specifically, the paper analyses various challenges faced within these domains and elaborates on different deep learning frameworks and networks that address these challenges. A comparative analysis of re-identification tasks from various classification perspectives is provided, introducing mainstream research directions and current achievements. Finally, insights into future development trends are presented.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1235-1268"},"PeriodicalIF":1.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12316","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Occluded object 6D pose estimation using foreground probability compensation
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-17 | DOI: 10.1049/cvi2.12314
Meihui Ren, Junying Jia, Xin Lu

6D object pose estimation usually refers to acquiring the 6D pose information of 3D objects in the sensor coordinate system using computer vision techniques. However, the task faces numerous challenges due to the complexity of natural scenes. One of the most significant challenges is occlusion, which is an unavoidable situation in 3D scenes and poses a significant obstacle in real-world applications. To tackle this issue, we propose a novel 6D pose estimation algorithm based on RGB-D images, aiming for enhanced robustness in occluded environments. Our approach follows the basic architecture of keypoint-based pose estimation algorithms. To better leverage complementary information of RGB-D data, we introduce a novel foreground probability-guided sampling strategy at the network's input stage. This strategy mitigates the sampling ratio imbalance between foreground and background points due to smaller foreground objects in occluded environments. Moreover, considering the impact of occlusion on semantic segmentation networks, we introduce a new object segmentation module. This module utilises traditional image processing techniques to compensate for severe semantic segmentation errors of deep learning networks. We evaluate our algorithm using the Occlusion LineMOD public dataset. Experimental results demonstrate that our method is more robust in occlusion environments compared to existing state-of-the-art algorithms. It maintains stable performance even in scenarios with no or low occlusion.
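The foreground probability-guided sampling can be pictured as importance sampling of the point cloud: points with higher predicted foreground probability are drawn more often, so a small occluded object is not drowned out by background points. A minimal sketch under that reading (function name and tensor shapes are assumptions, not the authors' exact strategy):

```python
import torch

def foreground_guided_sampling(points: torch.Tensor,
                               fg_prob: torch.Tensor,
                               n_samples: int) -> torch.Tensor:
    """Sample a fixed number of scene points with probability proportional to
    their predicted foreground probability.
    points:  (N, 3) point cloud lifted from the depth image
    fg_prob: (N,)   per-point foreground probability in [0, 1]"""
    weights = fg_prob.clamp_min(1e-6)          # keep a tiny chance for background
    idx = torch.multinomial(weights, n_samples, replacement=True)
    return points[idx]

points = torch.randn(5000, 3)
fg_prob = torch.rand(5000)
sampled = foreground_guided_sampling(points, fg_prob, n_samples=1024)
print(sampled.shape)                            # torch.Size([1024, 3])
```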

{"title":"Occluded object 6D pose estimation using foreground probability compensation","authors":"Meihui Ren,&nbsp;Junying Jia,&nbsp;Xin Lu","doi":"10.1049/cvi2.12314","DOIUrl":"https://doi.org/10.1049/cvi2.12314","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <p>6D object pose estimation usually refers to acquiring the 6D pose information of 3D objects in the sensor coordinate system using computer vision techniques. However, the task faces numerous challenges due to the complexity of natural scenes. One of the most significant challenges is occlusion, which is an unavoidable situation in 3D scenes and poses a significant obstacle in real-world applications. To tackle this issue, we propose a novel 6D pose estimation algorithm based on RGB-D images, aiming for enhanced robustness in occluded environments. Our approach follows the basic architecture of keypoint-based pose estimation algorithms. To better leverage complementary information of RGB-D data, we introduce a novel foreground probability-guided sampling strategy at the network's input stage. This strategy mitigates the sampling ratio imbalance between foreground and background points due to smaller foreground objects in occluded environments. Moreover, considering the impact of occlusion on semantic segmentation networks, we introduce a new object segmentation module. This module utilises traditional image processing techniques to compensate for severe semantic segmentation errors of deep learning networks. We evaluate our algorithm using the Occlusion LineMOD public dataset. Experimental results demonstrate that our method is more robust in occlusion environments compared to existing state-of-the-art algorithms. It maintains stable performance even in scenarios with no or low occlusion.</p>\u0000 </section>\u0000 </div>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1325-1337"},"PeriodicalIF":1.5,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12314","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Real-time semantic segmentation network for crops and weeds based on multi-branch structure
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-01 | DOI: 10.1049/cvi2.12311
Yufan Liu, Muhua Liu, Xuhui Zhao, Junlong Zhu, Lin Wang, Hao Ma, Mingchuan Zhang

Weed recognition is an unavoidable problem in smart agriculture, and the main challenges to efficient weed recognition are complex backgrounds, insufficient feature information, varying target sizes and overlap between crops and weeds. To address these problems, the authors propose a real-time semantic segmentation network based on a multi-branch structure for recognising crops and weeds. First, a new backbone network is constructed to capture feature information of crops and weeds at different sizes. Second, the authors propose a weight refinement fusion (WRF) module to enhance the feature extraction ability for crops and weeds and reduce the interference caused by complex backgrounds. Finally, a Semantic Guided Fusion module is devised to enhance the interaction of information between crops and weeds and reduce the interference caused by overlapping targets. The experimental results demonstrate that the proposed network balances speed and accuracy, achieving Mean IoU (MIoU) scores of 0.713, 0.802, 0.746 and 0.906 on the sugar beet (BoniRob) dataset, the synthetic BoniRob dataset, the CWFID dataset and a self-labelled wheat dataset, respectively.
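The MIoU scores quoted above are the per-class Intersection over Union averaged over classes. A small, generic implementation of that metric (illustrative only; the label values in the toy example are made up):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union over all classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 4x4 label maps with classes {0: soil, 1: crop, 2: weed} (made-up data).
pred = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [2, 2, 0, 0], [2, 2, 0, 0]])
target = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 2, 0], [2, 2, 0, 0]])
print(f"MIoU: {mean_iou(pred, target, num_classes=3):.3f}")
```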

{"title":"Real-time semantic segmentation network for crops and weeds based on multi-branch structure","authors":"Yufan Liu,&nbsp;Muhua Liu,&nbsp;Xuhui Zhao,&nbsp;Junlong Zhu,&nbsp;Lin Wang,&nbsp;Hao Ma,&nbsp;Mingchuan Zhang","doi":"10.1049/cvi2.12311","DOIUrl":"https://doi.org/10.1049/cvi2.12311","url":null,"abstract":"<p>Weed recognition is an inevitable problem in smart agriculture, and to realise efficient weed recognition, complex background, insufficient feature information, varying target sizes and overlapping crops and weeds are the main problems to be solved. To address these problems, the authors propose a real-time semantic segmentation network based on a multi-branch structure for recognising crops and weeds. First, a new backbone network for capturing feature information between crops and weeds of different sizes is constructed. Second, the authors propose a weight refinement fusion (WRF) module to enhance the feature extraction ability of crops and weeds and reduce the interference caused by the complex background. Finally, a Semantic Guided Fusion is devised to enhance the interaction of information between crops and weeds and reduce the interference caused by overlapping goals. The experimental results demonstrate that the proposed network can balance speed and accuracy. Specifically, the 0.713 Mean IoU (MIoU), 0.802 MIoU, 0.746 MIoU and 0.906 MIoU can be achieved on the sugar beet (BoniRob) dataset, synthetic BoniRob dataset, CWFID dataset and self-labelled wheat dataset, respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1313-1324"},"PeriodicalIF":1.5,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12311","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143248083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Leveraging modality-specific and shared features for RGB-T salient object detection
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-25 | DOI: 10.1049/cvi2.12307
Shuo Wang, Gang Yang, Qiqi Xu, Xun Dai

Most existing RGB-T salient object detection methods are based on a dual-stream encoding, single-stream decoding network architecture. These models rely on the quality of the fused features, which often emphasise modality-shared features and overlook modality-specific features, thus failing to fully utilise the rich information contained in multi-modality data. To this end, a modality-separate tri-stream net (MSTNet) is proposed, which consists of a tri-stream encoding (TSE) structure and a tri-stream decoding (TSD) structure. The TSE explicitly separates and extracts the modality-shared and modality-specific features to improve the utilisation of multi-modality data. In addition, based on hybrid-attention and cross-attention mechanisms, the authors design an enhanced complementary fusion module (ECF), which fully considers the complementarity between the features to be fused and realises high-quality feature fusion. Furthermore, in the TSD, the quality of the uni-modality features is ensured under the constraint of supervision. Finally, to make full use of the rich multi-level and multi-scale decoding features contained in the TSD, the authors design an adaptive multi-scale decoding module and a multi-stream feature aggregation module to improve the decoding capability. Extensive experiments on three public datasets show that MSTNet outperforms 14 state-of-the-art methods, demonstrating that the method extracts and utilises multi-modality information more adequately and obtains more complete and richer features, thus improving performance. The code will be released at https://github.com/JOOOOKII/MSTNet.
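The enhanced complementary fusion module builds on cross-attention between the two modalities. As a hedged sketch of the general pattern, the snippet below lets RGB tokens query the thermal stream through a standard multi-head cross-attention layer with a residual connection; the class name, head count and shapes are assumptions, and the actual ECF module is considerably richer.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One direction of cross-attention: RGB features query the thermal stream
    so that complementary thermal cues are injected into the RGB branch."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, thermal_tokens: torch.Tensor):
        # rgb_tokens, thermal_tokens: (B, N, C) flattened feature maps
        fused, _ = self.attn(query=rgb_tokens, key=thermal_tokens, value=thermal_tokens)
        return self.norm(rgb_tokens + fused)           # residual fusion

rgb = torch.randn(2, 196, 64)       # e.g. 14x14 RGB feature tokens
thermal = torch.randn(2, 196, 64)   # corresponding thermal tokens
print(CrossModalAttention(64)(rgb, thermal).shape)     # torch.Size([2, 196, 64])
```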

{"title":"Leveraging modality-specific and shared features for RGB-T salient object detection","authors":"Shuo Wang,&nbsp;Gang Yang,&nbsp;Qiqi Xu,&nbsp;Xun Dai","doi":"10.1049/cvi2.12307","DOIUrl":"https://doi.org/10.1049/cvi2.12307","url":null,"abstract":"<p>Most of the existing RGB-T salient object detection methods are usually based on dual-stream encoding single-stream decoding network architecture. These models always rely on the quality of fusion features, which often focus on modality-shared features and overlook modality-specific features, thus failing to fully utilise the rich information contained in multi-modality data. To this end, a modality separate tri-stream net (MSTNet), which consists of a tri-stream encoding (TSE) structure and a tri-stream decoding (TSD) structure is proposed. The TSE explicitly separates and extracts the modality-shared and modality-specific features to improve the utilisation of multi-modality data. In addition, based on the hybrid-attention and cross-attention mechanism, we design an enhanced complementary fusion module (ECF), which fully considers the complementarity between the features to be fused and realises high-quality feature fusion. Furthermore, in TSD, the quality of uni-modality features is ensured under the constraint of supervision. Finally, to make full use of the rich multi-level and multi-scale decoding features contained in TSD, the authors design the adaptive multi-scale decoding module and the multi-stream feature aggregation module to improve the decoding capability. Extensive experiments on three public datasets show that the MSTNet outperforms 14 state-of-the-art methods, demonstrating that this method can extract and utilise the multi-modality information more adequately and extract more complete and rich features, thus improving the model's performance. The code will be released at https://github.com/JOOOOKII/MSTNet.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1285-1299"},"PeriodicalIF":1.5,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12307","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0