Spatial–temporal-channel collaborative feature learning with transformers for infrared small target detection

IF 4.2 | Zone 3 (Computer Science) | JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) | Image and Vision Computing | Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2025.105435

Sicheng Zhu, Luping Ji, Shengjia Chen, Weiwei Duan

Image and Vision Computing, Volume 154, Article 105435. Full text: https://www.sciencedirect.com/science/article/pii/S026288562500023X

Citations: 0

Abstract

Infrared small target detection is important in real-world applications, particularly military ones. However, it faces several notable challenges, such as limited target information. Because Convolutional Neural Networks (CNNs) operate on local neighborhoods, most CNN-based methods are inefficient at extracting and preserving global information, potentially leading to the loss of detailed information. In this work, we propose a transformer-based method named Spatial-Temporal-Channel collaborative feature learning network (STC). Recognizing the difficulty of detecting small targets from spatial information alone, we incorporate temporal and channel information into our approach. Unlike the Vision Transformer used in other vision tasks, our STC comprises three distinct transformer encoders that extract spatial, temporal and channel information, respectively, to obtain more accurate representations. Subsequently, a transformer decoder is employed to fuse the three attention features in a way akin to the human visual system. Additionally, we propose a new Semantic-Aware positional encoding method for video clips that incorporates temporal information into the positional encoding and is scale-invariant. Through multiple experiments and comparisons with current methods, we demonstrate the effectiveness of STC in addressing the challenges of infrared small target detection. Our source code is available at https://github.com/UESTC-nnLab/STC.
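
The three-encoder design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the linked repository is the reference for that); it only shows how self-attention can be applied along the spatial, temporal and channel axes of a clip tensor and then combined by cross-attention. The tensor layout (T, N, C), the function names `axis_attention` and `fuse`, the omission of learned projections and positional encoding, and the choice of spatial tokens as the decoder queries are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last (token) axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def axis_attention(clip, mode):
    """Self-attention over one axis of a clip shaped (T, N, C).

    T = frames, N = spatial tokens per frame, C = channels.
    Learned projections are omitted (Q = K = V) to keep the sketch short.
    """
    if mode == "spatial":       # tokens: N positions within each frame
        x = clip                                   # (T, N, C)
        return attention(x, x, x)
    if mode == "temporal":      # tokens: T frames at each position
        x = clip.transpose(1, 0, 2)                # (N, T, C)
        return attention(x, x, x).transpose(1, 0, 2)
    if mode == "channel":       # tokens: C channels within each frame
        x = clip.transpose(0, 2, 1)                # (T, C, N)
        return attention(x, x, x).transpose(0, 2, 1)
    raise ValueError(f"unknown mode: {mode}")


def fuse(spatial, temporal, channel):
    """Hypothetical stand-in for the decoder: spatial tokens query the
    temporal and channel features of the same frame via cross-attention."""
    memory = np.concatenate([temporal, channel], axis=1)   # (T, 2N, C)
    return attention(spatial, memory, memory)              # (T, N, C)


rng = np.random.default_rng(0)
clip = rng.standard_normal((5, 16, 8))    # 5 frames, 16 tokens, 8 channels
features = {m: axis_attention(clip, m) for m in ("spatial", "temporal", "channel")}
fused = fuse(features["spatial"], features["temporal"], features["channel"])
print(fused.shape)  # (5, 16, 8)
```

Each branch returns a tensor with the same shape as the input clip, so the three views can be combined token by token. A real model would add learned query, key and value projections, multiple heads, feed-forward layers and a positional encoding.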

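Source journal

Image and Vision Computing (Engineering Technology - Engineering: Electrical and Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months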

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.

Latest articles from this journal

Editorial Board

Early progression detection from MCI to AD using multi-view MRI for enhanced assisted living

An edge-aware high-resolution framework for camouflaged object detection

MUNet: A lightweight Mamba-based Under-Display Camera restoration network

Adaptive scale matching for remote sensing object detection based on aerial images