Pub Date : 2024-11-19 DOI: 10.1109/TIP.2024.3497795
Qiang Qu;Xiaoming Chen;Yuk Ying Chung;Yiran Shen
Event-stream representation is the first step for many computer vision tasks that use event cameras. It converts asynchronous event-streams into a formatted structure so that conventional machine learning models can be applied easily. However, most state-of-the-art event-stream representations are manually designed, and their quality cannot be guaranteed due to the noisy nature of event-streams. In this paper, we introduce a data-driven approach aimed at enhancing the quality of event-stream representations. Our approach begins with a new event-stream representation based on spatio-temporal statistics, denoted EvRep. We then theoretically derive the intrinsic relationship between asynchronous event-streams and synchronous video frames. Building on this relationship, we train a representation generator, RepGen, in a self-supervised manner with EvRep as input. Finally, event-streams are converted to high-quality representations, termed EvRepSL, by passing them through the learned RepGen (without the need for fine-tuning or retraining). Our methodology is rigorously validated through extensive evaluations on a variety of mainstream event-based classification and optical-flow datasets (captured with various types of event cameras). The experimental results highlight not only our approach’s superior performance over existing event-stream representations but also its versatility, being agnostic to different event cameras and tasks.
{"title":"EvRepSL: Event-Stream Representation via Self-Supervised Learning for Event-Based Vision","authors":"Qiang Qu;Xiaoming Chen;Yuk Ying Chung;Yiran Shen","doi":"10.1109/TIP.2024.3497795","DOIUrl":"10.1109/TIP.2024.3497795","url":null,"abstract":"Event-stream representation is the first step for many computer vision tasks using event cameras. It converts the asynchronous event-streams into a formatted structure so that conventional machine learning models can be applied easily. However, most of the state-of-the-art event-stream representations are manually designed and the quality of these representations cannot be guaranteed due to the noisy nature of event-streams. In this paper, we introduce a data-driven approach aiming at enhancing the quality of event-stream representations. Our approach commences with the introduction of a new event-stream representation based on spatial-temporal statistics, denoted as EvRep. Subsequently, we theoretically derive the intrinsic relationship between asynchronous event-streams and synchronous video frames. Building upon this theoretical relationship, we train a representation generator, RepGen, in a self-supervised learning manner accepting EvRep as input. Finally, the event-streams are converted to high-quality representations, termed as EvRepSL, by going through the learned RepGen (without the need of fine-tuning or retraining). Our methodology is rigorously validated through extensive evaluations on a variety of mainstream event-based classification and optical flow datasets (captured with various types of event cameras). The experimental results highlight not only our approach’s superior performance over existing event-stream representations but also its versatility, being agnostic to different event cameras and tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6579-6591"},"PeriodicalIF":0.0,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142673314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-14 DOI: 10.1109/TIP.2024.3494600
Hongmin Liu;Canbin Zhang;Bin Fan;Jinglin Xu
Multi-object tracking (MOT) aims to estimate the bounding boxes and ID labels of objects in videos. The challenging issue in this task is to alleviate competitive learning between the detection and tracking subtasks, for which two-stage Tracking-By-Detection (TBD) methods optimize the two subtasks individually, while single-stage Joint Detection and Tracking (JDT) methods finely adjust complex network architectures in an end-to-end pipeline. In this paper, we propose a new MOT method, Proposal Propagation via Diffusion Models (Pro2Diff), which integrates a diffusion model into proposal propagation for multi-object tracking, focusing on the model training process rather than complex network design. Specifically, using a generative approach, Pro2Diff generates a considerable number of noisy proposals for the tracking image sequence in the forward process; it then learns the discrepancies between these noisy proposals and the actual bounding boxes of the tracked objects, gradually refining the noisy proposals to obtain the initial sequence of real tracked objects. By introducing the denoising diffusion process into multi-object tracking, we make three further findings: 1) generative methods can effectively handle multi-object tracking tasks; 2) without modifying the model structure, the proposed self-conditional proposal propagation effectively enhances performance during inference; 3) appropriately adjusting the numbers of proposals and iterations for different tracking sequences achieves the model's best performance. Extensive experimental results on the MOT17 and DanceTrack datasets demonstrate that Pro2Diff outperforms current end-to-end multi-object tracking methods, achieving 61.9 HOTA on DanceTrack and 57.6 HOTA on MOT17, competitive with JDT approaches.
{"title":"Pro2Diff: Proposal Propagation for Multi-Object Tracking via the Diffusion Model","authors":"Hongmin Liu;Canbin Zhang;Bin Fan;Jinglin Xu","doi":"10.1109/TIP.2024.3494600","DOIUrl":"10.1109/TIP.2024.3494600","url":null,"abstract":"Multi-object tracking (MOT) aims to estimate the bounding boxes and ID labels of objects in videos. The challenging issue in this task is to alleviate competitive learning between the detection and tracking subtasks, for which, two-stage Tracking-By-Detection (TBD) optimizes the two subtasks individually, and the single-stage Joint Detection and Tracking (JDT) adjusts the complex network architectures finely in an end-to-end pipeline. In this paper, we propose a new MOT method, i.e., Proposal Propagation via Diffusion Models, called Pro2Diff, which integrates a diffusion model into the proposal propagation in multi-object tracking, focusing on the model training process rather than complex network design. Specifically, using a generative approach, Pro2Diff generates a considerable number of noisy proposals for the tracking image sequence in the forward process, and subsequently, Pro2Diff learns the discrepancies between these noisy proposals and the actual bounding boxes of the tracked objects, gradually optimizing these noisy proposals to obtain the initial sequence of real tracked objects. By introducing the denoising diffusion process into multi-object tracking, we have made three further important findings: 1) Generative methods can effectively handle multi-object tracking tasks; 2) Without the need to modify the model structure, we propose self-conditional proposal propagation to enhance model performance effectively during inference; 3) By adjusting the numbers of proposals and iterations appropriately for different tracking sequences, the optimal performance of the model can be achieved. Extensive experimental results on MOT17 and DanceTrack datasets demonstrate that Pro2Diff outperforms current end-to-end multi-object tracking methods. We achieve 61.9 HOTA on DanceTrack and 57.6 HOTA on MOT17, reaching the competitive result of the JDT approach.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6508-6520"},"PeriodicalIF":0.0,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-14 DOI: 10.1109/TIP.2024.3494555
Indranil Misra;Mukesh Kumar Rohil;S. Manthira Moorthi;Debajyoti Dhar
Band-to-Band Registration (BBR) is a prerequisite image-processing operation for certain remote sensing multispectral sensors. BBR aims to align the spectral bands with one another at sub-pixel accuracy. This paper presents a novel BBR technique that uses Co-occurrence Scale Space (CSS) for feature point detection and Spatial Confined RANSAC (SC-RANSAC) for removing outlier matched control points. Additionally, a Segmented Affine Transformation (SAT) model reduces distortion and ensures consistent BBR. The methodology is evaluated on Nano-MX multispectral images acquired onboard the Indian Nano Satellite (INS-2B), covering diverse landscapes. BBR performance of the proposed method is also verified visually at a 4X zoom level on satellite scenes dominated by cloud pixels. The effect of band misregistration on the Normalized Difference Vegetation Index (NDVI) from INS-2B is analyzed before and after BBR correction and cross-validated against the closest-acquisition Landsat-9 OLI NDVI map. The experimental evaluation shows that the proposed BBR approach outperforms state-of-the-art image registration techniques.
Enhanced Multispectral Band-to-Band Registration Using Co-Occurrence Scale Space and Spatial Confined RANSAC Guided Segmented Affine Transformation. IEEE Transactions on Image Processing, vol. 33, pp. 6521-6534.
Pub Date : 2024-11-12 DOI: 10.1109/TIP.2024.3492724
Huan Liu;Wei Li;Xiang-Gen Xia;Mengmeng Zhang;Zhengqi Guo;Lujie Song
Hyperspectral images (HSIs), with hundreds of narrow spectral bands, are increasingly used for ground-object classification in remote sensing. However, many HSI classification models operate pixel-by-pixel, limiting the use of spatial information and increasing inference time for the whole image. This paper proposes SegHSI, an effective and efficient end-to-end HSI segmentation model, alongside a novel training strategy. SegHSI adopts a head-free structure with cluster attention modules and spatial-aware feedforward networks (SA-FFN) for multiscale spatial encoding. Cluster attention encodes pixels through clusters constructed within the HSI, while SA-FFN integrates depth-wise convolution to enhance spatial context. Our training strategy uses a student-teacher framework that combines labeled pixel class information with consistency learning on unlabeled pixels. Experiments on three public HSI datasets demonstrate that SegHSI not only surpasses other state-of-the-art models in segmentation accuracy but also achieves inference times on the order of seconds, even reaching sub-second speeds for full-image classification. Code is available at https://github.com/huanliu233/SegHSI.
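The labeled-plus-consistency training signal described above can be sketched roughly as follows, with an exponential-moving-average (EMA) teacher regularizing the student on unlabeled pixels. The loss weighting, augmentations, and the SegHSI architecture itself are omitted; function names and tensor shapes here are illustrative assumptions, not the released implementation.

```python
# Sketch of student-teacher training on partially labeled hyperspectral pixels:
# cross-entropy on labeled pixels + consistency toward an EMA teacher elsewhere.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of student weights into the teacher."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def semi_supervised_loss(student, teacher, hsi, labels, cons_weight=1.0):
    """hsi: (B, C, H, W) hyperspectral cube; labels: (B, H, W) long, -1 = unlabeled."""
    logits_s = student(hsi)                        # (B, num_classes, H, W)
    with torch.no_grad():
        logits_t = teacher(hsi)
    per_pixel_ce = F.cross_entropy(logits_s, labels.clamp(min=0), reduction="none")
    labeled = (labels >= 0).float()                # mask out unlabeled pixels
    sup = (per_pixel_ce * labeled).sum() / labeled.sum().clamp(min=1.0)
    cons = F.mse_loss(logits_s.softmax(dim=1), logits_t.softmax(dim=1))
    return sup + cons_weight * cons
```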