Intermediate flow estimation is an important part of video frame interpolation (VFI). Most previous works derive the intermediate flow by interpolating the inter-frame flows under a localized linear-motion assumption. However, this approach is not effective when dealing with extreme motion. In this work, we assume that the motion trajectory of an object is determined by its appearance characteristics. Based on this assumption, we propose a new intermediate flow estimation method that obtains the motion features of intermediate frames from image appearance and inter-frame motion features. In addition, to fully extract inter-frame features, we rethink how VFI differs from previous uses of the Swin Transformer and compute appearance and motion features within an adaptive neighborhood by cyclically shifting the window. Experimental results show that our method achieves state-of-the-art performance on different datasets for both fixed-time and arbitrary-time interpolation. Moreover, it outperforms models that require a sequence of four input frames when handling videos with extremely large motion. The source code is available at https://github.com/chen12304/IFE-VFI
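For context, the localized linear-motion baseline that the abstract argues breaks down under extreme motion amounts to rescaling the inter-frame flow by the target timestep. Below is a minimal NumPy sketch of that conventional assumption (illustrative names; this is the baseline, not the paper's appearance-based estimator):

import numpy as np

def linear_intermediate_flow(flow_0to1, t):
    # flow_0to1: (H, W, 2) optical flow from frame 0 to frame 1.
    # Under the localized linear-motion assumption, the flows from the
    # unknown intermediate frame at time t in (0, 1) back to frames 0 and 1
    # are simple rescalings of the inter-frame flow.
    flow_t_to_0 = -t * flow_0to1
    flow_t_to_1 = (1.0 - t) * flow_0to1
    return flow_t_to_0, flow_t_to_1

# Example: midpoint frame (t = 0.5) on a dummy 240x320 flow field.
f_t0, f_t1 = linear_intermediate_flow(np.zeros((240, 320, 2), np.float32), 0.5)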
{"title":"Video Frame Interpolation via Appearance-Based Intermediate Flow Estimation","authors":"Keyi Chen;Jingwei Xin;Nannan Wang;Jie Li;Xinbo Gao","doi":"10.1109/TIP.2026.3666772","DOIUrl":"10.1109/TIP.2026.3666772","url":null,"abstract":"Intermediate flow estimation is an important part of video frame interpolation (VFI). Most previous works use interpolation to derive the intermediate flow assuming localized linear motion. However, this method is not effective when dealing with extreme motions. In this work, we assume that the motion trajectory of an object is determined by the appearance characteristics of this object. Based on this assumption, we propose a new intermediate flow estimation method, which obtains the motion features of intermediate frames from image appearance and inter-frame motion features. In addition, in order to fully extract the inter-frame features, we rethink the difference of VFI and previous works on using Swin-Transformer and compute the appearance features and motion features within the adaptive neighborhood by cyclically shifting the window. Experimental results show that our method achieves state-of-the-art performance on different datasets for both fixed-time and arbitrary-time interpolation. Moreover, our proposed method outperforms models that require inputting a sequence of four frames when handling videos with extremely large motion. The source code is available from <uri>https://github.com/chen12304/IFE-VFI</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2335-2349"},"PeriodicalIF":13.7,"publicationDate":"2026-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147313731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To efficiently assist humans in various tasks, it is crucial to accurately decode and understand the rich information embedded in the brain's visual cognition. Existing brain-driven research often fails to overcome the challenge of small target data domains, and the lack of explicit semantic, spatial, and other constraints on feature extractors prevents brain decoding models from learning uniform cross-domain representations, degrading their performance in unseen domains. To overcome these limitations, we propose DAMind, a multimodal EEG-based model for robust visual cross-domain alignment and decoding. Our approach integrates a vision-language model (VLM) with brain-inspired cognitive mechanisms, leveraging its strong image-text representation abilities to learn both fine-grained primary visual features and high-level semantic concepts from neural signals, and provides effective visual fine-tuning through a visual guidance mechanism. DAMind introduces a stepwise EEG encoding process aligned with visual processing and employs an instruction-based learning strategy for effective cross-domain zero-shot transfer. Its robust architecture efficiently achieves good generalization, enabling EEG signals from multiple domains to be mapped to a unified learning domain. We construct a comprehensive EEG decoding benchmark, EBench, on which DAMind achieves state-of-the-art results on several visual tasks and outperforms the baseline in the zero-shot setting.
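The abstract does not spell out the alignment objective; one common way to exploit a VLM's image-text space is a CLIP-style contrastive loss between EEG embeddings and the VLM's visual embeddings. The PyTorch sketch below shows only that generic idea (an assumed illustration, not DAMind's actual objective):

import torch
import torch.nn.functional as F

def clip_style_alignment_loss(eeg_emb, vis_emb, temperature=0.07):
    # eeg_emb, vis_emb: (B, D) paired embeddings from an EEG encoder and a
    # (frozen) vision-language model; row i of each tensor describes the
    # same visual stimulus.
    eeg = F.normalize(eeg_emb, dim=-1)
    vis = F.normalize(vis_emb, dim=-1)
    logits = eeg @ vis.t() / temperature                    # (B, B) similarities
    targets = torch.arange(eeg.size(0), device=eeg.device)
    # Symmetric InfoNCE: match each EEG segment to its image and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))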
{"title":"DAMind: Zero-shot Visual Cross-Domain Alignment and Representation for EEG Decoding.","authors":"Haodong Jing, Yongqiang Ma, Panqi Yang, Haoyu Li, Shuai Huang, Badong Chen, Nanning Zheng","doi":"10.1109/TIP.2026.3666730","DOIUrl":"https://doi.org/10.1109/TIP.2026.3666730","url":null,"abstract":"<p><p>To efficiently assist humans in various tasks, it is crucial to accurately decode and understand the rich information embedded in brain's visual cognition. Existing brain-driven research often fails to overcome the challenge of small target data domains, and the lack of explicit semantic, spatial, and other information constraints on feature extractors prevents brain decoding models from learning uniform cross-domain representations, leading to degradation of their performance in unseen domains. To overcome these limitations, we propose DAMind, a multimodal EEG-based model for robust visual cross-domain alignment and decoding. Our approach integrates VLM with brain-inspired cognitive mechanisms, leveraging the strong image-text representation abilities to learn both fine-grained primary visual features and high-level semantic concepts from neural signals, provide effective visual fine-tuning using the visual guidance mechanism. DAMind introduces a stepwise EEG encoding process aligned with visual processing, and employs an instruction-based learning strategy for effective cross-domain zero-shot transfer. Its robust architecture efficiently achieves good generalization performance, enabling the mapping of EEG signals from multiple domains to a unified learning domain. We construct a comprehensive EEG decoding benchmark EBench, DAMind achieves state-of-the-art results on several visual tasks, and outperforms the baseline in zero-shot setting.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147313714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging, especially in dynamic scenarios. Thus, numerous methods have been proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking that builds long-range temporal dependencies without customized modules or substantial computational costs. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.
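The hidden-state propagation described above follows the general state space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t. A minimal diagonal, non-selective PyTorch sketch of that recurrence follows (SMTrack's selective, state-wise parameterization is not reproduced here):

import torch

def ssm_scan(x, A, B, C):
    # x: (T, D) per-frame features; A, B, C: (D,) diagonal parameters.
    # Each frame interacts with all previously tracked frames only through
    # the propagated hidden state h, giving linear cost in sequence length.
    h = torch.zeros_like(x[0])
    ys = []
    for x_t in x:
        h = A * h + B * x_t          # propagate and update the hidden state
        ys.append(C * h)             # read out a temporally contextualized feature
    return torch.stack(ys)           # (T, D)

# Example: 8 frames of 4-dim features with simple decay dynamics.
y = ssm_scan(torch.randn(8, 4), torch.full((4,), 0.9), torch.ones(4), torch.ones(4))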
{"title":"SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking","authors":"Yinchao Ma;Dengqing Yang;Zhangyu He;Wenfei Yang;Tianzhu Zhang","doi":"10.1109/TIP.2026.3661393","DOIUrl":"10.1109/TIP.2026.3661393","url":null,"abstract":"Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which releases computational costs of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2249-2261"},"PeriodicalIF":13.7,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147277979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-20. DOI: 10.1109/TIP.2026.3661392
Yao Xiao;Pengxu Wei;Guangrun Wang;Cong Liu;Liang Lin
A few recent works attempt to train an adversarially robust Unsupervised Domain Adaptation (UDA) model, transferring the robustness from a robust source model or other robust pre-trained models to an unlabeled target domain. However, it is usually impractical to assume the availability of robust source models or robust pre-training, and meanwhile, source data are not always accessible or efficient for adaptation training in many real-world scenarios. In this paper, we dive into a more practical and challenging problem of robust source-free domain adaptation: can we train a robust model on an unlabeled target domain given only a non-robust source model (without source data)? Empirically, we find that applying adversarial training (AT) to the self-supervised adaptation process leads to severe model degradation, as it tends to amplify the inevitable errors of UDA models. To tackle this issue, we propose a novel approach called Source-Free Alternating Optimization (SFAO), which employs a non-robust target model to provide better guidance for the AT of the desired robust target model. The two models are trained in an alternating manner to minimize the discrepancy between the clean source domain and the adversarial target domain. Moreover, we propose Softly-Constrained Adversarial Training (SCAT) to further mitigate the adverse effects of incorrect pseudo-labels in AT. Extensive experimental results demonstrate that the proposed method significantly improves the model performance on both clean and adversarial data. Source code is available at: https://github.com/Coxy7/robust-SFDA.
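As a rough illustration of letting a non-robust target model guide adversarial training (AT) of the robust one, the sketch below crafts PGD examples against the robust model using pseudo-labels supplied by the clean model. It is a generic AT step under assumed hyperparameters, not the paper's SFAO/SCAT formulation:

import torch
import torch.nn.functional as F

def guided_at_loss(robust_model, clean_model, x, eps=8/255, alpha=2/255, steps=5):
    # Pseudo-labels come from the clean, non-robust target model; no source
    # data or robust pre-training is assumed.
    with torch.no_grad():
        pseudo = clean_model(x).argmax(dim=1)
    # PGD inner maximization against the robust model.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(robust_model(x_adv), pseudo)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    # Outer minimization: the robust model is trained on the adversarial batch.
    return F.cross_entropy(robust_model(x_adv), pseudo)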
{"title":"Robust Source-Free Domain Adaptation From Non-Robust Source Models","authors":"Yao Xiao;Pengxu Wei;Guangrun Wang;Cong Liu;Liang Lin","doi":"10.1109/TIP.2026.3661392","DOIUrl":"10.1109/TIP.2026.3661392","url":null,"abstract":"A few recent works attempt to train an adversarially robust Unsupervised Domain Adaptation (UDA) model, transferring the robustness from a robust source model or other robust pre-trained models to an unlabeled target domain. However, it is usually impractical to assume the availability of robust source models or robust pre-training, and meanwhile, source data are not always accessible or efficient for adaptation training in many real-world scenarios. In this paper, we dive into a more practical and challenging problem of robust source-free domain adaptation: can we train a robust model on an unlabeled target domain given only a non-robust source model (without source data)? Empirically, we find that applying adversarial training (AT) to the self-supervised adaptation process leads to severe model degradation, as it tends to amplify the inevitable errors of UDA models. To tackle this issue, we propose a novel approach called Source-Free Alternating Optimization (SFAO), which employs a non-robust target model to provide better guidance for the AT of the desired robust target model. The two models are trained in an alternating manner to minimize the discrepancy between the clean source domain and the adversarial target domain. Moreover, we propose Softly-Constrained Adversarial Training (<monospace>SCAT</monospace>) to further mitigate the adverse effects of incorrect pseudo-labels in AT. Extensive experimental results demonstrate that the proposed method significantly improves the model performance on both clean and adversarial data. Source code is available at: <uri>https://github.com/Coxy7/robust-SFDA</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2350-2363"},"PeriodicalIF":13.7,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-20. DOI: 10.1109/TIP.2026.3664762
Aditya Panda;Dipti Prasad Mukherjee
Partially supervised Compositional Zero-Shot Learning (pCZSL) recognizes new compositions of states and objects, where for every image in the training set either the state or the object annotation is available. In pCZSL, the features of a state vary depending on the object in the composition (e.g., the features of the state ripe differ between ripe banana and ripe apple). Understanding how features vary across object scales is another key challenge. In the proposed architecture, a Swin Transformer-based Hierarchical Feature Extractor (HFE) captures the large range of semantic interactions between state and object features. A Discriminative Context Aggregation module utilizes features from the intermediate layers of the HFE to understand object features at their corresponding scales. To leverage the partially labeled data in pCZSL, we pass strongly and weakly augmented versions of the input image through the proposed architecture. The predicted class probabilities for the strongly and weakly augmented images are encouraged to be similar by minimizing a distribution alignment loss. This loss incorporates a class-specific re-weighting approach to alleviate the effect of data imbalance in pCZSL. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed approach.
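The distribution alignment loss with class re-weighting can be sketched generically as a weighted cross-entropy between the strong-view prediction and the detached weak-view prediction; the PyTorch snippet below is an illustration under assumed conventions, not the paper's exact loss:

import torch
import torch.nn.functional as F

def class_balanced_alignment_loss(logits_weak, logits_strong, class_weights):
    # logits_weak / logits_strong: (B, K) predictions for the weakly and
    # strongly augmented views; class_weights: (K,) re-weighting factors,
    # e.g. inversely proportional to class frequency.
    target = F.softmax(logits_weak, dim=1).detach()    # soft target from the weak view
    log_pred = F.log_softmax(logits_strong, dim=1)
    per_sample = -(target * log_pred).sum(dim=1)       # cross-entropy to the soft target
    w = class_weights[target.argmax(dim=1)]            # weight by the (pseudo) class
    return (w * per_sample).mean()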
{"title":"Partially Supervised Compositional Zero-Shot Learning by Class-Balanced Distribution Alignment","authors":"Aditya Panda;Dipti Prasad Mukherjee","doi":"10.1109/TIP.2026.3664762","DOIUrl":"10.1109/TIP.2026.3664762","url":null,"abstract":"The partially supervised Compositional Zero-Shot Learning (pCZSL) recognizes <italic>new</i> compositions of states and objects, where for every image in the training set either the state or the object annotation is available. In pCZSL, features of a state vary depending on the object in the composition (e.g. the features of state <italic>ripe</i> are different for <italic>ripe banana</i> and <italic>ripe apple</i>). Understanding the variation in features across scales of objects is also a key challenge. In the proposed architecture, a <italic>swin</i> transformer based Hierarchical Feature Extractor (HFE) captures the large range of semantic interactions between state and object features. The Discriminative Context Aggregation module utilizes features from the intermediate layers of the HFE to understand the features of object at their corresponding scales. To leverage the partially labeled data in pCZSL, we pass <italic>strongly</i> and <italic>weakly</i> augmented versions of the input image to the proposed architecture. The predicted class probabilities for <italic>strongly</i> and <italic>weakly</i> augmented images are encouraged to be similar, minimizing a <italic>distribution alignment</i> loss. This loss incorporates class specific re-weighting approach to alleviate the effect of data imbalance for pCZSL. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed approach.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2484-2498"},"PeriodicalIF":13.7,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised domain adaptive object detection methods enhance model robustness in the target domain without requiring target-domain annotations. Despite notable progress, existing methods face two major challenges: 1) insufficient and inefficient learning of holistic feature consistency due to cumbersome pixel-level style matching and semantic discrepancy elimination between domains as well as the overlooking of their collaborative effect; and 2) unreliable learning of category feature compactness caused by poor-quality target-domain samples, inaccurate pseudo-labels and noisy cross-domain contrast paradigms. To address these challenges, we propose a novel Semantic Consistency and Compactness Learning (SCCL) network. For consistency learning, we introduce a Visual Adaptation-guided Semantic Alignment (VSA) module that achieves style matching through simple feature adaptation and incorporates a novel adversarial-free self-supervised method for feature disentanglement. The collaboration between these two aspects enables sufficient and efficient consistency learning. For reliable compactness learning, we develop a plug-and-play Instance Center-Contrastive (ICC) head that, for the first time, comprehensively addresses all three potential causes of unreliable learning through three integrated innovations, concerning sample pseudo-label quality enhancement, reliable sample storage and updating, and a robust sample contrast paradigm. Besides, the mutual reinforcement effect of VSA and ICC simultaneously enhances feature transferability and discriminability. Extensive experiments across four UDA object detection benchmarks with two baselines show that SCCL achieves superior adaptability and robustness. Code will be available at https://github.com/TooZE23/SCCL.
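A generic instance center-contrastive objective, the kind of term the ICC head builds on, can be sketched as below (PyTorch); the pseudo-label quality enhancement, reliable center storage and updating, and robust contrast paradigm described above are omitted:

import torch
import torch.nn.functional as F

def center_contrastive_loss(inst_feats, labels, centers, tau=0.1):
    # inst_feats: (N, D) instance/RoI features; labels: (N,) class indices
    # (pseudo-labels on the target domain); centers: (K, D) per-class centers.
    f = F.normalize(inst_feats, dim=1)
    c = F.normalize(centers, dim=1)
    logits = f @ c.t() / tau                 # similarity of each instance to every center
    return F.cross_entropy(logits, labels)   # own-class center acts as the positive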
{"title":"Unsupervised Domain Adaptive Object Detection via Semantic Consistency and Compactness Learning","authors":"Yajing Liu;Zhen Zhang;Yiming Su;Chunhui Hao;Xiyao Liu;Jiandong Tian","doi":"10.1109/TIP.2026.3663935","DOIUrl":"10.1109/TIP.2026.3663935","url":null,"abstract":"Unsupervised domain adaptive object detection methods enhance model robustness in the target domain without requiring target-domain annotations. Despite notable progress, existing methods face two major challenges: 1) insufficient and inefficient learning of holistic feature consistency due to cumbersome pixel-level style matching and semantic discrepancy elimination between domains as well as the overlooking of their collaborative effect; and 2) unreliable learning of category feature compactness caused by poor-quality target-domain samples, inaccurate pseudo-labels and noisy cross-domain contrast paradigms. To address these challenges, we propose a novel Semantic Consistency and Compactness Learning (SCCL) network. For consistency learning, we introduce a Visual Adaptation-guided Semantic Alignment (VSA) module that achieves style matching through simple feature adaptation and incorporates a novel adversarial-free self-supervised method for feature disentanglement. The collaboration between these two aspects enables sufficient and efficient consistency learning. For reliable compactness learning, we develop a plug-and-play Instance Center-Contrastive (ICC) head that, for the first time, comprehensively addresses all three potential causes of unreliable learning through three integrated innovations, concerning sample pseudo-label quality enhancement, reliable sample storage and updating, and a robust sample contrast paradigm. Besides, the mutual reinforcement effect of VSA and ICC simultaneously enhances feature transferability and discriminability. Extensive experiments across four UDA object detection benchmarks with two baselines show that SCCL achieves superior adaptability and robustness. Code will be available at <uri>https://github.com/TooZE23/SCCL</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2276-2291"},"PeriodicalIF":13.7,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction from multiple modality images, including visible, depth, and thermal images, with only a few annotated samples. However, some existing efforts treat the three modalities equally and do not account for their inherent differences. Besides, objects vary greatly in size, and cutting-edge matching paradigms fail to establish an effective support-query connection. Therefore, we propose a novel scale-invariant feature matching network (i.e., SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, to conduct visible-depth-thermal (V-D-T) few-shot semantic segmentation. Firstly, in the encoder, after extracting multi-level initial features, we fuse each level's RGB feature and thermal feature, yielding the support features and the query features. Secondly, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module is deployed to explore the correlation between each level's support feature and query feature, where pixel-to-patch pooling (PTP-pool) units are designed to build scale-invariant relationships, generating a coarse mask for the query image. Thirdly, in the feature elevation block, we employ a prior-related fusion (PF) module to integrate the depth image with the coarse mask via a cross-attention mechanism, yielding an enhanced coarse prediction that is further aggregated in a bottom-up way. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object interiors and spatial details, and generate the final segmentation results via conventional convolution layers. Extensive experiments are conducted on the VDT-2048-$5^{i}$ dataset, and the results show that our model outperforms state-of-the-art methods by a large margin.
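The pixel-to-patch idea can be made concrete with a bare-bones cross-attention in which every query pixel attends to average-pooled support patches; this sketch assumes H and W are divisible by the patch size and omits the learned projections of the actual PTPCA module:

import torch
import torch.nn.functional as F

def pixel_to_patch_attention(query_feat, support_feat, patch=4):
    # query_feat, support_feat: (B, C, H, W) feature maps.
    B, C, H, W = query_feat.shape
    patches = F.avg_pool2d(support_feat, patch)           # (B, C, H/p, W/p) patch tokens
    k = patches.flatten(2)                                 # (B, C, M)
    q = query_feat.flatten(2)                              # (B, C, H*W) pixel tokens
    attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)   # (B, HW, M)
    out = attn @ k.transpose(1, 2)                         # (B, HW, C) aggregated support
    return out.transpose(1, 2).reshape(B, C, H, W)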
{"title":"Scale-Invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation","authors":"Xiaofei Zhou;Jia Lin;Dongmei Chen;Deyang Liu;Jiyong Zhang;Runmin Cong","doi":"10.1109/TIP.2026.3663882","DOIUrl":"10.1109/TIP.2026.3663882","url":null,"abstract":"Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction from multiple modality images including visible image, depth image, and thermal image with a few annotated samples. However, some efforts treat the three modality information equally, where they don’t incorporate the inherent differences among multiple modalities. Besides, the objects vary in size greatly, and the cutting-edge matching paradigms fail to establish an effective support-query connection. Therefore, we propose a novel scale-invariant feature matching network (i.e., SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, to conduct visible-depth-thermal (V-D-T) few-shot semantic segmentation. Firstly, in the encoder part, after the extraction of multi-level initial features, we fuse each level’s RGB feature and thermal feature, yielding the support features and the query features. Secondly, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module is deployed to explore the correlation between each level’s support feature and the query feature, where the pixel-to-patch pooling (PTP-pool) units are designed to build scale-invariant relationships, generating the coarse mask for the query image. Thirdly, in the feature elevation block, we employ the prior-related fusion (PF) module to integrate the depth image with a coarse mask via the cross-attention mechanism, yielding the enhanced coarse prediction result, which is further aggregated in a bottom-up way. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object internal regions and spatial details, and further generate the final segmentation results via conventional convolution layers. Extensive experiments are conducted on the VDT-2048-<inline-formula> <tex-math>$5^{i}$ </tex-math></inline-formula> dataset, and the results show that our model outperforms the state-of-the-art methods with a large margin.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2198-2209"},"PeriodicalIF":13.7,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-19. DOI: 10.1109/TIP.2026.3663857
Zheng Xing;Weibing Zhao
Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and incorporating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.
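In its generic form, subspace embedding with a temporal regularizer combines a self-expressive reconstruction term with a penalty that pulls the codes of temporally neighboring frames together. A NumPy sketch of that generic objective (not the paper's exact formulation or solver):

import numpy as np

def temporal_self_expressive_objective(X, C, lam=1.0, gamma=1.0):
    # X: (D, T) motion features with frames as columns; C: (T, T) matrix of
    # self-expression codes, one column per frame.
    recon = np.linalg.norm(X - X @ C, 'fro') ** 2              # self-expressiveness
    reg = lam * np.linalg.norm(C, 'fro') ** 2                  # code regularization
    temporal = gamma * np.sum((C[:, 1:] - C[:, :-1]) ** 2)     # neighbors share codes
    return recon + reg + temporal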
{"title":"Temporal Visual Semantics-Induced Human Motion Understanding With Large Language Models","authors":"Zheng Xing;Weibing Zhao","doi":"10.1109/TIP.2026.3663857","DOIUrl":"10.1109/TIP.2026.3663857","url":null,"abstract":"Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and incorporating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2182-2197"},"PeriodicalIF":13.7,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-18. DOI: 10.1109/TIP.2026.3663930
Avinab Saha;Yu-Chih Chen;Christian Häne;Jean-Charles Bazin;Ioannis Katsavounidis;Alexandre Chapiro;Alan C. Bovik
We present HoloQA, a new state-of-the-art Full Reference Video Quality Assessment (VQA) model that was designed using principles of visual neuroscience, information theory, and self-supervised deep learning to accurately predict the quality of rendered digital human avatars in Virtual Reality (VR) and Augmented Reality (AR) systems. The growing adoption of VR/AR applications that aim to transmit digital human avatars over bandwidth-limited video networks has driven the need for VQA algorithms that better account for the kinds of distortions that reduce the quality of rendered and viewed avatars. As we will show, standard VQA models often fail to capture distortions unique to the rendering, transmission, and compression of videos containing human avatars. Towards solving this difficult problem, we adopt a multi-level Mixture-of-Experts approach. This involves computing distortion-aware perceptual features and high-level content-aware deep features that capture semantic attributes of human body avatars. The high-level features are computed using a self-supervised, pre-trained deep learning network. We show that HoloQA is able to achieve state-of-the-art performance on the recently introduced LIVE-Meta Rendered Human Avatar VQA database, demonstrating its efficacy in predicting the quality of rendered human avatars in VR. Furthermore, we demonstrate the competitive performance of HoloQA on other digital human avatar databases and on another synthetically generated video quality use case: cloud gaming. The code associated with this work will be made available on https://github.com/avinabsaha/HologramQAGitHub
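A schematic two-expert fusion of the kind described above (distortion-aware features and content-aware deep features combined by a learned gate before quality regression) could look like the following; dimensions and structure are assumptions for illustration, not HoloQA's architecture:

import torch
import torch.nn as nn

class TwoExpertFusion(nn.Module):
    # A gate weighs a distortion-aware expert against a content-aware expert.
    def __init__(self, d_distortion, d_content, hidden=128):
        super().__init__()
        self.expert_d = nn.Linear(d_distortion, hidden)
        self.expert_c = nn.Linear(d_content, hidden)
        self.gate = nn.Linear(d_distortion + d_content, 2)
        self.head = nn.Linear(hidden, 1)

    def forward(self, f_distortion, f_content):
        w = torch.softmax(self.gate(torch.cat([f_distortion, f_content], dim=1)), dim=1)
        mix = w[:, :1] * self.expert_d(f_distortion) + w[:, 1:] * self.expert_c(f_content)
        return self.head(torch.relu(mix)).squeeze(1)   # per-video quality score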
{"title":"HoloQA: Full Reference Video Quality Assessor of Rendered Human Avatars in Virtual Reality","authors":"Avinab Saha;Yu-Chih Chen;Christian Häne;Jean-Charles Bazin;Ioannis Katsavounidis;Alexandre Chapiro;Alan C. Bovik","doi":"10.1109/TIP.2026.3663930","DOIUrl":"10.1109/TIP.2026.3663930","url":null,"abstract":"We present HoloQA, a new state-of-the-art Full Reference Video Quality Assessment (VQA) model that was designed using principles of visual neuroscience, information theory, and self-supervised deep learning to accurately predict the quality of rendered digital human avatars in Virtual Reality (VR) and Augmented Reality (AR) systems. The growing adoption of VR/AR applications that aim to transmit digital human avatars over bandwidth-limited video networks has driven the need for VQA algorithms that better account for the kinds of distortions that reduce the quality of rendered and viewed avatars. As we will show, standard VQA models often fail to capture distortions unique to the rendering, transmission, and compression of videos containing human avatars. Towards solving this difficult problem, we adopt a multi-level Mixture-of-Experts approach. This involves computing distortion-aware perceptual features and high-level content-aware deep features that capture semantic attributes of human body avatars. The high-level features are computed using a self-supervised, pre-trained deep learning network. We show that HoloQA is able to achieve state-of-the-art performance on the recently introduced LIVE-Meta Rendered Human Avatar VQA database, demonstrating its efficacy in predicting the quality of rendered human avatars in VR. Furthermore, we demonstrate the competitive performance of HoloQA on other digital human avatar databases and on another synthetically generated video quality use case: cloud gaming. The code associated with this work will be made available on <uri>https://github.com/avinabsaha/HologramQAGitHub</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2210-2223"},"PeriodicalIF":13.7,"publicationDate":"2026-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146222656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-18. DOI: 10.1109/TIP.2026.3663898
Chao Huang;Jingxuan Zhang;Ye Zhang;Hao Wu;Peibei Cao;Zhihua Wang;Yang Yu;Xiaochun Cao
Fingerprint biometrics plays a crucial role in biometric identification, especially in applications such as criminal investigations. Although recent progress in recognition methodology has significantly enhanced automated fingerprint recognition, these systems still rely heavily on the quality of the input fingerprints. In criminal investigations, fingerprints are often of low quality due to their incidental deposition from natural oils and sweat, rather than being deliberately captured under controlled conditions. This degradation can significantly impact usability and identification accuracy, underscoring the need for effective Fingerprint Quality Assessment (FQA) methods. In this paper, we establish the Crime Scene Fingerprints quality assessment Dataset (CSFD-10k), the largest dataset of its kind, containing 11,500 fingerprint images from real criminal investigations. Of these, 10,000 samples are assigned Mean Opinion Scores (MOSs) for correlation testing, while the remaining 1,500 are labeled based on matching performance for generalizability testing. All labels are provided by frontline criminal police officers. Using this dataset, we propose a deep neural network-based Dual-Branch FQA (DB-FQA) framework that integrates image-level and edge-level features. The DB-FQA enhances ridge details by transforming raw grayscale fingerprints into edge maps using the Logical/Linear operator. A dual-branch network processes both the raw fingerprint and the edge map, and the Multi-scale Adaptive Cross feature Fusion (MACF) module fuses these features, guided by the edge map to highlight quality-related regions of interest. Extensive experiments demonstrate the robustness and superiority of our proposed method, offering substantial support for forensic fingerprint biometrics. The code and dataset are available at https://github.com/wzhsysu/FIQA.
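To make the dual-branch input pairing concrete, the snippet below derives an edge map from the grayscale fingerprint with a Sobel operator; the paper uses the Logical/Linear operator instead, so this is only a stand-in showing how the image branch and the edge branch would be fed:

import torch
import torch.nn.functional as F

def sobel_edge_map(gray):
    # gray: (B, 1, H, W) grayscale fingerprint in [0, 1].
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)   # ridge-emphasizing edge map

# The two branches then receive paired inputs, e.g.:
# raw, edge = gray, sobel_edge_map(gray)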
{"title":"Latent Fingerprint Quality Assessment for Criminal Investigations: A Benchmark Dataset and Method","authors":"Chao Huang;Jingxuan Zhang;Ye Zhang;Hao Wu;Peibei Cao;Zhihua Wang;Yang Yu;Xiaochun Cao","doi":"10.1109/TIP.2026.3663898","DOIUrl":"10.1109/TIP.2026.3663898","url":null,"abstract":"Fingerprint biometrics plays a crucial role in biometric identification, especially in applications such as criminal investigations. Although recent progress in recognition methodology has significantly enhanced automated fingerprint recognition, these systems still rely heavily on the quality of the input fingerprints. In criminal investigations, fingerprints are often of low quality due to their incidental deposition from natural oils and sweat, rather than being deliberately captured under controlled conditions. This degradation can significantly impact usability and identification accuracy, underscoring the need for effective Fingerprint Quality Assessment (FQA) methods. In this paper, we establish the Crime Scene Fingerprints quality assessment Dataset (CSFD-10k), the largest dataset of its kind, containing 11,500 fingerprint images from real criminal investigations. Of these, 10,000 samples are assigned Mean Opinion Scores (MOSs) for correlation testing, while the remaining 1,500 are labeled based on matching performance for generalizability testing. All labels are provided by frontline criminal police officers. Using this dataset, we propose a deep neural network-based Dual-Branch FQA (DB-FQA) framework that integrates image-level and edge-level features. The DB-FQA enhances ridge details by transforming raw grayscale fingerprints into edge maps using the Logical/Linear operator. A dual-branch network processes both the raw fingerprint and the edge map, and the Multi-scale Adaptive Cross feature Fusion (MACF) module fuses these features, guided by the edge map to highlight quality-related regions of interest. Extensive experiments demonstrate the robustness and superiority of our proposed method, offering substantial support for forensic fingerprint biometrics. The code and dataset are available at <uri>https://github.com/wzhsysu/FIQA</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2262-2275"},"PeriodicalIF":13.7,"publicationDate":"2026-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146222643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}