Pub Date: 2025-01-15 | DOI: 10.1109/TIP.2025.3527369
Yanwei Zheng;Bowen Huang;Zekai Chen;Dongxiao Yu
Text-video retrieval aims to establish a matching relationship between a video and its corresponding text. However, previous works have primarily focused on salient video subjects, such as humans or animals, often overlooking Low-Salient but Discriminative Objects (LSDOs) that play a critical role in understanding content. To address this limitation, we propose a novel model that enhances retrieval performance by emphasizing these overlooked elements across video and text modalities. In the video modality, our model first incorporates a feature selection module to gather video-level LSDO features, and applies cross-modal attention to assign frame-specific weights based on relevance, yielding frame-level LSDO features. In the text modality, text-level LSDO features are captured by generating multiple object prototypes in a sparse aggregation manner. Extensive experiments on benchmark datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, demonstrate that our model achieves state-of-the-art results across various evaluation metrics.
{"title":"Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects","authors":"Yanwei Zheng;Bowen Huang;Zekai Chen;Dongxiao Yu","doi":"10.1109/TIP.2025.3527369","DOIUrl":"10.1109/TIP.2025.3527369","url":null,"abstract":"Text-video retrieval aims to establish a matching relationship between a video and its corresponding text. However, previous works have primarily focused on salient video subjects, such as humans or animals, often overlooking Low-Salient but Discriminative Objects (LSDOs) that play a critical role in understanding content. To address this limitation, we propose a novel model that enhances retrieval performance by emphasizing these overlooked elements across video and text modalities. In the video modality, our model first incorporates a feature selection module to gather video-level LSDO features, and applies cross-modal attention to assign frame-specific weights based on relevance, yielding frame-level LSDO features. In the text modality, text-level LSDO features are captured by generating multiple object prototypes in a sparse aggregation manner. Extensive experiments on benchmark datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, demonstrate that our model achieves state-of-the-art results across various evaluation metrics.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"581-593"},"PeriodicalIF":0.0,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142986395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-14 | DOI: 10.1109/TIP.2025.3526054
Qiuyu Huang;Zequn Jie;Lin Ma;Li Shen;Shenqi Lai
Recently, MLP-based architectures have achieved performance competitive with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, and thus lack the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address this limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates allows us to obtain diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting global context information and enhancing the spatial representation power of the model. Our PFMLP is the first lightweight MLP to obtain results comparable to state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark, and at larger FLOP budgets it surpasses state-of-the-art CNNs, ViTs, and MLPs of similar computational complexity. Furthermore, experiments on object detection, instance segmentation, and semantic segmentation demonstrate that the visual representations acquired by PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials, including the training code and logs, are released at https://github.com/huangqiuyu/PFMLP.
{"title":"A Pyramid Fusion MLP for Dense Prediction","authors":"Qiuyu Huang;Zequn Jie;Lin Ma;Li Shen;Shenqi Lai","doi":"10.1109/TIP.2025.3526054","DOIUrl":"10.1109/TIP.2025.3526054","url":null,"abstract":"Recently, MLP-based architectures have achieved competitive performance with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, thereby lacking the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address the above limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates allows us to obtain diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting the potential of global context information and enhancing the spatial representation power of the model. Our PFMLP is the first lightweight MLP to obtain comparable results with state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark. With larger FLOPs, it exceeds state-of-the-art CNNs, ViTs, and MLPs under similar computational complexity. Furthermore, experiments in object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired from PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials contain the training codes and logs are released at <uri>https://github.com/huangqiuyu/PFMLP</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"455-467"},"PeriodicalIF":0.0,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142981486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-14 | DOI: 10.1109/TIP.2025.3527372
Liuxin Bao;Xiaofei Zhou;Bolun Zheng;Runmin Cong;Haibing Yin;Jiyong Zhang;Chenggang Yan
Visible-depth-thermal (VDT) salient object detection (SOD) aims to highlight the most visually attractive object by utilizing triple-modal cues. However, existing models do not sufficiently explore multi-modal correlations and differences, which leads to unsatisfactory detection performance. In this paper, we propose an interaction, fusion, and enhancement network (IFENet) for the VDT SOD task, which comprises three key steps: multi-modal interaction, multi-modal fusion, and spatial enhancement. Built on a Transformer backbone, our IFENet acquires multi-scale multi-modal features. Firstly, the inter-modal and intra-modal graph-based interaction (IIGI) module is deployed to explore inter-modal channel correlation and intra-modal long-term spatial dependency. Secondly, the gated attention-based fusion (GAF) module is employed to purify and aggregate the triple-modal features, where multi-modal features are filtered along the spatial, channel, and modality dimensions, respectively. Lastly, the frequency split-based enhancement (FSE) module separates the fused feature into high-frequency and low-frequency components to enhance the spatial information (i.e., boundary details and object location) of the salient object. Extensive experiments are performed on the VDT-2048 dataset, and the results show that our saliency model consistently outperforms 13 state-of-the-art models. Our code and results are available at https://github.com/Lx-Bao/IFENet.
{"title":"IFENet: Interaction, Fusion, and Enhancement Network for V-D-T Salient Object Detection","authors":"Liuxin Bao;Xiaofei Zhou;Bolun Zheng;Runmin Cong;Haibing Yin;Jiyong Zhang;Chenggang Yan","doi":"10.1109/TIP.2025.3527372","DOIUrl":"10.1109/TIP.2025.3527372","url":null,"abstract":"Visible-depth-thermal (VDT) salient object detection (SOD) aims to highlight the most visually attractive object by utilizing the triple-modal cues. However, existing models don’t give sufficient exploration of the multi-modal correlations and differentiation, which leads to unsatisfactory detection performance. In this paper, we propose an interaction, fusion, and enhancement network (IFENet) to conduct the VDT SOD task, which contains three key steps including the multi-modal interaction, the multi-modal fusion, and the spatial enhancement. Specifically, embarking on the Transformer backbone, our IFENet can acquire multi-scale multi-modal features. Firstly, the inter-modal and intra-modal graph-based interaction (IIGI) module is deployed to explore inter-modal channel correlation and intra-modal long-term spatial dependency. Secondly, the gated attention-based fusion (GAF) module is employed to purify and aggregate the triple-modal features, where multi-modal features are filtered along spatial, channel, and modality dimensions, respectively. Lastly, the frequency split-based enhancement (FSE) module separates the fused feature into high-frequency and low-frequency components to enhance spatial information (i.e., boundary details and object location) of the salient object. Extensive experiments are performed on VDT-2048 dataset, and the results show that our saliency model consistently outperforms 13 state-of-the-art models. Our code and results are available at <uri>https://github.com/Lx-Bao/IFENet</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"483-494"},"PeriodicalIF":0.0,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142981487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-14 | DOI: 10.1109/TIP.2025.3527365
Xuelin Shen;Linfeng Pan;Zhangkai Ni;Yulin He;Wenhan Yang;Shiqi Wang;Sam Kwong
High Dynamic Range (HDR) images present unique challenges for Learned Image Compression (LIC) due to their complex domain distribution compared to Low Dynamic Range (LDR) images. In coding practice, HDR-oriented LIC typically adopts preprocessing steps (e.g., perceptual quantization and tone mapping) to align the distributions of LDR and HDR images, which inevitably comes at the expense of perceptual quality. To address this challenge, we rethink the HDR imaging process, which fuses multiple-exposure LDR images into an HDR image, and propose a novel HDR image compression paradigm, Unifying Imaging and Compression (HDR-UIC). The key innovation lies in establishing a seamless pipeline from image capture to delivery that enables end-to-end training and optimization. Specifically, a Mixture-ATtention (MAT)-based compression backbone merges LDR features while simultaneously generating a compact representation. Meanwhile, the Reference-guided Misalignment-aware feature Enhancement (RME) module mitigates ghosting artifacts caused by misalignment in the LDR branches, maintaining fidelity without introducing additional information. Furthermore, we introduce an Appearance Redundancy Removal (ARR) module to optimize coding resource allocation among LDR features, thereby enhancing the final HDR compression performance. Extensive experimental results demonstrate the efficacy of our approach, showing significant improvements over existing state-of-the-art HDR compression schemes. Our code is available at: https://github.com/plf1999/HDR-UIC.
"Breaking Boundaries: Unifying Imaging and Compression for HDR Image Compression," IEEE Transactions on Image Processing, vol. 34, pp. 510-521.
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526065
Zhengyuan Zhang;Zuozhou Pan;Zhuoyi Lin;Arunima Sharma;Chia-Wen Lin;Manojit Pramanik;Yuanjin Zheng
Acoustic resolution photoacoustic microscopy (AR-PAM) is a novel medical imaging modality that can be used for both structural and functional imaging in deep bio-tissue. However, its dependence on acoustic focusing degrades the imaging resolution and causes loss of structural detail, which significantly constrains its scope of applications in medical and clinical scenarios. To address this issue, model-based approaches incorporating traditional analytical prior terms have been employed, but they struggle to capture the finer details of anatomical bio-structures. In this paper, we propose an innovative prior, named the group sparsity prior, for simultaneous reconstruction, which exploits the non-local structural similarity between patches extracted from internal AR-PAM images. This improves local image details and resolution, but patch-based reconstruction also introduces artifacts. To mitigate these artifacts, we further integrate an external image dataset as an extra source of information and consolidate the group sparsity prior with a deep denoiser prior. In this way, complementary information can be exploited to improve the reconstruction results. Extensive experiments are conducted on simulated and in vivo AR-PAM images. Specifically, on the simulated images, the mean peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) increase from 16.36 dB and 0.46 to 27.62 dB and 0.92, respectively. The in vivo reconstructions also show that the proposed method achieves superior local and global perceptual quality, with the signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) increasing significantly from 10.59 and 8.61 to 30.83 and 27.54, respectively. Additionally, reconstruction fidelity is validated using optical resolution photoacoustic microscopy (OR-PAM) data as the reference.
"Acoustic Resolution Photoacoustic Microscopy Imaging Enhancement: Integration of Group Sparsity With Deep Denoiser Prior," IEEE Transactions on Image Processing, vol. 34, pp. 522-537.
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526064
Wenqi Han;Wen Jiang;Jie Geng;Wang Miao
The feature fusion of optical and Synthetic Aperture Radar (SAR) images is widely used for semantic segmentation of multimodal remote sensing images, as it leverages information from two different sensors to enhance land cover analysis. However, the imaging characteristics of optical and SAR data are vastly different, and noise interference makes the fusion of multimodal data challenging. Furthermore, in practical remote sensing applications, typically only a limited number of labeled samples are available, while most pixels remain unlabeled. Semi-supervised learning has the potential to improve model performance in scenarios with limited labeled data. However, in remote sensing applications, the quality of pseudo-labels is frequently compromised, particularly in challenging regions such as blurred edges and areas with class confusion. This degradation in label quality can have a detrimental effect on the model’s overall performance. In this paper, we introduce the Difference-complementary Learning and Label Reassignment (DLLR) network for multimodal semi-supervised semantic segmentation of remote sensing images. Our proposed DLLR framework leverages asymmetric masking to create information discrepancies between the optical and SAR modalities, and employs a difference-guided complementary learning strategy to enable mutual learning. Subsequently, we introduce a multi-level label reassignment strategy that treats label assignment as an optimal transport optimization problem, allocating unlabeled pixels to classes with higher precision and thereby enhancing the quality of the pseudo-label annotations. Finally, we introduce a multimodal consistency cross pseudo-supervision strategy to improve pseudo-label utilization. We evaluate our method on two multimodal remote sensing datasets, namely the WHU-OPT-SAR and EErDS-OPT-SAR datasets. Experimental results demonstrate that our proposed DLLR model outperforms other relevant deep networks in multimodal semantic segmentation accuracy.
{"title":"Difference-Complementary Learning and Label Reassignment for Multimodal Semi-Supervised Semantic Segmentation of Remote Sensing Images","authors":"Wenqi Han;Wen Jiang;Jie Geng;Wang Miao","doi":"10.1109/TIP.2025.3526064","DOIUrl":"10.1109/TIP.2025.3526064","url":null,"abstract":"The feature fusion of optical and Synthetic Aperture Radar (SAR) images is widely used for semantic segmentation of multimodal remote sensing images. It leverages information from two different sensors to enhance the analytical capabilities of land cover. However, the imaging characteristics of optical and SAR data are vastly different, and noise interference makes the fusion of multimodal data information challenging. Furthermore, in practical remote sensing applications, there are typically only a limited number of labeled samples available, with most pixels needing to be labeled. Semi-supervised learning has the potential to improve model performance in scenarios with limited labeled data. However, in remote sensing applications, the quality of pseudo-labels is frequently compromised, particularly in challenging regions such as blurred edges and areas with class confusion. This degradation in label quality can have a detrimental effect on the model’s overall performance. In this paper, we introduce the Difference-complementary Learning and Label Reassignment (DLLR) network for multimodal semi-supervised semantic segmentation of remote sensing images. Our proposed DLLR framework leverages asymmetric masking to create information discrepancies between the optical and SAR modalities, and employs a difference-guided complementary learning strategy to enable mutual learning. Subsequently, we introduce a multi-level label reassignment strategy, treating the label assignment problem as an optimal transport optimization task to allocate pixels to classes with higher precision for unlabeled pixels, thereby enhancing the quality of pseudo-label annotations. Finally, we introduce a multimodal consistency cross pseudo-supervision strategy to improve pseudo-label utilization. We evaluate our method on two multimodal remote sensing datasets, namely, the WHU-OPT-SAR and EErDS-OPT-SAR datasets. Experimental results demonstrate that our proposed DLLR model outperforms other relevant deep networks in terms of accuracy in multimodal semantic segmentation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"566-580"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526051
Haoyu Li;Hao Wu;Badong Chen
Reconstructing visual stimuli from functional Magnetic Resonance Imaging (fMRI) enables fine-grained retrieval of brain activity. However, the accurate reconstruction of diverse details, including structure, background, texture, color, and more, remains challenging. Stable-diffusion-based models inevitably introduce variability into the reconstructed images, even under identical conditions. To address this challenge, we first examine diffusion methods from a neuroscientific perspective: they primarily perform top-down creation using knowledge pre-trained on extensive image datasets, but tend to lack detail-driven bottom-up perception, leading to a loss of faithful details. In this paper, we propose NeuralDiffuser, which incorporates primary visual feature guidance to provide detailed cues in the form of gradients. This extension of the bottom-up process for diffusion models achieves both semantic coherence and detail fidelity when reconstructing visual stimuli. Furthermore, we develop a novel guidance strategy for reconstruction tasks that keeps repeated outputs consistent with the original images rather than varying across runs. Extensive experimental results on the Natural Scenes Dataset (NSD) qualitatively and quantitatively demonstrate the advancement of NeuralDiffuser through horizontal comparisons against baseline and state-of-the-art methods, as well as longitudinal ablation studies. Code is available at https://github.com/HaoyyLi/NeuralDiffuser.
{"title":"NeuralDiffuser: Neuroscience-Inspired Diffusion Guidance for fMRI Visual Reconstruction","authors":"Haoyu Li;Hao Wu;Badong Chen","doi":"10.1109/TIP.2025.3526051","DOIUrl":"10.1109/TIP.2025.3526051","url":null,"abstract":"Reconstructing visual stimuli from functional Magnetic Resonance Imaging (fMRI) enables fine-grained retrieval of brain activity. However, the accurate reconstruction of diverse details, including structure, background, texture, color, and more, remains challenging. The stable diffusion models inevitably result in the variability of reconstructed images, even under identical conditions. To address this challenge, we first uncover the neuroscientific perspective of diffusion methods, which primarily involve top-down creation using pre-trained knowledge from extensive image datasets, but tend to lack detail-driven bottom-up perception, leading to a loss of faithful details. In this paper, we propose NeuralDiffuser, which incorporates primary visual feature guidance to provide detailed cues in the form of gradients. This extension of the bottom-up process for diffusion models achieves both semantic coherence and detail fidelity when reconstructing visual stimuli. Furthermore, we have developed a novel guidance strategy for reconstruction tasks that ensures the consistency of repeated outputs with original images rather than with various outputs. Extensive experimental results on the Natural Senses Dataset (NSD) qualitatively and quantitatively demonstrate the advancement of NeuralDiffuser by comparing it against baseline and state-of-the-art methods horizontally, as well as conducting longitudinal ablation studies. Code can be available on <uri>https://github.com/HaoyyLi/NeuralDiffuser</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"552-565"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526056
Jian Wang;Fan Li;Song Lv;Lijun He;Chao Shen
Vision-based 3D object detection, a cost-effective alternative to LiDAR-based solutions, plays a crucial role in modern autonomous driving systems. Meanwhile, deep models have been proven susceptible to adversarial examples, and attacking detection models can lead to serious driving consequences. Most previous adversarial attacks targeted 2D detectors by placing a patch in a specific region within the object’s bounding box in the image, allowing the object to evade detection. However, attacking a 3D detector is more difficult because the adversarial object may be observed from different viewpoints and distances, and effective methods for differentiably rendering a 3D-space poster onto the image have been lacking. In this paper, we propose a novel attack setting in which a carefully crafted adversarial poster (which looks like meaningless graffiti) is learned and pasted on the road surface, inducing vision-based 3D detectors to perceive a non-existent object. We show that even a single 2D poster is sufficient to deceive the 3D detector with the desired attack effect, and that the poster is universal, remaining effective across various scenes, viewpoints, and distances. To generate the poster, an image-3D applying algorithm is devised to establish a pixel-wise mapping between the image area and the 3D-space poster so that the poster can be optimized through standard backpropagation. Moreover, a ground-truth masked optimization strategy is presented to effectively learn the poster without interference from scene objects. Extensive results, including real-world experiments, validate the effectiveness of our adversarial attack. Transferability and defense strategies are also investigated to comprehensively understand the proposed attack.
{"title":"Physically Realizable Adversarial Creating Attack Against Vision-Based BEV Space 3D Object Detection","authors":"Jian Wang;Fan Li;Song Lv;Lijun He;Chao Shen","doi":"10.1109/TIP.2025.3526056","DOIUrl":"10.1109/TIP.2025.3526056","url":null,"abstract":"Vision-based 3D object detection, a cost-effective alternative to LiDAR-based solutions, plays a crucial role in modern autonomous driving systems. Meanwhile, deep models have been proven susceptible to adversarial examples, and attacking detection models can lead to serious driving consequences. Most previous adversarial attacks targeted 2D detectors by placing the patch in a specific region within the object’s bounding box in the image, allowing it to evade detection. However, attacking 3D detector is more difficult because the adversary may be observed from different viewpoints and distances, and there is a lack of effective methods to differentiably render the 3D space poster onto the image. In this paper, we propose a novel attack setting where a carefully crafted adversarial poster (looks like meaningless graffiti) is learned and pasted on the road surface, inducing the vision-based 3D detectors to perceive a non-existent object. We show that even a single 2D poster is sufficient to deceive the 3D detector with the desired attack effect, and the poster is universal, which is effective across various scenes, viewpoints, and distances. To generate the poster, an image-3D applying algorithm is devised to establish the pixel-wise mapping relationship between the image area and the 3D space poster so that the poster can be optimized through standard backpropagation. Moreover, a ground-truth masked optimization strategy is presented to effectively learn the poster without interference from scene objects. Extensive results including real-world experiments validate the effectiveness of our adversarial attack. The transferability and defense strategy are also investigated to comprehensively understand the proposed attack.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"538-551"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3523801
Nir Yellinek;Leonid Karlinsky;Raja Giryes
Vision-Language models (VLMs) have proven effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects’ attributes, states, and the relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows this structure to be induced into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model’s success or failure.
{"title":"3VL: Using Trees to Improve Vision-Language Models’ Interpretability","authors":"Nir Yellinek;Leonid Karlinsky;Raja Giryes","doi":"10.1109/TIP.2024.3523801","DOIUrl":"10.1109/TIP.2024.3523801","url":null,"abstract":"Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects’ attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model’s success or failure.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"495-509"},"PeriodicalIF":0.0,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142934654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3460568
{"title":"IEEE Transactions on Image Processing publication information","authors":"","doi":"10.1109/TIP.2024.3460568","DOIUrl":"10.1109/TIP.2024.3460568","url":null,"abstract":"","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"C2-C2"},"PeriodicalIF":0.0,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10829516","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142934653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}