Texture-Consistent 3D Scene Style Transfer via Transformer-Guided Neural Radiance Fields
Pub Date: 2025-11-04 | DOI: 10.1109/tip.2025.3626892
Wudi Chen, Zhiyuan Zha, Shigang Wang, Liaqat Ali, Bihan Wen, Xin Yuan, Jiantao Zhou, Ce Zhu
Recent advancements have suggested that neural radiance fields (NeRFs) show great potential in 3D style transfer. However, most existing NeRF-based style transfer methods still face considerable challenges in generating stylized images that simultaneously preserve clear scene textures and maintain strong cross-view consistency. To address these limitations, in this paper, we propose a novel transformer-guided approach for 3D scene style transfer. Specifically, we first design a transformer-based style transfer network to capture long-range dependencies and generate 2D stylized images with initial consistency, which serve as supervision for the 3D stylized generation. To enable fine-grained control over style, we propose a latent style vector as a conditional feature and design a style network that projects this style information into the 3D space. We further develop a merge network that integrates style features with scene geometry to render 3D stylized images that are both visually coherent and stylistically consistent. In addition, we propose a texture consistency loss to preserve scene structure and enhance texture fidelity across views. Extensive quantitative and qualitative experimental results demonstrate that our proposed approach outperforms many state-of-the-art methods in terms of visual perception, image quality, and multi-view consistency. Our code and more results are available at: https://github.com/PaiDii/TGTC-Style.git.
Probability Map-Guided Network for 3D Volumetric Medical Image Segmentation
Pub Date: 2025-11-04 | DOI: 10.1109/tip.2025.3623259
Zhiqin Zhu, Zimeng Zhang, Guanqiu Qi, Yuanyuan Li, Pan Yang, Yu Liu
3D medical images are volumetric data whose spatial continuity and multi-dimensional information provide rich anatomical context. However, their anisotropy can reduce image detail along certain directions, causing blurring or distortion between slices. In addition, global or local intensity inhomogeneities are often observed, arising from limitations of the imaging equipment, inappropriate scanning parameters, or variations in patient anatomy; such inhomogeneity can blur lesion boundaries and mask true features, causing the model to focus on irrelevant regions. To address these issues, a probability map-guided network for 3D volumetric medical image segmentation (3D-PMGNet) is proposed. Probability maps generated from intermediate features serve as supervisory signals that guide the segmentation process. A new probability map reconstruction method combines dynamic thresholding with local adaptive smoothing, enhancing the reliability of high-response regions while suppressing low-response noise. A learnable channel-wise temperature coefficient adjusts the probability distribution to bring it closer to the true distribution, and a feature fusion method based on dynamic prompt encoding dynamically adjusts the response strength of the main feature maps through spatial position encoding derived from the probability maps. The proposed method has been evaluated on four datasets, and experimental results show that it outperforms state-of-the-art 3D medical image segmentation methods. The source code has been publicly released at https://github.com/ZHANGZIMENG01/3D-PMGNet.
{"title":"Probability Map-Guided Network for 3D Volumetric Medical Image Segmentation.","authors":"Zhiqin Zhu,Zimeng Zhang,Guanqiu Qi,Yuanyuan Li,Pan Yang,Yu Liu","doi":"10.1109/tip.2025.3623259","DOIUrl":"https://doi.org/10.1109/tip.2025.3623259","url":null,"abstract":"3D medical images are volumetric data that provide spatial continuity and multi-dimensional information. These features provide rich anatomical context. However, their anisotropy may result in reduced image detail along certain directions. This can cause blurring or distortion between slices. In addition, global or local intensity inhomogeneities are often observed. This may be due to limitations of the imaging equipment, inappropriate scanning parameters, or variations in the patient's anatomy. This inhomogeneity may blur lesion boundaries and may also mask true features, causing the model to focus on irrelevant regions. Therefore, a probability map-guided network for 3D volumetric medical image segmentation (3D-PMGNet) is proposed. The probability maps generated from the intermediate features are used as supervisory signals to guide the segmentation process. A new probability map reconstruction method is designed, combining dynamic thresholding with local adaptive smoothing. This enhances the reliability of high-response regions while suppressing low-response noise. A learnable channel-wise temperature coefficient is introduced to adjust the probability distribution to make it closer to the true distribution; in addition, a feature fusion method based on dynamic prompt encoding is developed. The response strength of the main feature maps is dynamically adjusted, and this adjustment is achieved through the spatial position encoding derived from the probability maps. The proposed method has been evaluated on four datasets. Experimental results show that the proposed method outperforms state-of-the-art 3D medical image segmentation methods. The source codes have been publicly released at https://github.com/ZHANGZIMENG01/3D-PMGNet.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"29 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145440835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constructing Per-Shot Bitrate Ladders using Visual Information Fidelity
Pub Date: 2025-10-31 | DOI: 10.1109/tip.2025.3625750
Krishna Srikar Durbha, Alan C. Bovik
Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
Pub Date: 2025-10-30 | DOI: 10.1109/tip.2025.3625380
Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj
Deep Semi-smooth Newton-driven Unfolding Network for Multi-modal Image Super-Resolution
Pub Date: 2025-10-30 | DOI: 10.1109/tip.2025.3625429
Chenxiao Zhang, Xin Deng, Jingyi Xu, Yongxuan Dou, Mai Xu
Deep unfolding has emerged as a powerful solution for Multi-modal Image Super-Resolution (MISR) through the strategic integration of cross-modal priors into the network architecture. However, current deep unfolding approaches rely on first-order optimization, which limits their learning efficiency and reconstruction accuracy. In this paper, to overcome these limitations, we propose a novel Semi-smooth Newton-driven Unfolding network for MISR, namely SNUM-Net. Specifically, we first develop a Semi-smooth Newton-driven MISR (SNM) algorithm that establishes a theoretical foundation for our approach. Then, we unfold the iterative solution of SNM into a novel network. To the best of our knowledge, SNUM-Net is the first successful attempt to design a deep unfolding MISR network based on a second-order optimization algorithm. Compared to existing methods, SNUM-Net demonstrates three main advantages. 1) Universal paradigm: SNUM-Net provides a unified paradigm for diverse MISR tasks without requiring scenario-specific constraints; 2) Explainable framework: the network preserves a mathematical correspondence with the SNM algorithm, ensuring that the topological relationships between modules are well explainable; 3) Superior performance: comprehensive evaluations across 10 datasets spanning 3 MISR tasks demonstrate the network's exceptional reconstruction accuracy and generalization capability. The source code is available at https://github.com/pandazcx/SNUM-Net.
Unleashing the Power of Each Distilled Image
Pub Date: 2025-10-28 | DOI: 10.1109/tip.2025.3624626
Jingxuan Zhang, Zhihua Chen, Lei Dai
Dataset distillation (DD) aims to accelerate the training of neural networks (NNs) by synthesizing a reduced dataset; NNs trained on the smaller dataset are expected to reach almost the same test accuracy as those trained on the larger one. Previous DD research treated the distilled dataset as a regular training set, neglecting the overfitting caused by the limited number of original distilled images. In this paper, we propose a new DD paradigm. Specifically, in the deployment stage, distilled images are augmented by amplifying their local information, since the teacher network can produce diverse supervision signals when receiving inputs from different regions. We devise efficient and diverse augmentation methods for each distilled image while ensuring the authenticity of the augmented samples. Additionally, to alleviate the increased training cost caused by data augmentation, we design a bi-directional dynamic dataset pruning technique that prunes both the original and the augmented distilled datasets; a new pruning strategy and schedule are proposed based on experimental findings. Experiments on 9 benchmark datasets (CIFAR10, CIFAR100, ImageWoof, ImageCat, ImageFruit, ImageNette, ImageNet10, ImageNet100 and ImageNet1K) demonstrate the effectiveness of our approach. For instance, on the ImageNet1K dataset with a ResNet18 architecture and 50 distilled images per class, our algorithm surpasses the second-ranked MiniMax algorithm by 7.6%, achieving a distilled accuracy of 66.2%.
{"title":"Unleashing the Power of Each Distilled Image.","authors":"Jingxuan Zhang,Zhihua Chen,Lei Dai","doi":"10.1109/tip.2025.3624626","DOIUrl":"https://doi.org/10.1109/tip.2025.3624626","url":null,"abstract":"Dataset distillation (DD) aims to accelerate the training speed of neural networks (NNs) by synthesizing a reduced dataset. NNs trained on the smaller dataset are expected to obtain almost the same test set accuracy as they do on the larger one. Previous DD research treated the obtained distilled dataset as a regular dataset for training, neglecting the overfitting issue caused by the limited number of original distilled images. In this paper, we propose a new DD paradigm. Specifically, in the deployment stage, distilled images are augmented by amplifying their local information since the teacher network can produce diverse supervision signals when receiving inputs from different regions. Efficient and diverse augmentation methods for each distilled image are devised, while ensuring the authenticity of augmented samples. Additionally, to alleviate the increased training cost caused by data augmentation, we design a bi-directional dynamic dataset pruning technique to prune the original distilled dataset and augmented distilled dataset. A new pruning strategy and scheduling are proposed based on experimental findings. Experiments on 9 benchmark datasets (CIFAR10, CIFAR100, ImageWoof, ImageCat, ImageFruit, ImageNette, ImageNet10, ImageNet100 and ImageNet1K) demonstrate the effectiveness of our approach. For instance, on the ImageNet1K dataset with a ResNet18 architecture and 50 distilled images per class, our algorithm surpasses the second-ranked MiniMax algorithm by 7.6%, achieving a distilled accuracy of 66.2%.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"15 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Where and What: Contextual Dynamics-aware Anomaly Detection in Surveillance Videos
Pub Date: 2025-10-24 | DOI: 10.1109/tip.2025.3623392
Deok-Hyun Ahn, YongJin Jo, DongBum Kim, Gi-Pyo Nam, Jae-Ho Han, Haksub Kim
In surveillance environments, detecting anomalies requires understanding the contextual dynamics of the environment, human behaviors, and movements within a scene. Effective anomaly detection must address both the "where" and the "what" of events, but existing approaches such as unimodal action-based methods or LLM-integrated multimodal frameworks have limitations: they either rely on implicit scene information, making it difficult to localize where anomalies occur, or fail to adapt to surveillance-specific challenges such as view changes, subtle actions, low-light conditions, and crowded scenes, which hinders accurate detection of what occurs. To overcome these limitations, our system leverages features from a lightweight scene classification model to discern where an event occurs, acquiring explicit location-based context. To identify what events occur, it focuses on atomic actions, which remain underexplored in this field and are better suited to interpreting intricate abnormal behaviors than conventional abstract action features. To achieve robust anomaly detection, the proposed Temporal-Semantic Relationship Network (TSRN) models spatio-temporal relationships among multimodal features and employs a Segment-selective Focal Margin loss (SFML) to effectively address class imbalance, outperforming conventional multiple-instance-learning (MIL)-based methods. Experimental results demonstrate that our system significantly reduces false alarms compared with existing methods while maintaining robustness across diverse scenarios. Quantitative and qualitative evaluations on public datasets validate the practical effectiveness of the proposed method for real-world surveillance applications.
{"title":"Where and What: Contextual Dynamics-aware Anomaly Detection in Surveillance Videos.","authors":"Deok-Hyun Ahn,YongJin Jo,DongBum Kim,Gi-Pyo Nam,Jae-Ho Han,Haksub Kim","doi":"10.1109/tip.2025.3623392","DOIUrl":"https://doi.org/10.1109/tip.2025.3623392","url":null,"abstract":"In surveillance environments, detecting anomalies requires understanding the contextual dynamics of the environment, human behaviors, and movements within a scene. Effective anomaly detection must address both the where and what of events, but existing approaches such as unimodal action-based methods or LLM-integrated multimodal frameworks have limitations. These methods either rely on implicit scene information, making it difficult to localize where anomalies occur, or fail to adapt to surveillance specific challenges such as view changes, subtle actions, low light conditions, and crowded scenes. As a result, these challenges hinder accurate detection of what occurs. To overcome these limitations, our system takes advantage of features from a lightweight scene classification model to discern where an event occurs, acquiring explicit location-based context. To identify what events occur, it focuses on atomic actions, which remain underexplored in this field and are better suited to interpreting intricate abnormal behaviors than conventional abstract action features. To achieve robust anomaly detection, the proposed Temporal-Semantic Relationship Network (TSRN) models spatio-temporal relationships among multimodal features and employs a Segment-selective Focal Margin loss (SFML) to effectively address class imbalance, outperforming conventional MIL-based methods. Compared to existing methods, experimental results demonstrate that our system significantly reduces false alarms while maintaining robustness across diverse scenarios. Quantitative and qualitative evaluations on public datasets validate the practical effectiveness of the proposed method for real-world surveillance applications.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"19 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BinoHeM: Binocular Singular Hellinger Metametric for Fine-Grained Few-Shot Classification
Pub Date: 2025-10-24 | DOI: 10.1109/tip.2025.3623379
Chaofei Qi, Chao Ye, Weiyang Lin, Zhitai Liu, Jianbin Qiu
Meta-metric learning has demonstrated strong performance in coarse-grained few-shot scenarios. However, despite their simplicity and availability, existing meta-metrics struggle to handle fine-grained few-shot scenarios effectively. Fine-Grained Few-Shot Classification (FGFSC) places significant demands on a network's ability to extract subtle features. Equipped with a symmetric binocular perception system and the brain's complex neural networks, humans inherently possess exceptional and resilient meta-learning abilities that help them cope with fine-grained few-shot scenarios. In this paper, inspired by the human binocular visual system, we pioneer the first human-like meta-metric paradigm: the Binocular Singular Hellinger Metametric (BinoHeM). Functionally, BinoHeM incorporates advanced symmetric binocular feature encoding and recognition mechanisms. Structurally, it integrates two binocular sensing feature encoders, a singular Hellinger metametric, and two collaborative identification mechanisms. Building on this foundation, we introduce two innovative metametric variants, BinoHeM-KDL and BinoHeM-MTL, grounded in two advanced training mechanisms: knowledge distillation learning (KDL) and meta-transfer learning (MTL), respectively. Furthermore, we showcase the high accuracy and robust generalization capability of our approaches on four representative FGFSC benchmarks. Extensive comparative and ablation experiments validate the efficiency and superiority of our paradigm over other state-of-the-art algorithms. Our code is publicly available at: https://github.com/ChaofeiQI/BinoHeM.