ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Pub Date : 2025-11-04 | DOI: 10.1109/tip.2025.3626887
Xumeng Han, Longhui Wei, Zhiyang Dou, Yingfei Sun, Zhenjun Han, Qi Tian
Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming the result ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation. However, we observe that performance is sensitive to the configuration of the MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder the experts from effectively acquiring helpful information. To address this, we introduce a shared expert that learns and captures common knowledge, which proves an effective way to construct a stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in specific information and which are not. This provides guidance for retaining the critical layers while removing redundant ones, making ViMoE more efficient without sacrificing accuracy. We hope this work offers new insights into the design of vision MoE models and provides valuable empirical guidance for future research.
Texture-Consistent 3D Scene Style Transfer via Transformer-Guided Neural Radiance Fields
Pub Date : 2025-11-04 | DOI: 10.1109/tip.2025.3626892
Wudi Chen, Zhiyuan Zha, Shigang Wang, Liaqat Ali, Bihan Wen, Xin Yuan, Jiantao Zhou, Ce Zhu
Recent advancements have suggested that neural radiance fields (NeRFs) show great potential in 3D style transfer. However, most existing NeRF-based style transfer methods still face considerable challenges in generating stylized images that simultaneously preserve clear scene textures and maintain strong cross-view consistency. To address these limitations, in this paper, we propose a novel transformer-guided approach for 3D scene style transfer. Specifically, we first design a transformer-based style transfer network to capture long-range dependencies and generate 2D stylized images with initial consistency, which serve as supervision for the 3D stylized generation. To enable fine-grained control over style, we propose a latent style vector as a conditional feature and design a style network that projects this style information into the 3D space. We further develop a merge network that integrates style features with scene geometry to render 3D stylized images that are both visually coherent and stylistically consistent. In addition, we propose a texture consistency loss to preserve scene structure and enhance texture fidelity across views. Extensive quantitative and qualitative experimental results demonstrate that our proposed approach outperforms many state-of-the-art methods in terms of visual perception, image quality and multi-view consistency. Our code and more results are available at: https://github.com/PaiDii/TGTC-Style.git.
Probability Map-Guided Network for 3D Volumetric Medical Image Segmentation
Pub Date : 2025-11-04 | DOI: 10.1109/tip.2025.3623259
Zhiqin Zhu, Zimeng Zhang, Guanqiu Qi, Yuanyuan Li, Pan Yang, Yu Liu
3D medical images are volumetric data that provide spatial continuity and multi-dimensional information, offering rich anatomical context. However, their anisotropy can reduce image detail along certain directions, causing blurring or distortion between slices. In addition, global or local intensity inhomogeneities, arising from limitations of the imaging equipment, inappropriate scanning parameters, or variations in patient anatomy, are often observed; such inhomogeneity can blur lesion boundaries and mask true features, causing a model to focus on irrelevant regions. To address these issues, a probability map-guided network for 3D volumetric medical image segmentation (3D-PMGNet) is proposed. The probability maps generated from intermediate features serve as supervisory signals that guide the segmentation process. A new probability map reconstruction method combines dynamic thresholding with local adaptive smoothing, enhancing the reliability of high-response regions while suppressing low-response noise. A learnable channel-wise temperature coefficient adjusts the probability distribution to bring it closer to the true distribution. In addition, a feature fusion method based on dynamic prompt encoding is developed, in which the response strength of the main feature maps is dynamically adjusted through spatial position encoding derived from the probability maps. The proposed method has been evaluated on four datasets, and experimental results show that it outperforms state-of-the-art 3D medical image segmentation methods. The source code has been publicly released at https://github.com/ZHANGZIMENG01/3D-PMGNet.
{"title":"Probability Map-Guided Network for 3D Volumetric Medical Image Segmentation.","authors":"Zhiqin Zhu,Zimeng Zhang,Guanqiu Qi,Yuanyuan Li,Pan Yang,Yu Liu","doi":"10.1109/tip.2025.3623259","DOIUrl":"https://doi.org/10.1109/tip.2025.3623259","url":null,"abstract":"3D medical images are volumetric data that provide spatial continuity and multi-dimensional information. These features provide rich anatomical context. However, their anisotropy may result in reduced image detail along certain directions. This can cause blurring or distortion between slices. In addition, global or local intensity inhomogeneities are often observed. This may be due to limitations of the imaging equipment, inappropriate scanning parameters, or variations in the patient's anatomy. This inhomogeneity may blur lesion boundaries and may also mask true features, causing the model to focus on irrelevant regions. Therefore, a probability map-guided network for 3D volumetric medical image segmentation (3D-PMGNet) is proposed. The probability maps generated from the intermediate features are used as supervisory signals to guide the segmentation process. A new probability map reconstruction method is designed, combining dynamic thresholding with local adaptive smoothing. This enhances the reliability of high-response regions while suppressing low-response noise. A learnable channel-wise temperature coefficient is introduced to adjust the probability distribution to make it closer to the true distribution; in addition, a feature fusion method based on dynamic prompt encoding is developed. The response strength of the main feature maps is dynamically adjusted, and this adjustment is achieved through the spatial position encoding derived from the probability maps. The proposed method has been evaluated on four datasets. Experimental results show that the proposed method outperforms state-of-the-art 3D medical image segmentation methods. The source codes have been publicly released at https://github.com/ZHANGZIMENG01/3D-PMGNet.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"29 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145440835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constructing Per-Shot Bitrate Ladders using Visual Information Fidelity
Pub Date : 2025-10-31 | DOI: 10.1109/tip.2025.3625750
Krishna Srikar Durbha, Alan C. Bovik
Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
Pub Date : 2025-10-30 | DOI: 10.1109/tip.2025.3625380
Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj
Deep Semi-smooth Newton-driven Unfolding Network for Multi-modal Image Super-Resolution
Pub Date : 2025-10-30 | DOI: 10.1109/tip.2025.3625429
Chenxiao Zhang, Xin Deng, Jingyi Xu, Yongxuan Dou, Mai Xu
Deep unfolding has emerged as a powerful solution for Multi-modal Image Super-Resolution (MISR) through the strategic integration of cross-modal priors into the network architecture. However, current deep unfolding approaches rely on first-order optimization, which exhibits limitations in learning efficiency and reconstruction accuracy. To overcome these limitations, we propose a novel Semi-smooth Newton-driven Unfolding network for MISR, named SNUM-Net. Specifically, we first develop a Semi-smooth Newton-driven MISR (SNM) algorithm that establishes a theoretical foundation for our approach, and then unfold its iterative solution into a novel network. To the best of our knowledge, SNUM-Net is the first successful attempt to design a deep unfolding MISR network based on a second-order optimization algorithm. Compared to existing methods, SNUM-Net demonstrates three main advantages. 1) Universal paradigm: SNUM-Net provides a unified paradigm for diverse MISR tasks without requiring scenario-specific constraints. 2) Explainable framework: the network preserves a mathematical correspondence with the SNM algorithm, ensuring that the topological relationships between its modules remain explainable. 3) Superior performance: comprehensive evaluations across 10 datasets spanning 3 MISR tasks demonstrate the network's exceptional reconstruction accuracy and generalization capability. The source code is available at https://github.com/pandazcx/SNUM-Net.
Unleashing the Power of Each Distilled Image
Pub Date : 2025-10-28 | DOI: 10.1109/tip.2025.3624626
Jingxuan Zhang, Zhihua Chen, Lei Dai
Dataset distillation (DD) aims to accelerate the training of neural networks (NNs) by synthesizing a reduced dataset, such that NNs trained on the smaller dataset obtain almost the same test accuracy as they would on the larger one. Previous DD research treated the distilled dataset as a regular training set, neglecting the overfitting caused by the limited number of distilled images. In this paper, we propose a new DD paradigm. Specifically, in the deployment stage, distilled images are augmented by amplifying their local information, since the teacher network produces diverse supervision signals when receiving inputs from different regions. Efficient and diverse augmentation methods are devised for each distilled image while preserving the authenticity of the augmented samples. Additionally, to alleviate the increased training cost caused by the augmentation, we design a bi-directional dynamic dataset pruning technique that prunes both the original and the augmented distilled datasets, with a new pruning strategy and schedule derived from experimental findings. Experiments on 9 benchmark datasets (CIFAR10, CIFAR100, ImageWoof, ImageCat, ImageFruit, ImageNette, ImageNet10, ImageNet100 and ImageNet1K) demonstrate the effectiveness of our approach. For instance, on ImageNet1K with a ResNet18 architecture and 50 distilled images per class, our algorithm surpasses the second-ranked MiniMax algorithm by 7.6%, achieving a distilled accuracy of 66.2%.
{"title":"Unleashing the Power of Each Distilled Image.","authors":"Jingxuan Zhang,Zhihua Chen,Lei Dai","doi":"10.1109/tip.2025.3624626","DOIUrl":"https://doi.org/10.1109/tip.2025.3624626","url":null,"abstract":"Dataset distillation (DD) aims to accelerate the training speed of neural networks (NNs) by synthesizing a reduced dataset. NNs trained on the smaller dataset are expected to obtain almost the same test set accuracy as they do on the larger one. Previous DD research treated the obtained distilled dataset as a regular dataset for training, neglecting the overfitting issue caused by the limited number of original distilled images. In this paper, we propose a new DD paradigm. Specifically, in the deployment stage, distilled images are augmented by amplifying their local information since the teacher network can produce diverse supervision signals when receiving inputs from different regions. Efficient and diverse augmentation methods for each distilled image are devised, while ensuring the authenticity of augmented samples. Additionally, to alleviate the increased training cost caused by data augmentation, we design a bi-directional dynamic dataset pruning technique to prune the original distilled dataset and augmented distilled dataset. A new pruning strategy and scheduling are proposed based on experimental findings. Experiments on 9 benchmark datasets (CIFAR10, CIFAR100, ImageWoof, ImageCat, ImageFruit, ImageNette, ImageNet10, ImageNet100 and ImageNet1K) demonstrate the effectiveness of our approach. For instance, on the ImageNet1K dataset with a ResNet18 architecture and 50 distilled images per class, our algorithm surpasses the second-ranked MiniMax algorithm by 7.6%, achieving a distilled accuracy of 66.2%.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"15 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}