Underdetermined Blind Source Separation via Weighted Simplex Shrinkage Regularization and Quantum Deep Image Prior
Pub Date: 2026-03-19, DOI: 10.1109/tip.2026.3673957
Chia-Hsiang Lin, Si-Sheng Young
As most optical satellites remotely acquire multispectral images (MSIs) with limited spatial resolution, multispectral unmixing (MU) becomes a critical signal processing technology for analyzing the pure material spectra for high-precision classification and identification. Unlike the widely investigated hyperspectral unmixing (HU) problem, MU is much more challenging, as it corresponds to an underdetermined blind source separation (BSS) problem in which the number of sources is larger than the number of available multispectral bands. In this article, we transform MU into its overdetermined counterpart (i.e., HU) by inventing a radically new quantum deep image prior (QDIP), which relies on a virtual band-splitting task conducted on the observed MSI to generate a virtual hyperspectral image (HSI). Then, we perform HU on the virtual HSI to obtain the virtual hyperspectral sources. Although HU is overdetermined, it still suffers from ill-posedness, which we mitigate by exploiting the convex geometry structure of the HSI pixels to customize a weighted simplex shrinkage (WSS) regularizer. Finally, the virtual hyperspectral sources are spectrally downsampled to obtain the desired multispectral sources. The proposed geometry/quantum-empowered MU (GQ-μ) algorithm also effectively obtains the spatial abundance distribution map for each source, where the geometric WSS regularization is adaptively and automatically controlled based on the sparsity pattern of the abundance tensor. Simulation and real-world data experiments demonstrate the practicality of our unsupervised GQ-μ algorithm for the challenging MU task. An ablation study demonstrates the strength of QDIP, which is not achieved by classical DIP, and validates the mechanics-inspired WSS geometry regularizer. The associated code will be available at https://github.com/IHCLab/GQ-mu.
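For readers unfamiliar with the convex-geometry view the abstract relies on, the sketch below (not the GQ-μ implementation; all sizes and names are hypothetical) shows the linear mixing model, in which each pixel is a nonnegative, sum-to-one combination of source spectra, together with a standard projection onto the probability simplex of the kind that simplex-based regularizers such as WSS build on.

```python
# Minimal sketch of the convex-geometry view behind simplex-regularized unmixing.
# Not the authors' GQ-mu code; all sizes below are made up for illustration.
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto {a : a >= 0, sum(a) = 1}
    via the standard sort-and-threshold construction."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(0)
n_bands, n_sources, n_pixels = 30, 5, 100        # hypothetical problem sizes
E = rng.random((n_bands, n_sources))             # source (endmember) spectra
A = np.apply_along_axis(project_to_simplex, 0,   # abundances live on the simplex
                        rng.random((n_sources, n_pixels)))
X = E @ A                                        # linear mixing model: pixels = spectra @ abundances
print(X.shape, np.allclose(A.sum(axis=0), 1.0))  # (30, 100) True
```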
{"title":"Underdetermined Blind Source Separation via Weighted Simplex Shrinkage Regularization and Quantum Deep Image Prior.","authors":"Chia-Hsiang Lin,Si-Sheng Young","doi":"10.1109/tip.2026.3673957","DOIUrl":"https://doi.org/10.1109/tip.2026.3673957","url":null,"abstract":"As most optical satellites remotely acquire multispectral images (MSIs) with limited spatial resolution, multispectral unmixing (MU) becomes a critical signal processing technology for analyzing the pure material spectra for high-precision classification and identification. Unlike the widely investigated hyperspectral unmixing (HU) problem, MU is much more challenging as it corresponds to the underdetermined blind source separation (BSS) problem, where the number of sources is larger than the number of available multispectral bands. In this article, we transform MU into its overdetermined counterpart (i.e., HU) by inventing a radically new quantum deep image prior (QDIP), which relies on the virtual band-splitting task conducted on the observed MSI for generating the virtual hyperspectral image (HSI). Then, we perform HU on the virtual HSI to obtain the virtual hyperspectral sources. Though HU is overdetermined, it still suffers from the ill-posed issue, for which we employ the convex geometry structure of the HSI pixels to customize a weighted simplex shrinkage (WSS) regularizer to mitigate the ill-posedness. Finally, the virtual hyperspectral sources are spectrally downsampled to obtain the desired multispectral sources. The proposed geometry/quantum-empowered MU (GQ-μ) algorithm can also effectively obtain the spatial abundance distribution map for each source, where the geometric WSS regularization is adaptively and automatically controlled based on the sparsity pattern of the abundance tensor. Simulation and real-world data experiments demonstrate the practicality of our unsupervised GQ-μ algorithm for the challenging MU task. Ablation study demonstrates the strength of QDIP, not achieved by classical DIP, and validates the mechanics-inspired WSS geometry regularizer. The associated code will be available at https://github.com/IHCLab/GQ-mu.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"57 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147483711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DCL: Dynamic Causal Learning for Cross-modality Cardiac Image Segmentation
Pub Date: 2026-03-19, DOI: 10.1109/tip.2026.3673293
Saidi Guo, Xinlong Liu, Qixin Lin, Weijie Cai, Guohua Zhao, Mingyi Wu, Qiujie Lv, Laurence T Yang
Accurate cross-modality cardiac image segmentation is essential for effectively diagnosing and treating heart disease, and different imaging modalities help to determine suitable pre-procedure planning. However, most methods face the difficulty of spatial-temporal confounding, where the anatomy element and the modality element of cardiac images are intertwined across both spatial and temporal dimensions. This confounding arises from the imaging diversity and structural diversity of cardiac images, and it hinders knowledge transfer between cardiac images of different modalities. In this paper, we propose a novel dynamic causal learning (DCL) framework to resolve spatial-temporal confounding. DCL explores multi-dimensional causal intervention to consider not only the causal relationship between images and labels, but also causality along the temporal and spatial dimensions. It integrates historical optimal interventions and facilitates the transfer of this knowledge across temporal contexts. In addition, DCL utilizes a diffusion mechanism to further ensure that the extracted anatomy element remains causally invariant, improving model performance across multiple imaging modalities. Extensive experiments on cross-modality cardiac images (MR, CT, and US) demonstrate the effectiveness of DCL (mean Dice = 0.951), outperforming other advanced segmentation methods. DCL is freely accessible at https://github.com/asdww0721ww/DCL.
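The reported mean Dice of 0.951 refers to the standard Dice overlap score for segmentation masks. The snippet below is a generic illustration of that metric on toy masks, not the authors' evaluation pipeline.

```python
# Generic Dice coefficient for binary segmentation masks: Dice = 2|P ∩ T| / (|P| + |T|).
# The toy masks here are made up; this is not the DCL evaluation code.
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice overlap between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((64, 64), dtype=np.uint8)
pred[10:40, 10:40] = 1                       # toy predicted mask
target = np.zeros((64, 64), dtype=np.uint8)
target[12:42, 12:42] = 1                     # toy ground-truth mask
print(f"Dice = {dice_coefficient(pred, target):.3f}")
```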
{"title":"DCL: Dynamic Causal Learning for Cross-modality Cardiac Image Segmentation.","authors":"Saidi Guo,Xinlong Liu,Qixin Lin,Weijie Cai,Guohua Zhao,Mingyi Wu,Qiujie Lv,Laurence T Yang","doi":"10.1109/tip.2026.3673293","DOIUrl":"https://doi.org/10.1109/tip.2026.3673293","url":null,"abstract":"Accurate cross-modality cardiac image segmentation is essential for effectively diagnosing and treating heart disease. Different imaging modalities help to determine suitable pre-procedure planning. However, most methods face the difficulty of spatial-temporal confounding, where the anatomy element and modality element of cardiac images are intertwined across both spatial and temporal dimensions. It is derived from the imaging diversity and structure diversity of cardiac images. The spatial-temporal confounding hinders knowledge transfer between cardiac images on different modalities. In this paper, we propose a novel dynamic causal learning (DCL) to solve spatial-temporal confounding. The DCL explores multi-dimensional causal intervention to consider not only the causal relationship between images and labels, but also the causality in time dimension and space dimension. It integrates historical optimal interventions and facilitates the transfer of this knowledge across temporal contexts. In addition, the DCL utilizes the diffusion mechanism to further ensure that the extracted anatomy element remains causal invariant, improving model performance across multiple imaging modalities. Extensive experiments on cross-modality cardiac images (MR, CT, and US) demonstrate the effectiveness of the DCL (mean Dice = 0.951), outperforming other advanced segmentation methods. DCL is freely accessible at https://github.com/asdww0721ww/DCL.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"59 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147483710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous Federated Dynamic Graph HyperNetwork for Image Classification
Pub Date: 2026-03-16, DOI: 10.1109/tip.2026.3672375
Liu Yang, Kegen Chen, Qilong Wang, Zhengyi Xu, Shiqiao Gu, Qinghua Hu
Federated learning (FL) enables privacy-preserving collaboration among distributed clients, but practical deployments often face heterogeneous models and non-IID data, leading to degraded communication efficiency and personalization. In addition, real-world FL systems frequently encounter newly joined clients that require rapid adaptation and abnormal clients that may upload corrupted updates, further exacerbating instability and hindering global convergence. To address these challenges in image classification, we propose HFedDGHN, a Heterogeneous Federated Dynamic Graph HyperNetwork that jointly models inter-client relations and personalized parameter generation. Specifically, a graph structure learner adaptively captures client correlations to construct a dynamic collaboration graph, while a graph-convolutional hypernetwork generates model parameters for heterogeneous architectures, enabling implicit knowledge transfer without sharing local data or weights. Moreover, the framework naturally supports meta-learning-based generalization, allowing efficient adaptation to newly joined clients. Furthermore, the dynamic graph enhances robustness by isolating abnormal clients, as they tend to be excluded from most neighborhoods during adaptive graph construction. Extensive experiments across multiple benchmarks demonstrate that HFedDGHN achieves superior accuracy compared to state-of-the-art personalized and heterogeneous FL methods, while naturally improving robustness and scalability in real-world deployments.
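The sketch below illustrates, under stated assumptions, the general idea of a graph-convolutional hypernetwork as described in the abstract: learnable client embeddings are mixed over a client-relation graph and mapped to per-client parameters. It is not the HFedDGHN code; the dimensions, the toy adjacency, and the target layer shape are invented for illustration.

```python
# Hedged sketch of a graph-convolutional hypernetwork for per-client parameter
# generation. Not the authors' HFedDGHN implementation; sizes are hypothetical.
import torch
import torch.nn as nn

class GraphHyperNetwork(nn.Module):
    def __init__(self, n_clients, embed_dim, target_param_count):
        super().__init__()
        self.client_embed = nn.Embedding(n_clients, embed_dim)  # one embedding per client
        self.gcn_weight = nn.Linear(embed_dim, embed_dim)       # single graph-conv step
        self.param_head = nn.Linear(embed_dim, target_param_count)

    def forward(self, adjacency):
        # Row-normalize the (possibly learned, dynamic) adjacency, aggregate
        # neighbor embeddings, then map each client's mixed embedding to parameters.
        adj = adjacency / adjacency.sum(dim=1, keepdim=True).clamp(min=1e-6)
        h = torch.relu(self.gcn_weight(adj @ self.client_embed.weight))
        return self.param_head(h)        # row i = generated parameters for client i

n_clients = 4
hyper = GraphHyperNetwork(n_clients, embed_dim=16, target_param_count=10 * 5)  # e.g. a 10x5 layer
A = torch.eye(n_clients) + 0.5 * torch.rand(n_clients, n_clients)              # toy collaboration graph
client_params = hyper(A).view(n_clients, 10, 5)
print(client_params.shape)   # torch.Size([4, 10, 5])
```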
{"title":"Heterogeneous Federated Dynamic Graph HyperNetwork for Image Classification.","authors":"Liu Yang,Kegen Chen,Qilong Wang,Zhengyi Xu,Shiqiao Gu,Qinghua Hu","doi":"10.1109/tip.2026.3672375","DOIUrl":"https://doi.org/10.1109/tip.2026.3672375","url":null,"abstract":"Federated learning (FL) enables privacy-preserving collaboration among distributed clients, but practical deployments often face heterogeneous models and non-IID data, leading to degraded communication and personalization. In addition, real-world FL systems frequently encounter newly joined clients that require rapid adaptation and abnormal clients that may upload corrupted updates, further exacerbating instability and hindering global convergence. To address these challenges in image classification, we propose HFedDGHN, a Heterogeneous Federated Dynamic Graph HyperNetwork that jointly models inter-client relations and personalized parameter generation. Specifically, a graph structure learner adaptively captures client correlations to construct a dynamic collaboration graph, while a graph-convolutional hypernetwork generates model parameters for heterogeneous architectures, enabling implicit knowledge transfer without sharing local data or weights. Moreover, the framework naturally supports meta-learning-based generalization, allowing efficient adaptation to newly joined clients. Furthermore, the dynamic graph enhances robustness by isolating abnormal clients, as they tend to be excluded from most neighborhoods during adaptive graph construction. Extensive experiments across multiple benchmarks demonstrate that HFed-DGHN achieves superior accuracy compared to state-of-the-art personalized and heterogeneous FL methods, while naturally improving robustness and scalability in real-world deployments.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"414 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147465043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identity-Compensated Style Distillation for Visible-Infrared Person Re-Identification
Pub Date: 2026-03-16, DOI: 10.1109/tip.2026.3672369
Yongguo Ling, Zihao Hu, Nan Pu, Zhun Zhong, Xudong Jiang
Visible-Infrared Person Re-Identification (VI-ReID), which matches pedestrian images across visible and infrared modalities, suffers from substantial modality discrepancies and intra-class variations. While existing methods typically address the modality gap via style alignment, they often lose identity-relevant semantics and overlook fine-grained inter-class nuances, such as body part contours and structural cues around the head, shoulders, or feet. To tackle these challenges, we propose an Identity-Compensated Style Distillation (ICSD) network that enforces cross-modality style consistency and enhances the discriminative power of modality-invariant features. Specifically, ICSD comprises two core components: (1) a Style Knowledge Distillation (SKD) module, which integrates Style Discrepancy Reduction (SDR) and Identity Knowledge Compensation (IKC) to align modality styles while preserving identity-relevant semantics; and (2) an Identity Discrimination Amplification (IDA) module, which captures and enhances subtle inter-class differences by refining identity-specific cues, thereby facilitating more accurate discrimination between different pedestrians. Extensive experiments on three public benchmarks (SYSU-MM01, RegDB, and LLCM) demonstrate that ICSD consistently outperforms state-of-the-art methods, validating the effectiveness and complementarity of its components.
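As background on the distillation component, the snippet below shows the textbook knowledge-distillation objective (a softened KL term plus an identity classification loss). It is only a generic illustration, not the ICSD SKD/IKC modules; the temperature and loss weighting are assumptions.

```python
# Generic knowledge-distillation objective (softened KL + identity cross-entropy).
# This is the standard recipe for illustration, not the ICSD modules themselves.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha-weighted mix of softened-prediction KL and hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(8, 100)              # student logits over 100 identities (toy)
t = torch.randn(8, 100)              # teacher logits, e.g. from the other modality branch (toy)
y = torch.randint(0, 100, (8,))      # toy identity labels
print(distillation_loss(s, t, y).item())
```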
{"title":"Identity-Compensated Style Distillation for Visible-Infrared Person Re-Identification.","authors":"Yongguo Ling,Zihao Hu,Nan Pu,Zhun Zhong,Xudong Jiang","doi":"10.1109/tip.2026.3672369","DOIUrl":"https://doi.org/10.1109/tip.2026.3672369","url":null,"abstract":"Visible-Infrared Person Re-Identification (VI-ReID) that matches pedestrian images across visible and infrared modalities suffers from substantial modality discrepancies and intra-class variations. While existing methods typically address the modality gap via style alignment, they often lose identity-relevant semantics and overlook fine-grained inter-class nuances, such as body part contours and structural cues around the head, shoulders, or feet. To tackle these challenges, we propose an Identity-Compensated Style Distillation (ICSD) network that enforces cross-modality style consistency and enhances the discriminative power of modality-invariant features. Specifically, ICSD comprises two core components: (1) a Style Knowledge Distillation (SKD) module, which integrates Style Discrepancy Reduction (SDR) and Identity Knowledge Compensation (IKC) to align modality styles while preserving identity-relevant semantics; (2) an Identity Discrimination Amplification (IDA) module, which captures and enhances subtle inter-class differences by refining identity-specific cues, thereby facilitating more accurate discrimination between different pedestrians. Extensive experiments on three public benchmarks-SYSU-MM01, RegDB, and LLCM-demonstrate that ICSD consistently outperforms state-of-the-art methods, validating the effectiveness and complementarity of its components.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"189 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147465039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MotionPrior: Exploring Efficient Learning of Motion Concepts for Few-shot Video Generation
Pub Date: 2026-03-16, DOI: 10.1109/tip.2026.3672374
Yaosi Hu, Chang Wen Chen
Diffusion-based text-to-image generation has achieved remarkable progress and realistic content generation performance, greatly promoting the development of text-to-video generation. Although equipped with powerful image diffusion models, video generation modeling still requires massive labeled data and a high training resource cost. Recent work has focused on cost-effective video generation in a one-shot or few-shot manner based on image diffusion models, with minimal demand for video data and computing resources. However, these video generation models only support the generation of a single motion pattern/concept. This raises an important question: can we improve generation freedom with a light training burden? In this paper, we explore a cost-effective video generation scheme for adaptive motion concepts by learning motion priors from a small set of video data. Specifically, we construct a learnable bank of motion concepts and propose a Dual-Semantic-guided Motion Attention module to locate the corresponding motion elements in the bank under the guidance of textual and visual semantics. The extracted motion elements are inserted into the video latents via a lightweight motion injection layer, which integrates motion semantics effectively with far fewer parameters than a conventional temporal attention layer. In addition, we introduce a temporal-aware noise prior and an inter-frame consistency constraint to strengthen the learning of temporal dependency and improve video smoothness. Extensive experiments validate that the proposed method can learn motion priors adaptively from a small set of training videos to generate smooth videos that involve either single or multiple motion concepts. The results demonstrate that the proposed scheme achieves superior performance compared to existing few-shot video generation methods and even some large-scale video generation models. More information and results are available at https://youncy-hu.github.io/motionprior/.
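A hedged sketch of the bank-retrieval idea described above: motion elements are read out of a learnable concept bank via attention, with a single semantic query standing in for the fused textual/visual guidance. This is not the released MotionPrior code, and all sizes are hypothetical.

```python
# Hedged sketch: attention-based retrieval of motion elements from a learnable
# concept bank. Not the MotionPrior implementation; dimensions are made up.
import torch
import torch.nn as nn

class MotionBankAttention(nn.Module):
    def __init__(self, n_concepts=8, dim=64):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(n_concepts, dim))  # learnable motion-concept bank
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, semantic_query):
        # semantic_query: (batch, dim) fused textual/visual semantics (toy stand-in)
        q = self.to_q(semantic_query)                               # (B, dim)
        k, v = self.to_k(self.bank), self.to_v(self.bank)           # (N, dim)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (B, N) concept weights
        return attn @ v                                             # (B, dim) retrieved motion element

module = MotionBankAttention()
motion = module(torch.randn(2, 64))
print(motion.shape)   # torch.Size([2, 64])
```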
{"title":"MotionPrior: Exploring Efficient Learning of Motion Concepts for Few-shot Video Generation.","authors":"Yaosi Hu,Chang Wen Chen","doi":"10.1109/tip.2026.3672374","DOIUrl":"https://doi.org/10.1109/tip.2026.3672374","url":null,"abstract":"The diffusion-based text-to-image generation has achieved remarkable progress and realistic content generation performance, greatly promoting the development in text-to-video generation. Although equipped with powerful image diffusion models, video generation modeling still requires massive labeled data and a high training resource cost. Recent, work has been focused on cost-effective video generation in a one-shot or few-shot manner based on the image diffusion model with minimum demand for video data and computing resources. However, these video generation models only support the generation of one single motion pattern/concept. This raises an important question: Can we improve generation freedom with a light training burden? In this paper, we explore a cost-effective video generation scheme for adaptive motion concepts by learning motion priors from a small set of video data. Specifically, we construct a learnable bank for motion concepts and propose the Dual-Semantic-guided Motion Attention module to locate the corresponding motion elements from the bank with the guidance of textual semantic and visual semantic. The extracted motion elements are inserted into video latents via lightweight motion injection layer, which is capable of integrating motion semantic effectively with much fewer parameters compared to the conventional temporal attention layer. In addition, we introduce a temporal-aware noise prior and an inter-frame consistency constraint to strengthen the learning of temporal dependency and improve video smoothness. Extensive experiments validate that the proposed method can learn motion priors adaptively from a small set of training videos to generate smooth videos that involve either single or multiple motion concepts. The results demonstrate that the proposed scheme achieves superior performance compared to existing few-shot video generation methods and even some large-scale video generation models. More information and results are available at https://youncy-hu.github.io/motionprior/.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"36 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147465040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UD-Gaussian: Uncertainty-Driven Gaussian Modeling for Occluded Person Re-identification
Pub Date: 2026-03-16, DOI: 10.1109/tip.2026.3672380
Yanping Li, Yizhang Liu, Hongyun Zhang, Cairong Zhao, Zhihua Wei, Duoqian Miao
Occluded person re-identification aims to address the identification challenges posed by pedestrians obscured by other individuals or objects. Existing methods often rely on incorporating pose or semantic information to improve model performance under occlusion. However, such information often depends on external models with inevitable cross-domain gaps, whose stability is limited in complex occlusion environments and which are prone to false results. In this paper, we propose a Transformer-based uncertainty-driven Gaussian model, termed UD-Gaussian. First, to enrich the detailed features of pedestrian images, a high-frequency enhancement module is introduced. The high-frequency components of the pedestrian image are extracted by the Discrete Haar Wavelet Transform, and the Top-K high-frequency patches are used to construct a graph Laplacian matrix for high-frequency graph attention, which is fused with features learned from self-attention to enhance the high-frequency feature representation. Given that the uncertainty in pedestrian feature learning induced by occlusion makes it challenging to obtain reliable and stable pedestrian features, we propose a probability distribution learning module. This module establishes a memory bank to build a Gaussian distribution for each pedestrian identity, and entropy is introduced as a loss function to encourage the model to generate more deterministic and relatively independent probability distributions, thereby enhancing the discriminative ability of the model across different pedestrian identities. The high-frequency enhancement module provides a solid foundation for the probability distribution learning module, alleviating uncertainty caused by the pedestrian images themselves. Experimental results on occluded and holistic person re-identification datasets demonstrate the superiority of the proposed method.
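The high-frequency path described above can be illustrated as follows: a one-level 2D Haar transform yields the detail coefficients, the Top-K highest-energy patches are selected, and a graph Laplacian is built from their pairwise similarities. This is only a sketch, not the UD-Gaussian implementation; the patch size, K, and the similarity kernel are assumptions.

```python
# Hedged sketch of the high-frequency pipeline: Haar detail coefficients,
# Top-K patch selection by energy, and a graph Laplacian over those patches.
# Not the authors' UD-Gaussian code; patch size, K, and the kernel are assumed.
import numpy as np

def haar_highfreq(img):
    """One-level 2D Haar transform; returns the stacked detail (high-frequency) bands."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    lh = (a + b - c - d) / 2.0    # row-difference details
    hl = (a - b + c - d) / 2.0    # column-difference details
    hh = (a - b - c + d) / 2.0    # diagonal details
    return np.stack([lh, hl, hh])

rng = np.random.default_rng(0)
img = rng.random((64, 32))                       # toy single-channel pedestrian crop
hf = haar_highfreq(img)                          # (3, 32, 16)

# Split the detail maps into 8x8 patches and keep the Top-K by high-frequency energy.
k, p = 4, 8
patches = hf.reshape(3, hf.shape[1] // p, p, hf.shape[2] // p, p).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(-1, 3 * p * p)         # (num_patches, features)
topk = patches[np.argsort((patches ** 2).sum(axis=1))[-k:]]

# Graph Laplacian L = D - W over the selected patches (Gaussian similarity kernel).
dist2 = ((topk[:, None, :] - topk[None, :, :]) ** 2).sum(-1)
W = np.exp(-dist2 / (dist2.mean() + 1e-8))
L = np.diag(W.sum(axis=1)) - W
print(hf.shape, topk.shape, L.shape)
```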
{"title":"UD-Gaussian: Uncertainty-Driven Gaussian Modeling for Occluded Person Re-identification.","authors":"Yanping Li,Yizhang Liu,Hongyun Zhang,Cairong Zhao,Zhihua Wei,Duoqian Miao","doi":"10.1109/tip.2026.3672380","DOIUrl":"https://doi.org/10.1109/tip.2026.3672380","url":null,"abstract":"Occluded person re-identification aims to address the identification challenges posed by pedestrians obscured by other individuals or objects. Existing methods often rely on incorporating pose or semantic information to improve model performance under occlusion. However, such information often depends on external models with inevitably cross-domain gaps, whose stability is limited in complex occlusion environments and prone to false results. In this paper, we propose a Transformer-based uncertainty-driven Gaussian model, termed as UD-Gaussian. Firstly, to enrich the detailed features of pedestrian images, a high-frequency enhancement module is introduced. The high-frequency components of the pedestrian image are extracted by Discrete Haar Wavelet Transform, and Top-K high-frequency patches are extracted to construct a graph Laplacian matrix to achieve high-frequency graph attention, which is fused with features learned from self-attention to enhance the high-frequency feature representation. Given the uncertainty in pedestrian feature learning induced by occlusion makes it challenging to obtain reliable and stable pedestrian features, we propose a probability distribution learning module. This module establishes a memory bank to build Gaussian distributions for each pedestrian identity and the entropy is introduced as a loss function to encourage the model to generate more deterministic and relatively independent probability distributions, thereby enhancing the discriminative ability of the model across different pedestrian identities. The high-frequency enhancement module provides a solid foundation for the probability distribution learning module, alleviating uncertainty caused by pedestrian images themselves. Experimental results on occluded and holistic person re-identification datasets demonstrate the superiority of the proposed method.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"9 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147465042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design
Pub Date: 2026-03-12, DOI: 10.1109/tip.2026.3671595
Bosen Lin, Feng Gao, Yanwei Yu, Junyu Dong, Qian Du
{"title":"Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design","authors":"Bosen Lin, Feng Gao, Yanwei Yu, Junyu Dong, Qian Du","doi":"10.1109/tip.2026.3671595","DOIUrl":"https://doi.org/10.1109/tip.2026.3671595","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"80 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147439805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IAMAgent: Towards An Interactive and Adaptive Multi-Agent System for Image Restoration
Pub Date: 2026-03-12, DOI: 10.1109/tip.2026.3671594
Yanyan Wei, Yilin Zhang, Huan Zheng, Jiahuan Ren, Xiaogang Xu, Zenglin Shi, Zhao Zhang, Meng Wang
Existing image restoration and enhancement (IRE) methods suffer from three fundamental limitations: 1) they present a high technical barrier, requiring expert knowledge and lacking intuitive natural language control; 2) they are inflexible and poorly adaptable, as models are typically designed for single, specific degradations and fail on complex or mixed real-world scenarios; 3) they lack interactivity and ignore subjectivity, operating as "black-box" tools that cannot incorporate human feedback or understand nuanced user intentions. To overcome these challenges, we pioneer a novel paradigm: a Multi-Agent System (MAS) for interactive and adaptive image restoration. We design and implement a prototype system, Interactive and Adaptive Multi-Agent System (IAMAgent), which orchestrates a team of specialized agents to collaboratively solve complex IRE tasks. At its core, a Manager Agent, driven by a Large Language Model, interprets user commands, devises strategies, and allocates sub-tasks. It directs a Perception Agent for degradation diagnosis, a suite of specialized Execution Agents that encapsulate various low-level vision models, and a Critique Agent for automated quality assessment. This collaborative framework enables an innovative, language-driven, and human-in-the-loop optimization process. Our work is the first to introduce the MAS paradigm to the IRE domain, transforming it from a collection of static tools into a dynamic, user-centric, and intelligent system. We demonstrate that IAMAgent not only significantly enhances restoration performance and adaptability but also bridges the critical gap between high-level human intention and low-level vision tasks.
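A minimal sketch of the manager/perception/execution/critique loop the abstract describes, with toy stand-ins for each agent; it is not the IAMAgent implementation, and the degradation labels, tool names, and quality threshold are hypothetical.

```python
# Hedged, minimal sketch of a manager-driven restoration loop with perception,
# execution, and critique agents. Not the IAMAgent code; all names are toy stand-ins.
def perception_agent(image):
    """Diagnose degradations; a real system would use learned detectors."""
    return ["low_light", "noise"]

def execution_agents(image, degradation):
    """Dispatch to a specialized restoration tool for the diagnosed degradation."""
    tools = {"low_light": lambda x: x + " +enhanced", "noise": lambda x: x + " +denoised"}
    return tools.get(degradation, lambda x: x)(image)

def critique_agent(image):
    """Score the current result; a real system would use an IQA model."""
    return 0.9 if "denoised" in image else 0.5

def manager_agent(image, user_request, quality_threshold=0.8, max_rounds=3):
    """Interpret the request, plan sub-tasks, and iterate until quality is acceptable."""
    for _ in range(max_rounds):
        for degradation in perception_agent(image):
            image = execution_agents(image, degradation)
        if critique_agent(image) >= quality_threshold:
            break
    return image

print(manager_agent("raw_image", "brighten this photo and remove the grain"))
```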
{"title":"IAMAgent: Towards An Interactive and Adaptive Multi-Agent System for Image Restoration.","authors":"Yanyan Wei,Yilin Zhang,Huan Zheng,Jiahuan Ren,Xiaogang Xu,Zenglin Shi,Zhao Zhang,Meng Wang","doi":"10.1109/tip.2026.3671594","DOIUrl":"https://doi.org/10.1109/tip.2026.3671594","url":null,"abstract":"Existing image restoration and enhancement (IRE) methods suffer from three fundamental limitations: 1) they present a high technical barrier, requiring expert knowledge and lacking intuitive natural language control; 2) they are inflexible and poorly adaptable, as models are typically designed for single, specific degradations and fail on complex or mixed real-world scenarios; 3) they lack interactivity and ignore subjectivity, operating as \"black-box\" tools that cannot incorporate human feedback or understand nuanced user intentions. To overcome these challenges, we pioneer a novel paradigm: a Multi-Agent System (MAS) for interactive and adaptive image restoration. We design and implement a prototype system, Interactive and Adaptive Multi-Agent System (IAMAgent), which orchestrates a team of specialized agents to collaboratively solve complex IRE tasks. At its core, a Manager Agent, driven by a Large Language Model, interprets user commands, devises strategies, and allocates sub-tasks. It directs a Perception Agent for degradation diagnosis, a suite of specialized Execution Agents that encapsulate various low-level vision models, and a Critique Agent for automated quality assessment. This collaborative framework enables an innovative, language-driven, and human-in-the-loop optimization process. Our work is the first to introduce the MAS paradigm to the IRE domain, transforming it from a collection of static tools into a dynamic, user-centric, and intelligent system. We demonstrate that IAMAgent not only significantly enhances restoration performance and adaptability but also bridges the critical gap between high-level human intention and low-level vision tasks.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"52 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147439233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}