With the development of real-time video conferencing, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which poses a new challenge to Computer Vision (CV) for communications. Moreover, individual CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the demands of multiple interactive tasks unless they are integrated. Meanwhile, benefiting from the rapid development of foundation models, task-oriented semantic communications can be applied to handle these tasks. We therefore propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the interactivity requirements of multimedia services. Firstly, at the transmitter, we perform causal understanding and spatiotemporal decoupling on interactive videos with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and Segment Anything Model 2 (SAM2) to accomplish video semantic segmentation. Secondly, during transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accounts for the asymmetric weights of semantic information in real-time video, achieving a low bit rate and high semantic fidelity. Thirdly, at the receiver, RTVCFM fuses the semantic segments across multiple dimensions using the Diffusion Model for Foreground Background Fusion (DMFBF) and then reconstructs the video streams. Finally, simulation results demonstrate that RTVCFM achieves a compression ratio as high as 95.6% while guaranteeing high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), showing that the reconstructed video closely matches the original video.
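The reported figures combine a rate metric with structural-similarity metrics. As a minimal, hedged sketch (not the authors' evaluation code), the compression ratio and per-frame SSIM could be computed as below; the byte counts and frame arrays are hypothetical stand-ins, and MS-SSIM would need an extra library such as pytorch-msssim.

```python
import numpy as np
from skimage.metrics import structural_similarity

def compression_ratio(original_bytes: int, transmitted_bytes: int) -> float:
    """Fraction of the original payload that was saved, e.g. 0.956 for 95.6%."""
    return 1.0 - transmitted_bytes / original_bytes

def frame_ssim(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """SSIM between two uint8 RGB frames of identical shape."""
    return structural_similarity(original, reconstructed, data_range=255, channel_axis=-1)

# Hypothetical usage with dummy frames standing in for decoded video frames.
orig = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
recon = orig.copy()  # a perfect reconstruction gives SSIM = 1.0
print(compression_ratio(original_bytes=1_000_000, transmitted_bytes=44_000))  # 0.956
print(frame_ssim(orig, recon))
```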
"Foundation Model Empowered Real-Time Video Conference With Semantic Communications," by Mingkai Chen, Wenbo Ma, Mujian Zeng, Xiaoming He, Jian Xiong, Lei Wang, Anwer Al-Dulaimi, and Shahid Mumtaz. IEEE Transactions on Image Processing, vol. 35, pp. 1740-1755. Pub Date: 2026-02-06. DOI: 10.1109/TIP.2026.3659719.
Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3658010
Hao Yang;Yue Sun;Hui Xie;Lina Zhao;Chi Kin Lam;Qiang Zhao;Xiangyu Xiong;Kunyan Cai;Behdad Dashtbozorg;Chenggang Yan;Tao Tan
The synthesis of computed tomography (CT) images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between regions, traditional approaches often require a separate model to be developed and used for each region. In this paper, we propose a unified, prompt-driven model that dynamically adapts to different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three datasets of anatomical regions demonstrate that our model generates clearer and more anatomically detailed CT images than other state-of-the-art translation models. The results of the dosimetric analysis also indicate that our proposed model generates images with dose distributions more closely aligned to those of the real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model; the repository is publicly accessible at https://github.com/yhyumi123/RSAM
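The abstract names a region-aware vector and a dynamic gating factor but gives no equations; the sketch below is only an illustrative guess at how such gating could modulate an attention output. The module name, shapes, and the gating formula are assumptions, not the published RSAM design.

```python
import torch
import torch.nn as nn

class RegionGatedAttention(nn.Module):
    """Illustrative region-conditioned attention with a learned gating factor."""
    def __init__(self, dim: int, num_regions: int, heads: int = 4):
        super().__init__()
        self.region_embed = nn.Embedding(num_regions, dim)          # "region-aware vector"
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # "dynamic gating factor"

    def forward(self, tokens: torch.Tensor, region_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) MR feature tokens; region_id: (B,) anatomical region index.
        r = self.region_embed(region_id).unsqueeze(1)   # (B, 1, dim)
        cond = tokens + r                               # condition tokens on the region
        attn_out, _ = self.attn(cond, cond, cond)
        g = self.gate(r)                                # (B, 1, 1) scalar gate per sample
        return tokens + g * attn_out                    # gated residual update

# Hypothetical usage: 3 regions (e.g. head, thorax, pelvis), 196 tokens of width 256.
block = RegionGatedAttention(dim=256, num_regions=3)
out = block(torch.randn(2, 196, 256), torch.tensor([0, 2]))
print(out.shape)  # torch.Size([2, 196, 256])
```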
"Anatomy-Aware MR-Imaging-Only Radiotherapy," IEEE Transactions on Image Processing, vol. 35, pp. 1680-1695. DOI: 10.1109/TIP.2026.3658010.
Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3659302
Liang Wu;Jianjun Wang;Wei-Shi Zheng;Guangming Shi
Tensor robust principal component analysis (TRPCA), a popular linear low-rank method, has been widely applied to various visual tasks. Its low-rank prior is derived from a linear latent variable model. However, for nonlinear tensor data with rich information, the nonlinear structures may violate the low-rankness assumption and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, this paper first establishes the general paradigm of the nonlinear-tensor-plus-sparse-tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA). To tackle the TRKPCA problem efficiently, two novel nonconvex regularizers are designed: the kernelized tensor Schatten-$p$ norm (KTSPN) and a generalized nonconvex regularization. The former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), while the latter ensures sparser structural coding, guaranteeing more robust separation results. By integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show the higher competitiveness of our method compared with other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729_DNTRKPCA_code
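For orientation, the classical TRPCA decomposition that this abstract generalizes can be written as a low-rank-plus-sparse model; the kernelized Schatten-$p$ variant below is only a schematic reading of the abstract, and the feature map $\Phi$, the penalty $g(\cdot)$, and the exact norm definitions are assumptions rather than the paper's precise formulation.

```latex
% Classical TRPCA: split an observed tensor X into a low-rank part L and a sparse part S.
\min_{\mathcal{L},\,\mathcal{S}} \; \|\mathcal{L}\|_{*} + \lambda \|\mathcal{S}\|_{1}
\quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S}.

% Schematic TRKPCA reading: low-rankness is imposed after a nonlinear feature map \Phi
% (kernelized tensor Schatten-p norm), and a generalized nonconvex penalty g(.) replaces
% the l1 norm on the sparse part.
\min_{\mathcal{L},\,\mathcal{S}} \; \|\Phi(\mathcal{L})\|_{S_p}^{p} + \lambda\, g(\mathcal{S})
\quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S}, \qquad 0 < p \le 1.
```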
"Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications," IEEE Transactions on Image Processing, vol. 35, pp. 1711-1726. DOI: 10.1109/TIP.2026.3659302.
Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3659733
Wang Xu;Yeqiang Qian;Yun-Fu Liu;Lei Tuo;Huiyong Chen;Ming Yang
In recent years, with the development of autonomous driving, 3D reconstruction of unbounded large-scale scenes has attracted researchers' attention. Existing methods achieve outstanding reconstruction accuracy in autonomous driving scenes, but most of them lack the ability to edit the scene. Although some methods can edit scenes, they depend heavily on manually annotated 3D bounding boxes, which limits their scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts and handles them with separate branches, modeling the dynamic foreground objects and the static background individually during training. With this decoupled modeling framework, we can accurately edit any dynamic target (e.g., removing or adding dynamic objects) while improving the reconstruction quality of autonomous driving scenes, especially the dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate the 3D reconstruction performance for both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which more convincingly demonstrate the performance and robustness of the proposed model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor
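The editing capability described above rests on keeping dynamic and static Gaussians in separate containers and composing them only at render time; the sketch below is a minimal illustration of that idea, not the DrivingEditor implementation, and the Gaussian attribute layout is hypothetical.

```python
import numpy as np

class DecoupledScene:
    """Toy container that keeps static background and per-object dynamic Gaussians apart."""
    def __init__(self, static_gaussians: np.ndarray):
        self.static = static_gaussians   # (N, D) packed Gaussian parameters
        self.dynamic = {}                # object id -> (M, D) parameters

    def add_object(self, obj_id: str, gaussians: np.ndarray) -> None:
        self.dynamic[obj_id] = gaussians          # editing: insert a dynamic object

    def remove_object(self, obj_id: str) -> None:
        self.dynamic.pop(obj_id, None)            # editing: delete a dynamic object

    def compose(self) -> np.ndarray:
        """Concatenate background and remaining foreground Gaussians for rendering."""
        return np.concatenate([self.static] + list(self.dynamic.values()), axis=0)

# Hypothetical usage: 10k background Gaussians plus a 500-Gaussian car that is then removed.
scene = DecoupledScene(np.random.randn(10_000, 14))
scene.add_object("car_01", np.random.randn(500, 14))
scene.remove_object("car_01")
print(scene.compose().shape)  # (10000, 14): the edited scene renders without the car
```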
"DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes," IEEE Transactions on Image Processing, vol. 35, pp. 1696-1710. DOI: 10.1109/TIP.2026.3659733.
Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay;Eli Schwartz;Raja Giryes
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent code to a degraded (e.g., noisy) image, but in the process it learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent code with Fourier features (positional encoding). We empirically demonstrate that, thanks to the properties of Fourier features, the convolution layers in DIP can be replaced with simple pixel-level MLPs. We also prove that the two are equivalent in the case of linear networks. We name our scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image priors and certain INR approaches face stability challenges. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP
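As a hedged sketch of swapping the random latent for a positional encoding, the snippet below maps each pixel coordinate to random Fourier features and feeds them to a small per-pixel MLP; the frequency count, bandwidth, and MLP width are illustrative choices, not the published PIP configuration.

```python
import torch
import torch.nn as nn

# Fixed random frequency matrix shared by every call (2 input coords -> 64 frequencies).
NUM_FREQS, SIGMA = 64, 10.0
FREQS = torch.randn(2, NUM_FREQS) * SIGMA

def fourier_features(coords: torch.Tensor) -> torch.Tensor:
    """Map (N, 2) pixel coordinates in [0, 1] to (N, 2*NUM_FREQS) Fourier features."""
    proj = 2 * torch.pi * coords @ FREQS
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# A pixel-level MLP: each encoded coordinate is mapped independently to an RGB value.
pixel_mlp = nn.Sequential(
    nn.Linear(2 * NUM_FREQS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Sigmoid(),
)

# Hypothetical usage: encode a 64x64 coordinate grid and predict an image.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 64), torch.linspace(0, 1, 64), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (4096, 2)
rgb = pixel_mlp(fourier_features(coords)).reshape(64, 64, 3)
print(rgb.shape)  # torch.Size([64, 64, 3])
```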
"Positional Encoding Image Prior," IEEE Transactions on Image Processing, vol. 35, pp. 2110-2121. DOI: 10.1109/TIP.2026.3653206.
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659325
Shuping Zhao;Lunke Fei;Tingting Chai;Jie Wen;Bob Zhang;Jinrong Cui
Unrestrained palmprint recognition is a comprehensive identity authentication technology that performs personal authentication based on palmprint images captured in uncontrolled environments, e.g., by smartphone cameras, in surveillance footage, or in near-infrared scenarios. However, unrestrained palmprint recognition faces significant challenges due to the variability in image quality, lighting conditions, and hand poses present in such settings. We observe that many existing methods utilize the subspace structure as a prior, where the block diagonal property of the data has been proven. In this paper, we consider a unified learning model that guarantees a consensus block diagonal property across all views, named high-confident block diagonal analysis for multi-view palmprint recognition (HCBDA_MPR). In particular, we propose a multi-view block diagonal regularizer that guides all views toward learning a consensus block diagonal structure. In this manner, the main discriminant features of each view are preserved while a strict block diagonal structure is learned across all views. Experimental results on a number of real-world unrestrained palmprint databases demonstrate the superiority of the proposed method, which achieves the highest recognition accuracies in comparison with other state-of-the-art related methods.
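The abstract does not state the regularizer explicitly. For context, a widely used $k$-block diagonal regularizer on a self-representation matrix $Z$ sums the $k$ smallest eigenvalues of the Laplacian of its symmetrized affinity, and a consensus multi-view objective could, schematically, couple one such term across views; the formulas below follow that standard construction and are an assumption, not the paper's exact model.

```latex
% k-block diagonal regularizer on a representation matrix Z: symmetrized affinity A,
% its Laplacian L_A, and eigenvalues sorted \lambda_1 >= ... >= \lambda_n (k smallest summed).
A = \tfrac{1}{2}\bigl(|Z| + |Z|^{\top}\bigr), \qquad
L_A = \mathrm{Diag}(A\mathbf{1}) - A, \qquad
\|Z\|_{[k]} = \sum_{i=n-k+1}^{n} \lambda_i(L_A).

% Schematic multi-view objective with a consensus representation Z shared by all V views:
\min_{Z,\,\{E^{(v)}\}} \; \sum_{v=1}^{V} \Bigl( \bigl\| X^{(v)} - X^{(v)} Z - E^{(v)} \bigr\|_F^2
 + \alpha \bigl\| E^{(v)} \bigr\|_1 \Bigr) + \beta\, \|Z\|_{[k]}.
```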
"High-Confident Block Diagonal Analysis for Multi-View Palmprint Recognition in Unrestrained Environment," IEEE Transactions on Image Processing, vol. 35, pp. 1621-1635. DOI: 10.1109/TIP.2026.3659325.
Recent advancements in industrial anomaly detection (AD) have demonstrated that incorporating a small number of anomalous samples during training can significantly enhance accuracy. However, this improvement often comes at the cost of extensive annotation efforts, which are impractical for many real-world applications. In this paper, we introduce a novel framework, "Weakly-supervised RESidual Transformer" (WeakREST), designed to achieve high anomaly detection accuracy while minimizing the reliance on manual annotations. First, we reformulate the pixel-wise anomaly localization task into a block-wise classification problem. Second, we introduce a residual-based feature representation called "Positional Fast Anomaly Residuals" (PosFAR) which captures anomalous patterns more effectively. To leverage this feature, we adapt the Swin Transformer for enhanced anomaly detection and localization. Additionally, we propose a weak annotation approach utilizing bounding boxes and image tags to define anomalous regions. This approach establishes a semi-supervised learning context that reduces the dependency on precise pixel-level labels. To further improve the learning process, we develop a novel ResMixMatch algorithm, capable of handling the interplay between weak labels and residual-based representations. On the benchmark dataset MVTec-AD, our method achieves an Average Precision (AP) of 83.0%, surpassing the previous best result of 82.7% in the unsupervised setting. In the supervised AD setting, WeakREST attains an AP of 87.6%, outperforming the previous best of 86.0%. Notably, even when using weaker annotations such as bounding boxes, WeakREST exceeds the performance of leading methods relying on pixel-wise supervision, achieving an AP of 87.1% compared to the prior best of 86.0% on MVTec-AD. This superior performance is consistently replicated across other well-established AD datasets, including MVTec 3D, KSDD2 and Real-IAD. Code is available at: https://github.com/BeJane/Semi_REST
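The abstract describes residual features computed against normal references and block-wise (patch-level) classification; the sketch below is only a generic illustration of residual-to-nearest-normal features over patch embeddings, with the memory bank, distance choice, and shapes being assumptions rather than the PosFAR definition.

```python
import torch

def residual_features(patch_feats: torch.Tensor, normal_bank: torch.Tensor) -> torch.Tensor:
    """For each patch feature, subtract its nearest feature from a bank of normal patches.

    patch_feats: (N, D) features of the test image's patches.
    normal_bank: (M, D) features collected from anomaly-free training images.
    Returns (N, D) residuals; large residual norms hint at anomalous patches.
    """
    dists = torch.cdist(patch_feats, normal_bank)   # (N, M) pairwise distances
    nearest = normal_bank[dists.argmin(dim=1)]      # (N, D) closest normal feature
    return patch_feats - nearest

# Hypothetical usage with random stand-ins for transformer patch embeddings.
test_patches = torch.randn(196, 768)
bank = torch.randn(5000, 768)
res = residual_features(test_patches, bank)
block_scores = res.norm(dim=1)                      # one score per patch/block
print(block_scores.shape)  # torch.Size([196])
```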
"Accurate Industrial Anomaly Detection and Localization Using Weakly-Supervised Residual Transformers," by Hanxi Li, Jingqi Wu, Deyin Liu, Lin Yuanbo Wu, Hao Chen, and Chunhua Shen. IEEE Transactions on Image Processing, vol. 35, pp. 1551-1566. Pub Date: 2026-02-04. DOI: 10.1109/TIP.2026.3659337.
Ultrasonic image anomaly detection faces significant challenges due to limited labeled data, strong structural and random noise, and highly diverse defect manifestations. To overcome these obstacles, we introduce UltraChip, a new large-scale C-scan benchmark containing about 8,000 real-world images from various chip packaging types, each meticulously annotated with pixel-level masks for cracks, holes, and layers. Building on this resource, we present FSGM-Net, a fully unsupervised framework tailored for anomaly detection. FSGM-Net leverages an adaptive Frequency-Spatial feature filtering mechanism: a learnable FFT-Spatial patch filter first suppresses noise and dynamically assigns normality weights to Vision Transformer (ViT) patch features. Subsequently, an Adaptive Gaussian Mixture Model (Ada-GMM) captures the distribution of normal features and guides a deep-shallow multi-scale interaction decoder for accurate, pixel-level anomaly inference. In addition, we propose a filter loss that enforces encoder-filter consistency and entropy-based sparse gating, together with a distributional loss that encourages both feature reconstruction and confident Gaussian mixture modeling. Extensive experiments demonstrate that FSGM-Net not only achieves state-of-the-art results on UltraChip but also exhibits superior cross-domain generalization to MVTec-AD and VisA, while supporting real-time inference on a single GPU. Together, the dataset and framework advance robust, annotation-free ultrasonic non-destructive testing (NDT) in practical applications. The UltraChip dataset can be obtained via https://iiplab.net/ultrachip/
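To make the Gaussian-mixture stage concrete, the hedged sketch below fits a GMM to normal patch features and scores test patches by negative log-likelihood; it uses scikit-learn's GaussianMixture as a generic stand-in, not the paper's Ada-GMM, and the component count and feature dimensions are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-ins for filtered ViT patch features (rows = patches, cols = dims).
normal_feats = np.random.randn(20_000, 64)   # features from anomaly-free training images
test_feats = np.random.randn(196, 64)        # features from one test image

# Fit a mixture to the distribution of normal features.
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(normal_feats)

# Anomaly score per patch: the lower the likelihood under the normal model, the higher the score.
scores = -gmm.score_samples(test_feats)      # (196,) negative log-likelihoods
anomaly_map = scores.reshape(14, 14)         # patch grid, e.g. 14x14 for a 224x224 ViT
print(anomaly_map.max())
```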
"Improving Unsupervised Ultrasonic Image Anomaly Detection via Frequency-Spatial Feature Filtering and Gaussian Mixture Modeling," by Wenjing Zhang, Ke Lu, Jinbao Wang, Hao Liang, Can Gao, and Jian Xue. IEEE Transactions on Image Processing, vol. 35, pp. 1567-1581. Pub Date: 2026-02-04. DOI: 10.1109/TIP.2026.3659292.
Contrastive learning facilitates the acquisition of informative skeleton representations for unsupervised action recognition by leveraging effective positive and negative sample pairs. However, most existing methods construct these pairs through weak or strong data augmentations, which typically rely on random appearance alterations of skeletons. While such augmentations are somewhat effective, they introduce semantic variations only indirectly and face two inherent limitations. First, simply modifying the appearance of skeletons often fails to reflect meaningful semantic variations. Second, random perturbations can unintentionally blur the boundary between positive and negative pairs, weakening the contrastive objective. To address these challenges, we propose an attack-driven augmentation framework that explicitly introduces semantic-level perturbations. This approach facilitates the generation of hard positives while guiding the model to mine more informative hard negatives. Building on this idea, we present Attack-Augmented Mixing-Contrastive Skeletal Representation Learning (A2MC), a novel framework that focuses on contrasting hard positive and hard negative samples for more robust representation learning. Within A2MC, we design an Attack-Augmentation (Att-Aug) module that integrates both targeted (attack-based) and untargeted (augmentation-based) perturbations to generate informative hard positive samples. In parallel, we propose the Positive-Negative Mixer (PNM), which blends hard positive and negative features to synthesize challenging hard negatives. These are then used to update a mixed memory bank for more effective contrastive learning. Comprehensive evaluations across three public benchmarks demonstrate that our approach, termed A2MC, achieves performance on par with or exceeding existing state-of-the-art methods.
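The Positive-Negative Mixer amounts to interpolating positive and negative embeddings to synthesize harder negatives; the sketch below shows a generic version of that mixing inside an InfoNCE-style loss, and the mixing coefficient, normalization, and loss details are assumptions, not the published PNM.

```python
import torch
import torch.nn.functional as F

def mix_hard_negatives(pos: torch.Tensor, neg: torch.Tensor, lam: float = 0.3) -> torch.Tensor:
    """Blend a positive embedding into each negative to pull it closer to the anchor."""
    mixed = lam * pos.unsqueeze(0) + (1.0 - lam) * neg   # (K, D)
    return F.normalize(mixed, dim=-1)

def info_nce(anchor: torch.Tensor, pos: torch.Tensor, negs: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE with one positive (index 0) and K negatives per anchor."""
    logits = torch.cat([pos.unsqueeze(0), negs], dim=0) @ anchor / tau   # (K+1,)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Hypothetical usage with unit-norm skeleton embeddings of width 128.
anchor = F.normalize(torch.randn(128), dim=-1)
positive = F.normalize(torch.randn(128), dim=-1)
negatives = F.normalize(torch.randn(256, 128), dim=-1)
hard_negs = mix_hard_negatives(positive, negatives)
loss = info_nce(anchor, positive, hard_negs)
print(loss.item())
```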
"Attack-Augmented Mixing-Contrastive Skeletal Representation Learning," by Binqian Xu, Xiangbo Shu, Jiachao Zhang, Rui Yan, and Guo-Sen Xie. IEEE Transactions on Image Processing, vol. 35, pp. 1521-1534. Pub Date: 2026-02-04. DOI: 10.1109/TIP.2026.3659331.
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659334
Jonghyuk Park;Jae-Young Sim
Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods for SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then used complementarily to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.
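A minimal sketch of the patch-level Mixture-of-Experts routing idea is given below: a softmax gate scores each patch and the expert outputs are combined by those weights. The number of experts, the gating network, and the dense (rather than top-k) combination are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PatchMoE(nn.Module):
    """Dense per-patch Mixture-of-Experts: every expert runs, outputs are gate-weighted."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) features of image patches with varying reflection strength.
        weights = self.gate(patches).softmax(dim=-1)                 # (B, N, E) routing weights
        outs = torch.stack([e(patches) for e in self.experts], -1)   # (B, N, dim, E)
        return (outs * weights.unsqueeze(2)).sum(dim=-1)             # gate-weighted combination

# Hypothetical usage on 196 patch tokens of width 256.
moe = PatchMoE(dim=256, num_experts=4)
print(moe(torch.randn(2, 196, 256)).shape)  # torch.Size([2, 196, 256])
```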
"Complementary Mixture-of-Experts and Complementary Cross-Attention for Single Image Reflection Separation in the Wild," IEEE Transactions on Image Processing, vol. 35, pp. 1607-1620. DOI: 10.1109/TIP.2026.3659334.