
IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society): Latest Publications

Foundation Model Empowered Real-Time Video Conference With Semantic Communications
IF 13.7 Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659719
Mingkai Chen;Wenbo Ma;Mujian Zeng;Xiaoming He;Jian Xiong;Lei Wang;Anwer Al-Dulaimi;Shahid Mumtaz
With the development of real-time video conferencing, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which poses a new challenge for Computer Vision (CV) in communications. Moreover, many CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the demands of multiple interactive tasks without integration. Meanwhile, with the rapid development of foundation models, we apply task-oriented semantic communications to address these demands. We therefore propose a novel framework, Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the interactivity requirements of multimedia services. First, at the transmitter, we perform causal understanding and spatiotemporal decoupling on interactive videos, using the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and Segment Anything Model 2 (SAM2), to accomplish video semantic segmentation. Second, during transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accommodates the weights of asymmetric semantic information in real-time video, achieving a low bit rate and high semantic fidelity. Third, at the receiver, RTVCFM fuses the semantic segments across multiple dimensions using the Diffusion Model for Foreground Background Fusion (DMFBF) and then reconstructs the video streams. Finally, simulation results demonstrate that RTVCFM achieves a compression ratio as high as 95.6% while maintaining high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), showing that the reconstructed video closely resembles the original.
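The fidelity figures quoted above are standard full-reference metrics. As a rough reproduction aid (not the paper's evaluation code), the sketch below computes SSIM with scikit-image and approximates MS-SSIM as a weighted product of SSIM scores over a five-level dyadic pyramid with the conventional weights; using full SSIM at every scale, instead of contrast-structure terms at the coarse scales, is a simplifying assumption.

```python
# Hedged sketch: reproducing SSIM / an approximate MS-SSIM for a reconstructed
# frame. Assumes scikit-image >= 0.19 and float RGB images in [0, 1].
import numpy as np
from skimage.metrics import structural_similarity as ssim
from skimage.transform import rescale

MS_WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]  # conventional MS-SSIM weights

def frame_fidelity(original: np.ndarray, reconstructed: np.ndarray) -> tuple[float, float]:
    """Return (ssim, approximate ms-ssim) for two float RGB images in [0, 1]."""
    single = ssim(original, reconstructed, data_range=1.0, channel_axis=-1)
    score, a, b = 1.0, original, reconstructed
    for i, w in enumerate(MS_WEIGHTS):
        score *= max(ssim(a, b, data_range=1.0, channel_axis=-1), 1e-8) ** w
        if i < len(MS_WEIGHTS) - 1:  # no downsampling needed after the coarsest scale
            a = rescale(a, 0.5, channel_axis=-1, anti_aliasing=True)
            b = rescale(b, 0.5, channel_axis=-1, anti_aliasing=True)
    return single, score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((256, 256, 3))
    y = np.clip(x + 0.01 * rng.standard_normal(x.shape), 0.0, 1.0)
    print(frame_fidelity(x, y))  # both scores close to 1 for a mild perturbation
```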
Vol. 35, pp. 1740-1755
Citations: 0
Anatomy-Aware MR-Imaging-Only Radiotherapy
IF 13.7 Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3658010
Hao Yang;Yue Sun;Hui Xie;Lina Zhao;Chi Kin Lam;Qiang Zhao;Xiangyu Xiong;Kunyan Cai;Behdad Dashtbozorg;Chenggang Yan;Tao Tan
The synthesis of computed tomography (CT) images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between regions, traditional approaches often require a separate model to be developed and deployed for each region. In this paper, we propose a unified model driven by prompts that dynamically adapts to different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, comprising a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on datasets of three anatomical parts demonstrate that our model generates clearer and more anatomically detailed CT images than other state-of-the-art translation models. Dosimetric analysis also indicates that the proposed model generates images whose dose distributions align more closely with those of real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model; the repository is publicly accessible at: https://github.com/yhyumi123/RSAM
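The abstract names the two components of the region-specific attention mechanism without detailing them. The sketch below is a minimal PyTorch illustration of the general idea, assuming a learned per-region embedding (the region-aware vector) that yields a sigmoid gating factor over feature channels; module names and shapes are our own, not the released RSAM code.

```python
# Hedged sketch of region-aware gating: a per-region embedding modulates
# feature maps through a dynamic gate. Illustrative assumption, not the
# paper's exact RSAM module.
import torch
import torch.nn as nn

class RegionGate(nn.Module):
    def __init__(self, num_regions: int, channels: int):
        super().__init__()
        self.region_embed = nn.Embedding(num_regions, channels)  # region-aware vectors
        self.to_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feats: torch.Tensor, region_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); region_id: (B,) integer anatomy label per sample
        v = self.region_embed(region_id)          # (B, C)
        gate = self.to_gate(v)[:, :, None, None]  # dynamic gating factor in (0, 1)
        return feats * gate                       # channel-wise modulation

x = torch.randn(2, 64, 32, 32)
gate = RegionGate(num_regions=3, channels=64)
print(gate(x, torch.tensor([0, 2])).shape)  # torch.Size([2, 64, 32, 32])
```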
Vol. 35, pp. 1680-1695
Citations: 0
Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications
IF 13.7 Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659302
Liang Wu;Jianjun Wang;Wei-Shi Zheng;Guangming Shi
Tensor robust principal component analysis (TRPCA), a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical formulation of its low-rank prior is derived from the linear latent variable model. However, for nonlinear tensor data with rich information, the nonlinear structures may violate the low-rankness assumption and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, this paper first establishes the general paradigm of the nonlinear-tensor-plus-sparse-tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA). To tackle the TRKPCA problem efficiently, two novel nonconvex regularizers are designed: the kernelized tensor Schatten-$p$ norm (KTSPN) and a generalized nonconvex regularization. The former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), while the latter enforces sparser structural coding, guaranteeing more robust separation results. Integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show that our method is more competitive than other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729_DNTRKPCA_code
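The abstract lists the ingredients but not the optimization model itself; under those ingredients, a schematic TRKPCA objective could take the form

$$\min_{\mathcal{L},\,\mathcal{S}}\ \|\phi(\mathcal{L})\|_{S_p}^{p} + \lambda\, g(\mathcal{S}) \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S},$$

where $\mathcal{X}$ is the observed tensor, $\phi(\cdot)$ the kernel-induced feature map so that $\|\phi(\mathcal{L})\|_{S_p}$ is the kernelized tensor Schatten-$p$ norm (KTSPN, $0 < p < 1$), and $g(\cdot)$ a generalized nonconvex sparsity regularizer on the outlier tensor $\mathcal{S}$; ADMM then alternates over $\mathcal{L}$, $\mathcal{S}$, and the dual variable of the coupling constraint. This is an illustrative form inferred from the abstract, not the paper's exact formulation.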
Vol. 35, pp. 1711-1726
Citations: 0
DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes
IF 13.7 Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659733
Wang Xu;Yeqiang Qian;Yun-Fu Liu;Lei Tuo;Huiyong Chen;Ming Yang
In recent years, with the development of autonomous driving, 3D reconstruction of unbounded large-scale scenes has attracted researchers' attention. Existing methods achieve outstanding reconstruction accuracy in autonomous driving scenes, but most lack the ability to edit scenes. Although some methods can edit scenarios, they depend heavily on manually annotated 3D bounding boxes, which limits their scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts handled by separate branches, individually modeling the dynamic foreground objects and the static background during training. With this decoupled scene-modeling framework, we can accurately edit any dynamic target (e.g., removing or adding dynamic objects) while improving the reconstruction quality of autonomous driving scenes, especially of dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate strong 3D reconstruction performance for both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which further demonstrate the performance and robustness of our model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor
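Because foreground objects and the static background live in separate branches, editing reduces to set operations over Gaussian collections before compositing and rendering. The plain-Python sketch below illustrates that interface only; class and field names are hypothetical, not DrivingEditor's actual API.

```python
# Hedged sketch of the editing interface a decoupled representation enables:
# removal drops an object's Gaussians, insertion registers new ones, and the
# renderer composites background plus remaining foreground. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class GaussianSet:
    name: str
    means: list = field(default_factory=list)  # stand-in for per-Gaussian parameters

@dataclass
class DecoupledScene:
    background: GaussianSet
    foreground: dict  # object id -> that object's GaussianSet

    def remove(self, obj_id: str) -> None:
        self.foreground.pop(obj_id, None)    # edit: delete a dynamic object

    def add(self, obj: GaussianSet) -> None:
        self.foreground[obj.name] = obj      # edit: insert a new dynamic object

    def renderable(self) -> list:
        # composite: static background plus whatever foreground objects remain
        return [self.background, *self.foreground.values()]

scene = DecoupledScene(GaussianSet("road"), {"car_01": GaussianSet("car_01")})
scene.remove("car_01")
scene.add(GaussianSet("truck_07"))
print([g.name for g in scene.renderable()])  # ['road', 'truck_07']
```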
Vol. 35, pp. 1696-1710
Citations: 0
Positional Encoding Image Prior
IF 13.7 Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay;Eli Schwartz;Raja Giryes
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g., noisy) image, but in the process learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent input with Fourier features (positional encoding). We empirically demonstrate that, thanks to the properties of Fourier features, the convolution layers in DIP can be replaced with simple pixel-level MLPs. We also prove that the two are equivalent in the case of linear networks. We name our scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP extends easily to videos, an area where methods based on image priors and certain INR approaches face stability challenges. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP
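The Fourier-feature (positional encoding) input that replaces DIP's random latent can be built in a few lines. The sketch below follows the standard random Fourier feature construction; the frequency count and scale are chosen for illustration rather than taken from the paper.

```python
# Hedged sketch: lift pixel coordinates to random Fourier features, which a
# pixel-level MLP can then map to RGB values. Hyperparameters are illustrative.
import torch

def fourier_features(h: int, w: int, num_freqs: int = 64, scale: float = 10.0) -> torch.Tensor:
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij"
    )
    coords = torch.stack([ys, xs], dim=-1)        # (H, W, 2) normalized coordinates
    B = torch.randn(2, num_freqs) * scale         # random frequency matrix
    proj = 2 * torch.pi * coords @ B              # (H, W, num_freqs)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (H, W, 2*num_freqs)

feats = fourier_features(128, 128)
print(feats.shape)  # torch.Size([128, 128, 128])
# A pixel-wise MLP then maps each 128-dim feature vector to one output pixel.
```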
Vol. 35, pp. 2110-2121
Citations: 0
High-Confident Block Diagonal Analysis for Multi-View Palmprint Recognition in Unrestrained Environment
IF 13.7 Pub Date: 2026-02-04 DOI: 10.1109/TIP.2026.3659325
Shuping Zhao;Lunke Fei;Tingting Chai;Jie Wen;Bob Zhang;Jinrong Cui
Unrestrained palmprint recognition refers to a comprehensive identity-authentication technology that performs personal authentication based on palmprint images captured in uncontrolled environments, e.g., by smartphone cameras, surveillance footage, or near-infrared imaging. However, unrestrained palmprint recognition faces significant challenges due to the variability of image quality, lighting conditions, and hand poses in such settings. We observed that many existing methods use the subspace structure as a prior, where the block diagonal property of the data has been proved. In this paper, we consider a unified learning model that guarantees a consensus block diagonal property across all views, named high-confident block diagonal analysis for multi-view palmprint recognition (HCBDA_MPR). In particular, we propose a multi-view block diagonal regularizer that guides all views to learn a consensus block diagonal structure. In this manner, the main discriminative features of each view are preserved while a strict block diagonal structure is learned across all views. Experimental results on a number of real-world unrestrained palmprint databases prove the superiority of the proposed method, which achieves the highest recognition accuracies in comparison with other state-of-the-art related methods.
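The abstract does not state HCBDA_MPR's objective. A schematic multi-view model assembled from standard ingredients (per-view self-expression, consensus alignment, and the usual $k$-block diagonal regularizer) might read

$$\min_{\{Z^{(v)}\},\, Z^{\ast}} \sum_{v=1}^{V}\Big(\|X^{(v)} - X^{(v)}Z^{(v)}\|_F^2 + \alpha\|Z^{(v)} - Z^{\ast}\|_F^2\Big) + \beta\sum_{i=n-k+1}^{n}\lambda_i\big(L_{Z^{\ast}}\big),$$

where $X^{(v)}$ is the feature matrix of view $v$, $Z^{(v)}$ its self-expression coefficients, $Z^{\ast}$ the consensus affinity, and $L_{Z^{\ast}} = \mathrm{Diag}(Z^{\ast}\mathbf{1}) - Z^{\ast}$ its Laplacian. The last term, the sum of the $k$ smallest Laplacian eigenvalues, vanishes exactly when a symmetric, nonnegative affinity is $k$-block diagonal, which is what pushes all views toward a shared block diagonal structure. This is an illustrative formulation, not necessarily the paper's exact model.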
Vol. 35, pp. 1621-1635
Citations: 0
Accurate Industrial Anomaly Detection and Localization Using Weakly-Supervised Residual Transformers
IF 13.7 Pub Date: 2026-02-04 DOI: 10.1109/TIP.2026.3659337
Hanxi Li;Jingqi Wu;Deyin Liu;Lin Yuanbo Wu;Hao Chen;Chunhua Shen
Recent advancements in industrial anomaly detection (AD) have demonstrated that incorporating a small number of anomalous samples during training can significantly enhance accuracy. However, this improvement often comes at the cost of extensive annotation efforts, which are impractical for many real-world applications. In this paper, we introduce a novel framework, "Weakly-supervised RESidual Transformer" (WeakREST), designed to achieve high anomaly detection accuracy while minimizing the reliance on manual annotations. First, we reformulate the pixel-wise anomaly localization task as a block-wise classification problem. Second, we introduce a residual-based feature representation called "Positional Fast Anomaly Residuals" (PosFAR), which captures anomalous patterns more effectively. To leverage this feature, we adapt the Swin Transformer for enhanced anomaly detection and localization. Additionally, we propose a weak annotation approach utilizing bounding boxes and image tags to define anomalous regions. This approach establishes a semi-supervised learning setting that reduces the dependency on precise pixel-level labels. To further improve the learning process, we develop a novel ResMixMatch algorithm capable of handling the interplay between weak labels and residual-based representations. On the benchmark dataset MVTec-AD, our method achieves an Average Precision (AP) of 83.0%, surpassing the previous best result of 82.7% in the unsupervised setting. In the supervised AD setting, WeakREST attains an AP of 87.6%, outperforming the previous best of 86.0%. Notably, even when using weaker annotations such as bounding boxes, WeakREST exceeds the performance of leading methods relying on pixel-wise supervision, achieving an AP of 87.1% compared to the prior best of 86.0% on MVTec-AD. This superior performance is consistently replicated across other well-established AD datasets, including MVTec 3D, KSDD2 and Real-IAD. Code is available at: https://github.com/BeJane/Semi_REST
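The abstract does not detail how PosFAR is computed. In the spirit of residual-based representations, the hedged sketch below scores each patch by its residual to the nearest feature in a bank built from normal images; the bank construction and the L2 nearest-neighbour choice are our assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a residual-based anomaly feature: compare each test patch
# feature against a bank of normal patch features; the residual to its nearest
# normal neighbour is the anomaly cue (near zero for normal patches).
import torch

def residual_features(patches: torch.Tensor, normal_bank: torch.Tensor) -> torch.Tensor:
    # patches: (N, D) test patch features; normal_bank: (M, D) normal features
    d = torch.cdist(patches, normal_bank)       # (N, M) pairwise L2 distances
    nearest = normal_bank[d.argmin(dim=1)]      # closest normal feature per patch
    return patches - nearest                    # residual representation

bank = torch.randn(500, 128)                    # features harvested from normal data
test = torch.cat([bank[:4], torch.randn(4, 128)])  # 4 normal + 4 anomalous patches
res = residual_features(test, bank)
print(res.norm(dim=1))                          # zeros first, larger norms after
```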
Vol. 35, pp. 1551-1566
Citations: 0
Improving Unsupervised Ultrasonic Image Anomaly Detection via Frequency-Spatial Feature Filtering and Gaussian Mixture Modeling
IF 13.7 Pub Date: 2026-02-04 DOI: 10.1109/TIP.2026.3659292
Wenjing Zhang;Ke Lu;Jinbao Wang;Hao Liang;Can Gao;Jian Xue
Ultrasonic image anomaly detection faces significant challenges due to limited labeled data, strong structural and random noise, and highly diverse defect manifestations. To overcome these obstacles, we introduce UltraChip, a new large-scale C-scan benchmark containing about 8,000 real-world images from various chip packaging types, each meticulously annotated with pixel-level masks for cracks, holes, and layers. Building on this resource, we present FSGM-Net, a fully unsupervised framework tailored for anomaly detection. FSGM-Net leverages an adaptive frequency-spatial feature filtering mechanism: a learnable FFT-spatial patch filter first suppresses noise and dynamically assigns normality weights to Vision Transformer (ViT) patch features. Subsequently, an Adaptive Gaussian Mixture Model (Ada-GMM) captures the distribution of normal features and guides a deep-shallow multi-scale interaction decoder for accurate, pixel-level anomaly inference. In addition, we propose a filter loss that enforces encoder-filter consistency and entropy-based sparse gating, together with a distributional loss that encourages both feature reconstruction and confident Gaussian mixture modeling. Extensive experiments demonstrate that FSGM-Net not only achieves state-of-the-art results on UltraChip but also exhibits superior cross-domain generalization on MVTec-AD and VisA, while supporting real-time inference on a single GPU. Together, the dataset and framework advance robust, annotation-free ultrasonic non-destructive testing (NDT) in practical applications. The UltraChip dataset can be obtained via https://iiplab.net/ultrachip/
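To make the Gaussian-mixture stage concrete, here is a minimal scikit-learn sketch: fit a GMM to patch features from normal images, then score test patches by negative log-likelihood. The fixed component count and synthetic features are illustrative; the paper's Ada-GMM adapts the mixture rather than fixing it.

```python
# Hedged sketch of GMM-based anomaly scoring on patch features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, size=(2000, 32))   # features of normal patches
gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
gmm.fit(normal_feats)

test_feats = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 32)),                 # normal-like patches
    rng.normal(4.0, 1.0, size=(5, 32)),                 # anomalous patches
])
anomaly_score = -gmm.score_samples(test_feats)          # high score = anomalous
print(anomaly_score.round(1))                           # second half scores higher
```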
Vol. 35, pp. 1567-1581
Citations: 0
Attack-Augmented Mixing-Contrastive Skeletal Representation Learning
IF 13.7 Pub Date: 2026-02-04 DOI: 10.1109/TIP.2026.3659331
Binqian Xu;Xiangbo Shu;Jiachao Zhang;Rui Yan;Guo-Sen Xie
Contrastive learning facilitates the acquisition of informative skeleton representations for unsupervised action recognition by leveraging effective positive and negative sample pairs. However, most existing methods construct these pairs through weak or strong data augmentations, which typically rely on random appearance alterations of skeletons. While such augmentations are somewhat effective, they introduce semantic variations only indirectly and face two inherent limitations. First, simply modifying the appearance of skeletons often fails to reflect meaningful semantic variations. Second, random perturbations can unintentionally blur the boundary between positive and negative pairs, weakening the contrastive objective. To address these challenges, we propose an attack-driven augmentation framework that explicitly introduces semantic-level perturbations. This approach facilitates the generation of hard positives while guiding the model to mine more informative hard negatives. Building on this idea, we present Attack-Augmented Mixing-Contrastive Skeletal Representation Learning (A2MC), a novel framework that contrasts hard positive and hard negative samples for more robust representation learning. Within A2MC, we design an Attack-Augmentation (Att-Aug) module that integrates both targeted (attack-based) and untargeted (augmentation-based) perturbations to generate informative hard positive samples. In parallel, we propose the Positive-Negative Mixer (PNM), which blends hard positive and negative features to synthesize challenging hard negatives. These are then used to update a mixed memory bank for more effective contrastive learning. Comprehensive evaluations across three public benchmarks demonstrate that A2MC achieves performance on par with or exceeding existing state-of-the-art methods.
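Such frameworks build on a standard contrastive objective. The sketch below shows a plain InfoNCE loss over an anchor, one (possibly hard) positive, and a bank of negatives; the attack, augmentation, and mixing steps that produce the hard pairs sit upstream of this loss and are not reproduced here.

```python
# Hedged sketch: standard InfoNCE loss with a memory bank of negatives.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # anchor, positive: (B, D); negatives: (K, D) memory-bank features
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos_logit = (a * p).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg_logits = a @ n.T / tau                            # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(a.size(0), dtype=torch.long)     # the positive is class 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(4096, 128))
print(loss.item())
```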
Vol. 35, pp. 1521-1534
Citations: 0
Complementary Mixture-of-Experts and Complementary Cross-Attention for Single Image Reflection Separation in the Wild
IF 13.7 Pub Date: 2026-02-04 DOI: 10.1109/TIP.2026.3659334
Jonghyuk Park;Jae-Young Sim
Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods of SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then complementarily used to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.
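A minimal sketch of the dispatch idea behind a patch-level Mixture-of-Experts follows: a router scores each patch feature and routes it to its top-scoring expert, so patches with different reflection characteristics receive specialised processing. The expert count, the hard top-1 routing rule, and all names are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of patch-level Mixture-of-Experts routing (hard top-1 for
# simplicity; soft or top-k routing are common alternatives).
import torch
import torch.nn as nn

class PatchMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, D) patch features
        choice = self.router(tokens).softmax(dim=-1).argmax(dim=-1)  # expert per patch
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = expert(tokens[mask])  # each expert sees only its patches
        return out

moe = PatchMoE(dim=64)
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```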
Vol. 35, pp. 1607-1620
Citations: 0