With the development of real-time video conferencing, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity has become one of the main features of future multimedia services, which brings new challenges to Computer Vision (CV) for communications. In addition, individual CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the multiple demands of interactivity unless they are integrated. Meanwhile, with the rapid development of foundation models, we apply task-oriented semantic communications to address these demands. We therefore propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the interactivity requirements of multimedia services. First, at the transmitter, we perform causal understanding and spatiotemporal decoupling on interactive videos with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and Segment Anything Model 2 (SAM2) to accomplish video semantic segmentation. Second, for transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accommodates the asymmetric weights of semantic information in real-time video, achieving a low bit rate and high semantic fidelity. Third, at the receiver, RTVCFM performs multidimensional fusion of the complete semantic segmentation using the Diffusion Model for Foreground Background Fusion (DMFBF) and then reconstructs the video streams. Finally, simulation results demonstrate that RTVCFM achieves a compression ratio as high as 95.6% while guaranteeing high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), showing that the reconstructed video closely resembles the original.
{"title":"Foundation Model Empowered Real-Time Video Conference With Semantic Communications.","authors":"Mingkai Chen, Wenbo Ma, Mujian Zeng, Xiaoming He, Jian Xiong, Lei Wang, Anwer Al-Dulaimi, Shahid Mumtaz","doi":"10.1109/TIP.2026.3659719","DOIUrl":"10.1109/TIP.2026.3659719","url":null,"abstract":"<p><p>With the development of real-time video conferences, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity becomes one of the main features on future multimedia services, which brings a new challenge to Computer Vision (CV) for communications. In addition, many directions for CV in video, like recognition, understanding, saliency segmentation, coding, and so on, do not satisfy the demands of the multiple tasks of interactivity without integration. Meanwhile, with the rapid development of the foundation models, we apply task-oriented semantic communications to handle them. Therefore, we propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the requirement of interactivity in the multimedia service. Firstly, at the transmitter, we perform the causal understanding and spatiotemporal decoupling on interactive videos, with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA) and Segment Anything Model 2 (SAM2), to accomplish the video semantic segmentation. Secondly, in the transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which is also suitable for the weights of asymmetric semantic information in real-time video, so that we achieve a low bit rate and high semantic fidelity in the video transmission. Thirdly, at the receiver, RTVCFM provides multidimensional fusion with the whole semantic segmentation by using the Diffusion Model for Foreground Background Fusion (DMFBF), and then we reconstruct the video streams. Finally, the simulation result demonstrates that RTVCFM can achieve a compression ratio as high as 95.6%, while it guarantees high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), which shows that the reconstructed video is relatively similar to the original video.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1740-1755"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01, DOI: 10.1109/TIP.2026.3654348
Pan Zhao, Hui Yuan, Chongzhen Tian, Tian Guo, Raouf Hamzaoui, Zhigeng Pan
Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an innovative enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperformed existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), under the total bitrate setting, UGAE achieved an average BD-PSNR gain of 9.98 dB and -90.54% BD-bitrate for geometry under the D1 metric, as well as a 3.34 dB BD-PSNR improvement with -55.53% BD-bitrate for attributes. It also significantly improved perceptual quality. Our source code will be released on GitHub at: https://github.com/yuanhui0325/UGAE.
{"title":"UGAE: Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds.","authors":"Pan Zhao, Hui Yuan, Chongzhen Tian, Tian Guo, Raouf Hamzaoui, Zhigeng Pan","doi":"10.1109/TIP.2026.3654348","DOIUrl":"10.1109/TIP.2026.3654348","url":null,"abstract":"<p><p>Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an innovative enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperformed existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), in terms of total bitrate setting, UGAE achieved an average BD-PSNR gain of 9.98 dB and -90.54% BD-bitrate for geometry under the D1 metric, as well as a 3.34 dB BD-PSNR improvement with -55.53% BD-bitrate for attributes. Additionally, it improved perceptual quality significantly. Our source code will be released on GitHub at: https://github.com/yuanhui0325/UGAE.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"888-903"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01, DOI: 10.1109/TIP.2026.3658010
Hao Yang, Yue Sun, Hui Xie, Lina Zhao, Chi Kin Lam, Qiang Zhao, Xiangyu Xiong, Kunyan Cai, Behdad Dashtbozorg, Chenggang Yan, Tao Tan
The synthesis of computed tomography (CT) images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between regions, traditional approaches often require a separate model to be developed and used for each region. In this paper, we propose a unified, prompt-driven model that dynamically adapts to different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three datasets covering different anatomical regions demonstrate that our model generates clearer and more anatomically detailed CT images than other state-of-the-art translation models. The results of the dosimetric analysis also indicate that our proposed model generates images with dose distributions more closely aligned to those of the real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model; the repository is publicly accessible at: https://github.com/yhyumi123/RSAM.
{"title":"Anatomy-Aware MR-Imaging-Only Radiotherapy.","authors":"Hao Yang, Yue Sun, Hui Xie, Lina Zhao, Chi Kin Lam, Qiang Zhao, Xiangyu Xiong, Kunyan Cai, Behdad Dashtbozorg, Chenggang Yan, Tao Tan","doi":"10.1109/TIP.2026.3658010","DOIUrl":"10.1109/TIP.2026.3658010","url":null,"abstract":"<p><p>The synthesis of computed tomography images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between various regions, traditional approaches often require each model to undergo independent development and use. In this paper, we propose a unified model driven by prompts that dynamically adapt to the different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three datasets of anatomical parts demonstrate that our models generate clearer and more anatomically detailed CT images than other state-of-the-art translation models. The results of the dosimetric analysis also indicate that our proposed model generates images with dose distributions more closely aligned to those of the real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. we have released the source code for our RSAM model. The repository is accessible to the public at: https://github.com/yhyumi123/RSAM.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1680-1695"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01, DOI: 10.1109/TIP.2026.3659746
Mengzu Liu, Junwei Xu, Weisheng Dong, Le Dong, Guangming Shi
Image reconstruction in coded aperture snapshot spectral compressive imaging (CASSI) aims to recover high-fidelity hyperspectral images (HSIs) from compressed 2D measurements. While deep unfolding networks have shown promising performance, the degradation induced by the CASSI degradation model often introduces global illumination discrepancies in the reconstructions, creating artifacts similar to those in low-light images. To address these challenges, we propose a novel Retinex Prior-Driven Unfolding Network (RPDUN), which unfolds an optimization that incorporates the Retinex prior as a regularization term into a multi-stage network. This design provides global illumination adjustment for compressed measurements, effectively compensating for spatial-spectral degradation according to the physical modulation and capturing intrinsic spectral characteristics. To the best of our knowledge, this is the first application of the Retinex prior to hyperspectral image reconstruction. Furthermore, to mitigate noise in the reflectance domain, which can be amplified during decomposition, we introduce an Adaptive Token Selection Transformer (ATST). This module adaptively filters out weakly correlated tokens before the self-attention computation, effectively reducing noise and artifacts within the recovered reflectance map. Extensive experiments on both simulated and real-world datasets demonstrate that RPDUN achieves new state-of-the-art performance, significantly improving reconstruction quality while maintaining computational efficiency. The code is available at https://github.com/ZUGE0312/RPDUN.
{"title":"Learning Retinex Prior for Compressive Hyperspectral Image Reconstruction.","authors":"Mengzu Liu, Junwei Xu, Weisheng Dong, Le Dong, Guangming Shi","doi":"10.1109/TIP.2026.3659746","DOIUrl":"10.1109/TIP.2026.3659746","url":null,"abstract":"<p><p>Image reconstruction in coded aperture snapshot spectral compressive imaging (CASSI) aims to recover high-fidelity hyperspectral images (HSIs) from compressed 2D measurements. While deep unfolding networks have shown promising performance, the degradation induced by the CASSI degradation model often introduces global illumination discrepancies in the reconstructions, creating artifacts similar to those in low-light images. To address these challenges, we propose a novel Retinex Prior-Driven Unfolding Network (RPDUN), which unfolds the optimization incorporating the Retinex prior as a regularization term into a multi-stage network. This design provides global illumination adjustment for compressed measurements, effectively compensating for spatial-spectral degradation according to physical modulation and capturing intrinsic spectral characteristics. To the best of our knowledge, this is the first application of the Retinex prior in hyperspectral image reconstruction. Furthermore, to mitigate the noise in the reflectance domain, which can be amplified during decomposition, we introduce an Adaptive Token Selection Transformer (ATST). This module adaptively filters out weakly correlated tokens before the self-attention computation, effectively reducing noise and artifacts within the recovered reflectance map. Extensive experiments on both simulated and real-world datasets demonstrate that RPDUN achieves new state-of-the-art performance, significantly improving reconstruction quality while maintaining computational efficiency. The code is available at https://github.com/ZUGE0312/RPDUN.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1786-1801"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Directly reconstructing a 3D CT volume from few-view 2D X-rays using an end-to-end deep learning network is a challenging task, as X-ray images are merely projection views of the 3D CT volume. In this work, we facilitate the complex 2D X-ray to 3D CT mapping by incorporating new view synthesis, and we reduce the learning difficulty through view-guided feature alignment. Specifically, we propose a dual-view guided diffusion model (DVG-Diffusion), which couples a real input X-ray view and a synthesized new X-ray view to jointly guide CT reconstruction. First, a novel view-parameter-guided encoder captures features from X-rays that are spatially aligned with the CT. Next, we concatenate the extracted dual-view features as conditions for the latent diffusion model to learn and refine the CT latent representation. Finally, the CT latent representation is decoded into a CT volume in pixel space. By incorporating view-parameter-guided encoding and dual-view guided CT reconstruction, our DVG-Diffusion achieves an effective balance between high fidelity and perceptual quality in CT reconstruction. Experimental results demonstrate that our method outperforms state-of-the-art methods. Comprehensive analyses and discussions of the views and the reconstruction, based on these experiments, are also presented. The model and code are available at https://github.com/xiexing0916/DVG-Diffusion.
{"title":"DVG-Diffusion: Dual-View-Guided Diffusion Model for CT Reconstruction From X-Rays.","authors":"Xing Xie, Jiawei Liu, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu","doi":"10.1109/TIP.2026.3655171","DOIUrl":"10.1109/TIP.2026.3655171","url":null,"abstract":"<p><p>Directly reconstructing 3D CT volume from few-view 2D X-rays using an end-to-end deep learning network is a challenging task, as X-ray images are merely projection views of the 3D CT volume. In this work, we facilitate complex 2D X-ray image to 3D CT mapping by incorporating new view synthesis, and reduce the learning difficulty through view-guided feature alignment. Specifically, we propose a dual-view guided diffusion model (DVG-Diffusion), which couples a real input X-ray view and a synthesized new X-ray view to jointly guide CT reconstruction. First, a novel view parameter-guided encoder captures features from X-rays that are spatially aligned with CT. Next, we concatenate the extracted dual-view features as conditions for the latent diffusion model to learn and refine the CT latent representation. Finally, the CT latent representation is decoded into a CT volume in pixel space. By incorporating view parameter guided encoding and dual-view guided CT reconstruction, our DVG-Diffusion can achieve an effective balance between high fidelity and perceptual quality for CT reconstruction. Experimental results demonstrate our method outperforms state-of-the-art methods. Based on experiments, the comprehensive analysis and discussions for views and reconstruction are also presented. The model and code are available at https://github.com/xiexing0916/DVG-Diffusion.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1158-1173"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146042340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diffusion models have demonstrated impressive abilities in generating photo-realistic and creative images. To offer more controllability over the generation process of diffusion models, previous studies normally adopt extra modules that integrate condition signals by manipulating the intermediate features of the noise predictors, but these often fail on conditions not seen during training. Although subsequent studies are motivated to handle multi-condition control, they are mostly resource-consuming to implement, so more generalizable and efficient solutions are desirable for controllable visual generation. In this paper, we present a late-constraint controllable visual generation method, namely LaCon, which generalizes across various modalities and granularities for each single-condition control. LaCon establishes an alignment between the external condition and specific diffusion timesteps, and guides diffusion models to produce conditional results based on this alignment. Experimental results on prevailing benchmark datasets illustrate the promising performance and generalization capability of LaCon under various conditions and settings. Ablation studies analyze the different components of LaCon, illustrating its great potential to offer flexible condition controls for different backbones.
{"title":"LaCon: Late-Constraint Controllable Visual Generation.","authors":"Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, Xin Luo, Dong Liu","doi":"10.1109/TIP.2026.3654412","DOIUrl":"10.1109/TIP.2026.3654412","url":null,"abstract":"<p><p>Diffusion models have demonstrated impressive abilities in generating photo-realistic and creative images. To offer more controllability for the generation process of diffusion models, previous studies normally adopt extra modules to integrate condition signals by manipulating the intermediate features of the noise predictors, where they often fail in conditions not seen in the training. Although subsequent studies are motivated to handle multi-condition control, they are mostly resource-consuming to implement, where more generalizable and efficient solutions are expected for controllable visual generation. In this paper, we present a late-constraint controllable visual generation method, namely LaCon, which enables generalization across various modalities and granularities for each single-condition control. LaCon establishes an alignment between the external condition and specific diffusion timesteps, and guides diffusion models to produce conditional results based on this built alignment. Experimental results on prevailing benchmark datasets illustrate the promising performance and generalization capability of LaCon under various conditions and settings. Ablation studies analyze different components in LaCon, illustrating its great potential to offer flexible condition controls for different backbones.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1111-1126"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146042362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01, DOI: 10.1109/TIP.2026.3657214
Yun Zhang, Feifan Chen, Na Li, Zhiwei Guo, Xu Wang, Fen Miao, Sam Kwong
Colored point clouds, comprising geometry and attribute components, are one of the mainstream representations enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method, which learns to model both geometry and attribute patterns and leverages spatial attribute correlation. Firstly, we establish and release a large-scale dataset for colored point cloud up-sampling, named SYSU-PCUD, which contains 121 large-scale colored point clouds with diverse geometry and attribute complexity across six categories and four sampling rates. Secondly, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework that up-samples the geometry and attributes jointly. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Thirdly, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarsely up-sampled attributes for each point. Then, we propose an attribute enhancement module to refine the up-sampled attributes and generate high-quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that the Peak Signal-to-Noise Ratio (PSNR) values achieved by the proposed JGAU are 33.90 dB, 32.10 dB, 31.10 dB, and 30.39 dB at up-sampling rates of $4\times$, $8\times$, $12\times$, and $16\times$, respectively. Compared with state-of-the-art schemes, JGAU achieves significant average PSNR gains of 2.32 dB, 2.47 dB, 2.28 dB, and 2.11 dB at the four up-sampling rates, respectively. The code is released at https://github.com/SYSU-Video/JGAU.
{"title":"Deep Learning-Based Joint Geometry and Attribute Up-Sampling for Large-Scale Colored Point Clouds.","authors":"Yun Zhang, Feifan Chen, Na Li, Zhiwei Guo, Xu Wang, Fen Miao, Sam Kwong","doi":"10.1109/TIP.2026.3657214","DOIUrl":"10.1109/TIP.2026.3657214","url":null,"abstract":"<p><p>Colored point cloud comprising geometry and attribute components is one of the mainstream representations enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method, which learns to model both geometry and attribute patterns and leverages the spatial attribute correlation. Firstly, we establish and release a large-scale dataset for colored point cloud up-sampling, named SYSU-PCUD, which has 121 large-scale colored point clouds with diverse geometry and attribute complexities in six categories and four sampling rates. Secondly, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework to up-sample the geometry and attribute jointly. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Thirdly, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarsely up-sampled attributes for each point. Then, we propose an attribute enhancement module to refine the up-sampled attributes and generate high quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that Peak Signal-to-Noise Ratio (PSNR) achieved by the proposed JGAU are 33.90 dB, 32.10 dB, 31.10 dB, and 30.39 dB when up-sampling rates are $4times $ , $8times $ , $12times $ , and $16times $ , respectively. Compared to the state-of-the-art schemes, the JGAU achieves an average of 2.32 dB, 2.47 dB, 2.28 dB and 2.11 dB PSNR gains at four up-sampling rates, respectively, which are significant. The code is released with https://github.com/SYSU-Video/JGAU.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1305-1320"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146088455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01, DOI: 10.1109/TIP.2026.3659302
Liang Wu, Jianjun Wang, Wei-Shi Zheng, Guangming Shi
Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. The low-rank prior is mathematically derived from a linear latent variable model. However, for nonlinear tensor data with rich information, the nonlinear structures may break the low-rankness assumption and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, the general paradigm of the nonlinear tensor plus sparse tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA), is first established in this paper. To efficiently tackle the TRKPCA problem, two novel nonconvex regularizers, the kernelized tensor Schatten-$p$ norm (KTSPN) and a generalized nonconvex regularization, are designed: the former (KTSPN), with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), while the latter ensures sparser structural coding, guaranteeing more robust separation results. By integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show that our method is more competitive than other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729_DNTRKPCA_code.
{"title":"Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications.","authors":"Liang Wu, Jianjun Wang, Wei-Shi Zheng, Guangming Shi","doi":"10.1109/TIP.2026.3659302","DOIUrl":"10.1109/TIP.2026.3659302","url":null,"abstract":"<p><p>Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical process of the low-rank prior is derived from the linear latent variable model. However, for nonlinear tensor data with rich information, their nonlinear structures may break through the assumption of low-rankness and lead to the large approximation error for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, the general paradigm of the nonlinear tensor plus sparse tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA), is first established in this paper. To efficiently tackle TRKPCA problem, two novel nonconvex regularizers the kernelized tensor Schatten- $p$ norm (KTSPN) and generalized nonconvex regularization are designed, where the former KTSPN with tighter theoretical support adequately captures nonlinear features (i.e., implicit low-rankness) and the latter ensures the sparser structural coding, guaranteeing more robust separation results. Then by integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method to achieve our expectation. Finally, we develop an efficient optimization framework via the alternating direction multiplier method (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show the higher competitiveness of our method compared with other state-of-the-art regularization methods. The code has been released in our ResearchGate homepage: https://www.researchgate.net/publication/397181729_DNTRKPCA_code.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1711-1726"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01, DOI: 10.1109/TIP.2026.3660576
Zhiwen Yang, Yuxin Peng
Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic label of each voxel in the surrounding 3D scene from image inputs, providing a voxel-level scene perception foundation for perception-prediction-planning autonomous driving systems. Although significant progress has been made, existing methods rely solely on supervision from voxel labels for optimization and face the challenge of voxel sparsity, as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits scene-level and instance-level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level by fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences between a specific voxel and its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors under the guidance of cubic semantic anisotropy, and applies a circulated loss as auxiliary supervision for critical feature distribution consistency across different resolutions. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.
{"title":"Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion.","authors":"Zhiwen Yang, Yuxin Peng","doi":"10.1109/TIP.2026.3660576","DOIUrl":"10.1109/TIP.2026.3660576","url":null,"abstract":"<p><p>Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for the perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization rely solely on the supervision from voxel labels and face the challenge of voxel sparsity as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits the scene and instance level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1771-1785"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization.
{"title":"Causally-Aware Unsupervised Feature Selection Learning.","authors":"Zongxin Shen, Yanyong Huang, Dongjie Wang, Minbo Ma, Fengmao Lv, Tianrui Li","doi":"10.1109/TIP.2026.3654354","DOIUrl":"10.1109/TIP.2026.3654354","url":null,"abstract":"<p><p>Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"1011-1024"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146042205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}