Pub Date: 2025-02-04 | DOI: 10.1109/TIP.2025.3534033
Junye Chen;Chaowei Fang;Jichang Li;Yicheng Leng;Guanbin Li
This paper aims to restore original background images in watermarked videos, overcoming challenges posed by traditional approaches that fail to handle the temporal dynamics and diverse watermark characteristics effectively. Our method introduces a unique framework that first “decouples” the extraction of prior knowledge—such as common-sense knowledge and residual background details—from the temporal modeling process, allowing for independent handling of background restoration and temporal consistency. Subsequently, it “couples” these extracted features by integrating them into the temporal modeling backbone of a video inpainting (VI) framework. This integration is facilitated by a specialized module, which includes an intrinsic background image prediction sub-module and a dual-branch frame embedding module, designed to reduce watermark interference and enhance the application of prior knowledge. Moreover, a frame-adaptive feature selection module dynamically adjusts the extraction of prior features based on the corruption level of each frame, ensuring their effective incorporation into the temporal processing. Extensive experiments on YouTube-VOS and DAVIS datasets validate our method’s efficiency in watermark removal and background restoration, showing significant improvement over state-of-the-art techniques in visible image watermark removal, video restoration, and video inpainting.
{"title":"Decouple and Couple: Exploiting Prior Knowledge for Visible Video Watermark Removal","authors":"Junye Chen;Chaowei Fang;Jichang Li;Yicheng Leng;Guanbin Li","doi":"10.1109/TIP.2025.3534033","DOIUrl":"10.1109/TIP.2025.3534033","url":null,"abstract":"This paper aims to restore original background images in watermarked videos, overcoming challenges posed by traditional approaches that fail to handle the temporal dynamics and diverse watermark characteristics effectively. Our method introduces a unique framework that first “decouples” the extraction of prior knowledge—such as common-sense knowledge and residual background details—from the temporal modeling process, allowing for independent handling of background restoration and temporal consistency. Subsequently, it “couples” these extracted features by integrating them into the temporal modeling backbone of a video inpainting (VI) framework. This integration is facilitated by a specialized module, which includes an intrinsic background image prediction sub-module and a dual-branch frame embedding module, designed to reduce watermark interference and enhance the application of prior knowledge. Moreover, a frame-adaptive feature selection module dynamically adjusts the extraction of prior features based on the corruption level of each frame, ensuring their effective incorporation into the temporal processing. Extensive experiments on YouTube-VOS and DAVIS datasets validate our method’s efficiency in watermark removal and background restoration, showing significant improvement over state-of-the-art techniques in visible image watermark removal, video restoration, and video inpainting.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1192-1203"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04 | DOI: 10.1109/TIP.2025.3526061
Haili Ye;Xiaoqing Zhang;Yan Hu;Huazhu Fu;Jiang Liu
The morphologies of vessel-like structures, such as blood vessels and nerve fibres, play significant roles in disease diagnosis, e.g., Parkinson's disease. Although deep network-based refinement segmentation and topology-preserving segmentation methods have recently achieved promising results in segmenting vessel-like structures, they still face two challenges: 1) existing methods often have limitations in rehabilitating subsection ruptures in segmented vessel-like structures; 2) they are typically overconfident in predicted segmentation results. To tackle these two challenges, this paper leverages the spatial interconnection relationships among subsection ruptures from the structure rehabilitation perspective. Based on this perspective, we propose a novel Vessel-like Structure Rehabilitation Network (VSR-Net) to both rehabilitate subsection ruptures and improve model calibration based on coarse vessel-like structure segmentation results. VSR-Net first constructs subsection rupture clusters via a Curvilinear Clustering Module (CCM). Then, the well-designed Curvilinear Merging Module (CMM) is applied to rehabilitate the subsection ruptures and obtain the refined vessel-like structures. Extensive experiments on six 2D/3D medical image datasets show that VSR-Net significantly outperforms state-of-the-art (SOTA) refinement segmentation methods with lower calibration errors. Additionally, we provide a quantitative analysis to explain the morphological differences between VSR-Net's rehabilitation results and the ground truth (GT); these differences are smaller than those between SOTA methods and GT, demonstrating that our method more effectively rehabilitates vessel-like structures.
{"title":"VSR-Net: Vessel-Like Structure Rehabilitation Network With Graph Clustering","authors":"Haili Ye;Xiaoqing Zhang;Yan Hu;Huazhu Fu;Jiang Liu","doi":"10.1109/TIP.2025.3526061","DOIUrl":"10.1109/TIP.2025.3526061","url":null,"abstract":"The morphologies of vessel-like structures, such as blood vessels and nerve fibres, play significant roles in disease diagnosis, e.g., Parkinson’s disease. Although deep network-based refinement segmentation and topology-preserving segmentation methods recently have achieved promising results in segmenting vessel-like structures, they still face two challenges: 1) existing methods often have limitations in rehabilitating subsection ruptures in segmented vessel-like structures; 2) they are typically overconfident in predicted segmentation results. To tackle these two challenges, this paper attempts to leverage the potential of spatial interconnection relationships among subsection ruptures from the structure rehabilitation perspective. Based on this perspective, we propose a novel Vessel-like Structure Rehabilitation Network (VSR-Net) to both rehabilitate subsection ruptures and improve the model calibration based on coarse vessel-like structure segmentation results. VSR-Net first constructs subsection rupture clusters via a Curvilinear Clustering Module (CCM). Then, the well-designed Curvilinear Merging Module (CMM) is applied to rehabilitate the subsection ruptures to obtain the refined vessel-like structures. Extensive experiments on six 2D/3D medical image datasets show that VSR-Net significantly outperforms state-of-the-art (SOTA) refinement segmentation methods with lower calibration errors. Additionally, we provide quantitative analysis to explain the morphological difference between the VSR-Net’s rehabilitation results and ground truth (GT), which are smaller compared to those between SOTA methods and GT, demonstrating that our method more effectively rehabilitates vessel-like structures.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1090-1105"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04 | DOI: 10.1109/TIP.2025.3536217
Fangfang Wu;Tao Huang;Junwei Xu;Xun Cao;Weisheng Dong;Le Dong;Guangming Shi
Conventional spectral image demosaicing algorithms rely on pixels' spatial or spectral correlations for reconstruction. Due to the missing data in the multispectral filter array (MSFA), the estimation of these correlations is inaccurate, leading to poor reconstruction results, and such algorithms are also time-consuming. Deep learning-based spectral image demosaicing methods directly learn the nonlinear mapping between 2D spectral mosaic images and 3D multispectral images. However, these learning-based methods focus only on learning the mapping in the spatial domain and neglect valuable image information in the frequency domain, resulting in limited reconstruction quality. To address these issues, this paper proposes a novel lightweight spectral image demosaicing method based on joint spatial and frequency domain information learning. First, a novel parameter-free spectral image initialization strategy based on the Fourier transform is proposed, which yields better initialized spectral images and eases the subsequent spectral image reconstruction. Furthermore, an efficient spatial-frequency transformer network is proposed, which jointly learns the spatial correlations and the frequency domain characteristics. Compared to existing learning-based spectral image demosaicing methods, the proposed method significantly reduces the number of model parameters and the computational complexity. Extensive experiments on simulated and real-world data show that the proposed method notably outperforms existing spectral image demosaicing methods.
{"title":"Joint Spatial and Frequency Domain Learning for Lightweight Spectral Image Demosaicing","authors":"Fangfang Wu;Tao Huang;Junwei Xu;Xun Cao;Weisheng Dong;Le Dong;Guangming Shi","doi":"10.1109/TIP.2025.3536217","DOIUrl":"10.1109/TIP.2025.3536217","url":null,"abstract":"Conventional spectral image demosaicing algorithms rely on pixels’ spatial or spectral correlations for reconstruction. Due to the missing data in the multispectral filter array (MSFA), the estimation of spatial or spectral correlations is inaccurate, leading to poor reconstruction results, and these algorithms are time-consuming. Deep learning-based spectral image demosaicing methods directly learn the nonlinear mapping relationship between 2D spectral mosaic images and 3D multispectral images. However, these learning-based methods focused only on learning the mapping relationship in the spatial domain, but neglected valuable image information in the frequency domain, resulting in limited reconstruction quality. To address the above issues, this paper proposes a novel lightweight spectral image demosaicing method based on joint spatial and frequency domain information learning. First, a novel parameter-free spectral image initialization strategy based on the Fourier transform is proposed, which leads to better initialized spectral images and eases the difficulty of subsequent spectral image reconstruction. Furthermore, an efficient spatial-frequency transformer network is proposed, which jointly learns the spatial correlations and the frequency domain characteristics. Compared to existing learning-based spectral image demosaicing methods, the proposed method significantly reduces the number of model parameters and computational complexity. Extensive experiments on simulated and real-world data show that the proposed method notably outperforms existing spectral image demosaicing methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1119-1132"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04 | DOI: 10.1109/TIP.2025.3536208
Senlong Huang;Yongxin Ge;Dongfang Liu;Mingjian Hong;Junhan Zhao;Alexander C. Loui
Semi-supervised learning based on consistency learning offers significant promise for enhancing medical image segmentation. Current approaches use copy-paste as an effective data perturbation technique to facilitate weak-to-strong consistency learning. However, these techniques often reduce the accuracy of the synthetic labels corresponding to the synthetic data and introduce excessive perturbations to the distribution of the training data. Such over-perturbation causes the data distribution to stray from its true distribution, thereby impairing the model's generalization capability as it learns the decision boundaries. We propose a weak-to-strong consistency learning framework that addresses these issues with two primary designs: 1) it emphasizes the use of highly reliable data to enhance the quality of labels in synthetic datasets through cross-copy-pasting between labeled and unlabeled datasets; 2) it employs uncertainty estimation and foreground region constraints to meticulously filter the regions for copy-pasting, so that the copy-paste operation introduces a beneficial perturbation to the training data distribution. Our framework extends the copy-paste method by addressing its inherent limitations and amplifying the potential of data perturbations for consistency learning. We extensively validated our model on six publicly available medical image segmentation datasets across different diagnostic tasks, including the segmentation of cardiac structures, prostate structures, brain structures, skin lesions, and gastrointestinal polyps. The results demonstrate that our method significantly outperforms state-of-the-art models. For instance, on the PROMISE12 dataset for prostate structure segmentation, using only 10% labeled data, our method achieves a 15.31% higher Dice score than the baseline models. Our experimental code will be made publicly available at https://github.com/slhuang24/RCP4CL.
{"title":"Rethinking Copy-Paste for Consistency Learning in Medical Image Segmentation","authors":"Senlong Huang;Yongxin Ge;Dongfang Liu;Mingjian Hong;Junhan Zhao;Alexander C. Loui","doi":"10.1109/TIP.2025.3536208","DOIUrl":"10.1109/TIP.2025.3536208","url":null,"abstract":"Semi-supervised learning based on consistency learning offers significant promise for enhancing medical image segmentation. Current approaches use copy-paste as an effective data perturbation technique to facilitate weak-to-strong consistency learning. However, these techniques often lead to a decrease in the accuracy of synthetic labels corresponding to the synthetic data and introduce excessive perturbations to the distribution of the training data. Such over-perturbation causes the data distribution to stray from its true distribution, thereby impairing the model’s generalization capabilities as it learns the decision boundaries. We propose a weak-to-strong consistency learning framework that integrally addresses these issues with two primary designs: 1) it emphasizes the use of highly reliable data to enhance the quality of labels in synthetic datasets through cross-copy-pasting between labeled and unlabeled datasets; 2) it employs uncertainty estimation and foreground region constraints to meticulously filter the regions for copy-pasting, thus the copy-paste technique implemented introduces a beneficial perturbation to the training data distribution. Our framework expands the copy-paste method by addressing its inherent limitations, and amplifying the potential of data perturbations for consistency learning. We extensively validated our model using six publicly available medical image segmentation datasets across different diagnostic tasks, including the segmentation of cardiac structures, prostate structures, brain structures, skin lesions, and gastrointestinal polyps. The results demonstrate that our method significantly outperforms state-of-the-art models. For instance, on the PROMISE12 dataset for the prostate structure segmentation task, using only 10% labeled data, our method achieves a 15.31% higher Dice score compared to the baseline models. Our experimental code will be made publicly available at <uri>https://github.com/slhuang24/RCP4CL</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1060-1074"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04 | DOI: 10.1109/TIP.2025.3536219
Ruoyu Zhao;Yushu Zhang;Rushi Lan;Shuang Yi;Zhongyun Hua;Jian Weng
In this paper, we explore a new road for format-compatible 3D object encryption by proposing a novel mechanism that leverages 2D image encryption methods. It alleviates the difficulty of designing 3D object encryption schemes stemming from the intrinsic intricacy of the data structure, and enables flexible and diverse 3D object encryption designs. First, turning complexity into simplicity, the vertex values, real numbers with continuous values, are converted into integers ranging from 0 to 255; the simplification result for a 3D object is a 2D numerical matrix. Second, six prototypes covering three encryption patterns (permutation, diffusion, and permutation-diffusion) are designed as exemplifications to encrypt the 2D matrix. Third, the integer-valued elements in the encrypted numeric matrix are converted back into real numbers complying with the syntax of the 3D object. In addition, experiments are conducted to verify the effectiveness of the proposed mechanism.
"All Roads Lead to Rome: Achieving 3D Object Encryption Through 2D Image Encryption Methods," IEEE Transactions on Image Processing, vol. 34, pp. 1075-1089.
Pub Date: 2025-02-04 | DOI: 10.1109/TIP.2025.3536201
Chen Zhu;Guo Lu;Bing He;Rong Xie;Li Song
With the growing adoption of 3D displays and virtual reality, multi-view video has become a promising format. However, its high resolution and multi-camera capture result in a substantial increase in data volume, making storage and transmission challenging. To tackle these difficulties, we propose an implicit-explicit integrated representation for multi-view video compression. Specifically, we first use an explicit representation-based 2D video codec to encode one of the source views. Subsequently, we employ an implicit neural representation (INR)-based codec to encode the remaining views. The implicit codec takes the time and view indices of the multi-view video as coordinate inputs and generates the corresponding implicit reconstruction frames. To enhance compressibility, we introduce a multi-level feature grid embedding and a fully convolutional architecture into the implicit codec. These components facilitate coordinate-feature and feature-RGB mapping, respectively. To further enhance the reconstruction quality from the INR codec, we leverage the high-quality reconstructed frames from the explicit codec to achieve inter-view compensation. Finally, the compensated results are fused with the implicit reconstructions from the INR to obtain the final reconstructed frames. Our proposed framework combines the strengths of both implicit neural representation and explicit 2D codecs. Extensive experiments conducted on public datasets demonstrate that the proposed framework achieves comparable or even superior performance to the latest multi-view video compression standard MIV and other INR-based schemes in terms of view compression and scene modeling. The source code can be found at https://github.com/zc-lynen/MV-IERV.
{"title":"Implicit-Explicit Integrated Representations for Multi-View Video Compression","authors":"Chen Zhu;Guo Lu;Bing He;Rong Xie;Li Song","doi":"10.1109/TIP.2025.3536201","DOIUrl":"10.1109/TIP.2025.3536201","url":null,"abstract":"With the increasing consumption of 3D displays and virtual reality, multi-view video has become a promising format. However, its high resolution and multi-camera shooting result in a substantial increase in data volume, making storage and transmission a challenging task. To tackle these difficulties, we propose an implicit-explicit integrated representation for multi-view video compression. Specifically, we first use the explicit representation-based 2D video codec to encode one of the source views. Subsequently, we propose employing the implicit neural representation (INR)-based codec to encode the remaining views. The implicit codec takes the time and view index of multi-view video as coordinate input and generates the corresponding implicit reconstruction frames. To enhance the compressibility, we introduce a multi-level feature grid embedding and a fully convolutional architecture into the implicit codec. These components facilitate coordinate-feature and feature-RGB mapping, respectively. To further enhance the reconstruction quality from the INR codec, we leverage the high-quality reconstructed frames from the explicit codec to achieve inter-view compensation. Finally, the compensated results are fused with the implicit reconstructions from the INR to obtain the final reconstructed frames. Our proposed framework combines the strengths of both implicit neural representation and explicit 2D codec. Extensive experiments conducted on public datasets demonstrate that the proposed framework can achieve comparable or even superior performance to the latest multi-view video compression standard MIV and other INR-based schemes in terms of view compression and scene modeling. The source code can be found at <uri>https://github.com/zc-lynen/MV-IERV</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1106-1118"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-03 | DOI: 10.1109/TIP.2025.3534532
Quan Tang;Chuanjian Liu;Fagui Liu;Jun Jiang;Bowen Zhang;C. L. Philip Chen;Kai Han;Yunhe Wang
The encoder-decoder architecture is a prevailing paradigm for semantic segmentation. It has been discovered that aggregation of multi-stage encoder features plays a significant role in capturing discriminative pixel representations. In this work, we rethink feature reconstruction for scale alignment of multi-stage pyramidal features and treat it as a Query Update (Q-UP) task. Pixel-wise affinity scores are calculated between the high-resolution query map and the low-resolution feature map to dynamically broadcast low-resolution pixel features to match a higher resolution. Unlike prior works (e.g., bilinear interpolation) that only exploit sub-pixel neighborhoods, Q-UP samples contextual information within a global receptive field in a data-dependent manner. To alleviate intra-category feature variance, we substitute source pixel features for feature reconstruction with their corresponding category prototype, computed by averaging all pixel features belonging to that category. Besides, a memory module is proposed to explore the capacity of category prototypes at the dataset level. We refer to the method as the Category Prototype Transformer (CPT). We conduct extensive experiments on popular benchmarks. Integrating CPT into a feature pyramid structure yields superior semantic segmentation performance even with low-resolution feature maps, e.g., 1/32 of the input size, significantly reducing computational complexity. Specifically, the proposed method obtains a compelling 55.5% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.
{"title":"Rethinking Feature Reconstruction via Category Prototype in Semantic Segmentation","authors":"Quan Tang;Chuanjian Liu;Fagui Liu;Jun Jiang;Bowen Zhang;C. L. Philip Chen;Kai Han;Yunhe Wang","doi":"10.1109/TIP.2025.3534532","DOIUrl":"10.1109/TIP.2025.3534532","url":null,"abstract":"The encoder-decoder architecture is a prevailing paradigm for semantic segmentation. It has been discovered that aggregation of multi-stage encoder features plays a significant role in capturing discriminative pixel representation. In this work, we rethink feature reconstruction for scale alignment of multi-stage pyramidal features and treat it as a Query Update (Q-UP) task. Pixel-wise affinity scores are calculated between the high-resolution query map and low-resolution feature map to dynamically broadcast low-resolution pixel features to match a higher resolution. Unlike prior works (e.g. bilinear interpolation) that only exploit sub-pixel neighborhoods, Q-UP samples contextual information within a global receptive field via a data-dependent manner. To alleviate intra-category feature variance, we substitute source pixel features for feature reconstruction with their corresponding category prototype that is assessed by averaging all pixel features belonging to that category. Besides, a memory module is proposed to explore the capacity of category prototypes at the dataset level. We refer to the method as Category Prototype Transformer (CPT). We conduct extensive experiments on popular benchmarks. Integrating CPT into a feature pyramid structure exhibits superior performance for semantic segmentation even with low-resolution feature maps, e.g. 1/32 of the input size, significantly reducing computational complexity. Specifically, the proposed method obtains a compelling 55.5% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1036-1047"},"PeriodicalIF":0.0,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143083719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-31 | DOI: 10.1109/TIP.2025.3534559
Jianjia Zhang;Zirong Li;Jiayi Pan;Shaoyu Wang;Weiwen Wu
The reconstruction of limited data computed tomography (CT) aims to obtain high-quality images from a reduced set of projection views acquired from sparse views or limited angles. This approach is used to reduce radiation exposure or expedite the scanning process. Deep Learning (DL) techniques have been incorporated into limited data CT reconstruction tasks and achieve remarkable performance. However, these DL methods suffer from several limitations. Firstly, the distribution inconsistency between simulation data and real data hinders the generalization of these DL-based methods. Secondly, these DL-based methods can be unstable due to a lack of kernel awareness. This paper addresses these issues by proposing an unrolling framework called Progressive Artifact Image Learning (PAIL) for limited data CT reconstruction. The proposed PAIL primarily consists of three key modules, i.e., a residual domain module (RDM), an image domain module (IDM), and a wavelet domain module (WDM). The RDM is designed to refine features from residual images and suppress the observable artifacts in the reconstructed images. This module can effectively alleviate the effects of distribution inconsistency among different data sets by transferring the optimization space from the original data domain to the residual data domain. The IDM is designed to suppress the unobservable artifacts in the image space. The RDM and IDM collaborate with each other during the iterative optimization process, progressively removing artifacts and reconstructing the underlying CT image. Furthermore, in order to avoid potential hallucinations generated by the RDM and IDM, an additional WDM is incorporated into the network to enhance its stability. This is achieved by making the network kernel-aware through the integration of wavelet-based compressed sensing. The effectiveness of the proposed PAIL method has been consistently verified on two simulated CT data sets, a clinical cardiac data set, and a sheep lung data set. Compared to other state-of-the-art methods, the proposed PAIL method achieves superior performance in various limited data CT reconstruction tasks, demonstrating its promising generalization and stability.
"Trustworthy Limited Data CT Reconstruction Using Progressive Artifact Image Learning," IEEE Transactions on Image Processing, vol. 34, pp. 1163-1178.
Pub Date: 2025-01-30 | DOI: 10.1109/TIP.2025.3533204
Tianyi Li;Mai Xu;Zheng Liu;Ying Chen;Kai Li
The latest versatile video coding (VVC) standard proposed by the Joint Video Exploration Team (JVET) has significantly improved coding efficiency over its predecessor, but at the cost of a roughly 6 to 26 times higher computational complexity. The quad-tree plus multi-type tree (QTMT)-based coding unit (CU) partition accounts for most of the encoding time in VVC encoding. This paper proposes a data-driven fast CU partition approach based on an efficient Transformer model to accelerate VVC inter-coding. First, we establish a large-scale database for inter-mode VVC, comprising diverse CU partition patterns from more than 800 raw video sequences across various resolutions and contents. Next, we propose a deep neural network model with a Transformer-based temporal topology for predicting the CU partition, named TCP-Net, which adapts to the group of pictures (GOP) hierarchy in VVC. Then, we design a two-stage structured output for TCP-Net, reflecting both the locations of CU edges and the split modes of all possible CUs. Accordingly, we develop a dual-supervised optimization mechanism to train the TCP-Net model with improved accuracy. The experimental results verify that our approach can reduce the encoding time by 46.89% to 55.91% with negligible rate-distortion (RD) degradation, outperforming other state-of-the-art approaches.
{"title":"A Deep Transformer-Based Fast CU Partition Approach for Inter-Mode VVC","authors":"Tianyi Li;Mai Xu;Zheng Liu;Ying Chen;Kai Li","doi":"10.1109/TIP.2025.3533204","DOIUrl":"10.1109/TIP.2025.3533204","url":null,"abstract":"The latest versatile video coding (VVC) standard proposed by the Joint Video Exploration Team (JVET) has significantly improved coding efficiency compared to that of its predecessor, while introducing an extremely higher computational complexity by <inline-formula> <tex-math>$6sim 26$ </tex-math></inline-formula> times. The quad-tree plus multi-type tree (QTMT)-based coding unit (CU) partition accounts for most of the encoding time in VVC encoding. This paper proposes a data-driven fast CU partition approach based on an efficient Transformer model to accelerate VVC inter-coding. First, we establish a large-scale database for inter-mode VVC, comprising diverse CU partition patterns from more than 800 raw video sequences across various resolutions and contents. Next, we propose a deep neural network model with a Transformer-based temporal topology for predicting the CU partition, named as TCP-Net, which is adaptive to the group of pictures (GOP) hierarchy in VVC. Then, we design a two-stage structured output for TCP-Net, reflecting both the locations of CU edges and the split modes of all possible CUs. Accordingly, we develop a dual-supervised optimization mechanism to train the TCP-Net model with improved accuracy. The experimental results have verified that our approach can reduce the encoding time by <inline-formula> <tex-math>$46.89sim 55.91$ </tex-math></inline-formula>% with negligible rate-distortion (RD) degradation, outperforming other state-of-the-art approaches.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1133-1148"},"PeriodicalIF":0.0,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143071925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-30 | DOI: 10.1109/TIP.2025.3533213
Jiqing Zhang;Malu Zhang;Yuanchen Wang;Qianhui Liu;Baocai Yin;Haizhou Li;Xin Yang
Brain-inspired Spiking Neural Networks (SNNs) work in an event-driven manner and have an implicit recurrence in neuronal membrane potential to memorize information over time, making them inherently suitable for handling temporal event-based streams. Despite their temporal nature and recent advancements, these methods have predominantly been assessed on event-based classification tasks. In this paper, we explore the utility of SNNs for event-based tracking tasks. Specifically, we propose a brain-inspired adaptive Leaky Integrate-and-Fire neuron (BA-LIF) that can adaptively adjust the membrane time constant according to the inputs, thereby accelerating the leakage of meaningless noise features and reducing the decay of valuable information. SNNs composed of our proposed BA-LIF neurons can achieve high performance without careful and time-consuming trial-and-error initialization of the membrane time constant. The adaptive capability of our network is further improved by introducing an extra temporal feature aggregator (TFA) that assigns attention weights over the temporal dimension. Extensive experiments on various event-based tracking datasets validate the effectiveness of our proposed method. We further validate the generalization capability of our method by applying it to other event-classification tasks.
{"title":"Spiking Neural Networks With Adaptive Membrane Time Constant for Event-Based Tracking","authors":"Jiqing Zhang;Malu Zhang;Yuanchen Wang;Qianhui Liu;Baocai Yin;Haizhou Li;Xin Yang","doi":"10.1109/TIP.2025.3533213","DOIUrl":"10.1109/TIP.2025.3533213","url":null,"abstract":"The brain-inspired Spiking Neural Networks (SNNs) work in an event-driven manner and have an implicit recurrence in neuronal membrane potential to memorize information over time, which are inherently suitable to handle temporal event-based streams. Despite their temporal nature and recent approaches advancements, these methods have predominantly been assessed on event-based classification tasks. In this paper, we explore the utility of SNNs for event-based tracking tasks. Specifically, we propose a brain-inspired adaptive Leaky Integrate-and-Fire neuron (BA-LIF) that can adaptively adjust the membrane time constant according to the inputs, thereby accelerating the leakage of meaningless noise features and reducing the decay of valuable information. SNNs composed of our proposed BA-LIF neurons can achieve high performance without a careful and time-consuming trial-by-error initialization on the membrane time constant. The adaptive capability of our network is further improved by introducing an extra temporal feature aggregator (TFA) that assigns attention weights over the temporal dimension. Extensive experiments on various event-based tracking datasets validate the effectiveness of our proposed method. We further validate the generalization capability of our method by applying it to other event-classification tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1009-1021"},"PeriodicalIF":0.0,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143072295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}