RAM-VQA: Restoration Assisted Multi-modality Video Quality Assessment
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3655117
Pengfei Chen, Jiebin Yan, Rajiv Soundararajan, Giuseppe Valenzise, Cai Li, Leida Li
Video Quality Assessment (VQA) strives to computationally emulate human perceptual judgments and has garnered significant attention given its widespread applicability. However, existing methodologies face two primary impediments: (1) limited proficiency in evaluating samples at quality extremes (e.g., severely degraded or near-perfect videos), and (2) insufficient sensitivity to nuanced quality variations arising from a misalignment with human perceptual mechanisms. Although vision-language models offer promising semantic understanding, their reliance on visual encoders pre-trained for high-level tasks often compromises their sensitivity to low-level distortions. To surmount these challenges, we propose the Restoration-Assisted Multi-modality VQA (RAM-VQA) framework. Uniquely, our approach leverages video restoration as a proxy to explicitly model distortion-sensitive features. The framework operates through two synergistic stages: a prompt learning stage that constructs a quality-aware textual space using triple-level references (degraded, restored, and pristine) derived from the restoration process, and a dual-branch evaluation stage that integrates semantic cues with technical quality indicators via spatio-temporal differential analysis. Extensive experiments demonstrate that RAM-VQA achieves state-of-the-art performance across diverse benchmarks, exhibiting superior capability in handling extreme-quality content while ensuring robust generalization.
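For orientation, prompt-based quality scoring of the kind described here is often realized in a CLIP-like fashion: a pooled video embedding is compared against learned text embeddings for a few quality levels, and the softmax-weighted similarities yield a score. The sketch below is illustrative only and is not the authors' implementation; the names (video_feat, level_embeds, level_scores) and the three-level setup mirroring the degraded/restored/pristine references are placeholders chosen here.

```python
import torch
import torch.nn.functional as F

def prompt_quality_score(video_feat: torch.Tensor,
                         level_embeds: torch.Tensor,
                         level_scores: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Score a video by softmax-weighted similarity to quality-level prompts.

    video_feat:   (B, D) pooled spatio-temporal video embedding
    level_embeds: (K, D) learned text embeddings, e.g. K=3 for
                  degraded / restored / pristine reference levels
    level_scores: (K,)   scalar quality anchors assigned to each level
    """
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(level_embeds, dim=-1)
    logits = v @ t.t() / temperature   # (B, K) cosine similarities
    probs = logits.softmax(dim=-1)     # soft assignment to quality levels
    return probs @ level_scores        # (B,) expected quality score

# Toy usage with random placeholder features.
score = prompt_quality_score(torch.randn(2, 512),
                             torch.randn(3, 512),
                             torch.tensor([0.0, 0.5, 1.0]))
```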
{"title":"RAM-VQA: Restoration Assisted Multi-modality Video Quality Assessment.","authors":"Pengfei Chen,Jiebin Yan,Rajiv Soundararajan,Giuseppe Valenzise,Cai Li,Leida Li","doi":"10.1109/tip.2026.3655117","DOIUrl":"https://doi.org/10.1109/tip.2026.3655117","url":null,"abstract":"Video Quality Assessment (VQA) strives to computationally emulate human perceptual judgments and has garnered significant attention given its widespread applicability. However, existing methodologies face two primary impediments: (1) limited proficiency in evaluating samples at quality extremes (e.g., severely degraded or near-perfect videos), and (2) insufficient sensitivity to nuanced quality variations arising from a misalignment with human perceptual mechanisms. Although vision-language models offer promising semantic understanding, their reliance on visual encoders pre-trained for high-level tasks often compromises their sensitivity to low-level distortions. To surmount these challenges, we propose the Restoration-Assisted Multi-modality VQA (RAM-VQA) framework. Uniquely, our approach leverages video restoration as a proxy to explicitly model distortion-sensitive features. The framework operates through two synergistic stages: a prompt learning stage that constructs a quality-aware textual space using triple-level references (degraded, restored, and pristine) derived from the restoration process, and a dual-branch evaluation stage that integrates semantic cues with technical quality indicators via spatio-temporal differential analysis. Extensive experiments demonstrate that RAM-VQA achieves state-of-the-art performance across diverse benchmarks, exhibiting superior capability in handling extreme-quality content while ensuring robust generalization.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"284 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146034071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Post-Processing Geometry Enhancement for G-PCC Compressed LiDAR via Cylindrical Densification
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3653212
Wang Liu, Zhuangzi Li, Ge Li, Siwei Ma, Sam Kwong, Wei Gao
The geometry-based point cloud compression (G-PCC) algorithm achieves efficient compression and transmission for LiDAR point clouds with high sparsity. However, the low-bitrate mode results in severe geometry compression artifacts, which involve both point reduction and coordinate offset. To the best of our knowledge, this is the first attempt to directly enhance the geometry quality of compressed LiDAR point clouds (CLGE) in a post-processing manner. Our proposed method consists of two branches: cylindrical densification and adaptive refinement. The former adopts a multi-scale sparse convolution framework to effectively extract spatial features in the cylindrical coordinate system and quickly generate dense candidate points. Large asymmetric sparse convolution kernels are also designed to capture the shapes of different regions and objects. The latter branch refines the candidate points through several MLP layers, taking into account the neighborhood features between the candidate points and the input points. Finally, the designed ring-based farthest point resampling serves as an effective alternative for reaching the target point count while maintaining the geometry distribution. Extensive experiments conducted on several datasets verify the effectiveness of our approach under different compression artifact levels. Furthermore, our method is easily extended to upsampling and is robust to noise. In addition to improving geometry signal quality, point clouds enhanced by our method alleviate the performance degradation in the object detection task caused by compression distortion.
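As a point of reference, the cylindrical coordinate system mentioned here maps each LiDAR point (x, y, z) to (radius, azimuth, height), which matches the ring-like scan pattern of spinning LiDARs. The conversion and a voxel-index quantization can be sketched as below; the bin sizes are placeholder values of my own, not the paper's settings.

```python
import numpy as np

def to_cylindrical(xyz: np.ndarray) -> np.ndarray:
    """Convert Cartesian LiDAR points (N, 3) to cylindrical (rho, phi, z)."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    rho = np.hypot(x, y)     # radial distance from the sensor axis
    phi = np.arctan2(y, x)   # azimuth angle in (-pi, pi]
    return np.stack([rho, phi, z], axis=1)

def cylindrical_voxel_index(cyl: np.ndarray,
                            rho_step: float = 0.2,
                            phi_step: float = np.deg2rad(0.5),
                            z_step: float = 0.1) -> np.ndarray:
    """Quantize cylindrical coordinates into integer voxel indices.

    Bin sizes are illustrative; indices for phi and z can be negative
    and would be shifted to a non-negative range in practice.
    """
    steps = np.array([rho_step, phi_step, z_step])
    return np.floor(cyl / steps).astype(np.int64)

pts = np.random.randn(1000, 3) * 10.0
idx = cylindrical_voxel_index(to_cylindrical(pts))
```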
{"title":"Post-Processing Geometry Enhancement for G-PCC Compressed LiDAR via Cylindrical Densification.","authors":"Wang Liu,Zhuangzi Li,Ge Li,Siwei Ma,Sam Kwong,Wei Gao","doi":"10.1109/tip.2026.3653212","DOIUrl":"https://doi.org/10.1109/tip.2026.3653212","url":null,"abstract":"The geometry-based point cloud compression algorithm achieves efficient compression and transmission for LiDAR point clouds with high sparsity. However, the low-bitrate mode results in severe geometry compression artifacts, which involve both point reduction and coordinate offset. To the best of our knowledge, this is the first attempt to directly enhance the geometry quality for compressed LiDAR point cloud (CLGE) in a post-processing manner. Our proposed method consists of two branches: cylindrical densification and adaptive refinement. The former adopts a multi-scale sparse convolution framework to effectively extract spatial features in the cylindrical coordinate system and generate dense candidate points quickly. Large asymmetric sparse convolution kernels are also designed to capture the shapes of different regions and objects. The latter branch refines the candidate points through several MLP layers, which takes the neighborhood features between the candidate points and the input points into account. Finally, the designed ring-based farthest point resampling serves as an effective alternative for achieving the target number while maintaining the geometry distribution. Extensive experiments conducted on several datasets verify the effectiveness of our approach under different compression artifact levels. Furthermore, our method is easily extended to upsampling and is robust to noise. In addition to the geometry signal quality improvement, the point cloud enhanced by our proposed method alleviates the performance degradation in object detection task due to compression distortion.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"216 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146034072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HR-SemNet: A High-Resolution Network for Enhanced Small Object Detection With Local Contextual Semantics
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3654770
Can Peng, Manxin Chao, Ruoyu Li, Zaiqing Chen, Lijun Yun, Yuelong Xia
Using higher-resolution feature maps in the network is an effective approach for detecting small objects. However, high-resolution feature maps lack semantic information. This has led previous methods to downsample feature maps, apply large-kernel convolution layers, and then upsample the feature maps to obtain semantic information. However, these methods have certain limitations. First, large-kernel convolutions in deeper layers typically provide significant global semantic information, but our experiments reveal that such prominent semantic information introduces background smear, which in turn leads to overfitting. Second, deep features often contain substantial redundant information, and the features of small objects are either minimal or have disappeared, which degrades detection performance when relying directly on deep features. To address these issues, we propose a high-resolution network based on local contextual semantics (HR-SemNet). The network is built on the proposed high-resolution backbone (HRB), which replaces the traditional backbone-FPN architecture by focusing all computational resources of large-kernel convolutions on high-resolution feature layers to capture clearer features of small objects. Additionally, a local context semantic module (LCSM) is employed to extract semantic information from the background, confining the semantic extraction to a local window to avoid interference from large-scale backgrounds and objects. HR-SemNet decouples small object semantics from contextual semantics, with HRB and LCSM independently extracting these features. Extensive experiments and comprehensive evaluations on the VisDrone, AI-TOD, and TinyPerson datasets validate the effectiveness of the method. On the VisDrone dataset, which contains a large number of small objects, HR-SemNet improves the mean average precision (mAP) by 4.6%, reduces the computational cost (GFLOPs) by 49.9%, and decreases the parameter count by 94.9%.
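To illustrate the idea of confining semantic extraction to a local window, a schematic block (my own simplification, not the paper's LCSM) could gather background context with window-limited pooling and fuse it back into the high-resolution map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalContext(nn.Module):
    """Schematic local-context block: each position sees background semantics
    only from a window x window neighbourhood, never from the whole image."""
    def __init__(self, channels: int, window: int = 11):
        super().__init__()
        self.window = window
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pad = self.window // 2
        # Window-limited average pooling approximates "local" semantics.
        ctx = F.avg_pool2d(x, self.window, stride=1, padding=pad)
        return self.fuse(torch.cat([x, ctx], dim=1))

feat = torch.randn(1, 64, 128, 128)   # high-resolution feature map
out = LocalContext(64)(feat)          # same spatial size as the input
```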
{"title":"HR-SemNet: A High-Resolution Network for Enhanced Small Object Detection With Local Contextual Semantics.","authors":"Can Peng,Manxin Chao,Ruoyu Li,Zaiqing Chen,Lijun Yun,Yuelong Xia","doi":"10.1109/tip.2026.3654770","DOIUrl":"https://doi.org/10.1109/tip.2026.3654770","url":null,"abstract":"Using higher-resolution feature maps in the network is an effective approach for detecting small objects. However, high-resolution feature maps face the challenge of lacking semantic information. This has led previous methods to rely on downsampling feature maps, applying large-kernel convolution layers, and then upsampling the feature maps to obtain semantic information. However, these methods have certain limitations: first, large kernel convolutions in deeper layers typically provide significant global semantic information, but our experiments reveal that such prominent semantic information introduces background smear, which in turn leads to overfitting. Second, deep features often contain substantial redundant information, and the features of small objects are either minimal or have disappeared, which causes a degradation in detection performance when directly relying on deep features. To address these issues, we propose a high-resolution network based on local contextual semantics (HR-SemNet). The network is built on the proposed high-resolution backbone (HRB), which replaces the traditional backbone-FPN architecture by focusing all computational resources of large kernel convolutions on highresolution feature layers to capture clearer features of small objects. Additionally, a local context semantic module (LCSM) is employed to extract semantic information from the background, confining the semantic extraction to a local window to avoid interference from large-scale backgrounds and objects. HRSemNet decouples small object semantics from contextual semantics, with HRB and LCSM independently extracting these features. Extensive experiments and comprehensive evaluations on the VisDrone, AI-TOD, and TinyPerson datasets validate the effectiveness of the method. On the VisDrone dataset, which contains a large number of small objects, HR-SemNet improves the mean average precision (mAP) by 4.6%, reduces the computational cost (GFLOPs) by 49.9%, and decreases the parameter count by 94.9%.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"31 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146034074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unc-SOD: An Uncertainty Learning Framework for Small Object Detection
Pub Date: 2026-01-22 | DOI: 10.1109/tip.2026.3654892
Xiang Yuan, Gong Cheng, Jiacheng Cheng, Ruixiang Yao, Junwei Han
Small object detection (SOD) is a notable yet highly challenging task, owing to the restricted informative regions of size-limited instances, which in turn produce a degree of uncertainty beyond the capacity of current two-stage detectors. Specifically, the intrinsic ambiguity of small objects undermines prevailing sampling paradigms and may mislead the model into devoting futile effort to unrecognizable targets, while the inconsistency between the features used for detection at the two stages further exposes hierarchical uncertainty. In this paper, we develop an uncertainty learning framework for small object detection, dubbed Unc-SOD. By adding an auxiliary uncertainty branch to the conventional Region Proposal Network (RPN), we model instance-level indeterminacy, which then serves as a surrogate criterion for sampling, thereby dynamically unearthing adequate candidates according to their varying degrees of uncertainty and facilitating the learning of the proposal network. In parallel, a Perception-and-Interaction strategy is devised to capture rich and discriminative representations by optimizing the intrinsic properties of the regional features from the original pyramid level and the assigned one, with the perceptual process unfolding in a mutual paradigm. As the first attempt to model uncertainty in the SOD task, our Unc-SOD yields state-of-the-art performance on two large-scale small object detection benchmarks, SODA-D and SODA-A, and results on several SOD-oriented datasets, including COCO, VisDrone, and Tsinghua-Tencent 100K, also show clear gains over the baseline detector. This underscores the efficacy of our approach and its superiority over prevailing detectors when dealing with small instances.
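For intuition, an auxiliary uncertainty output on an RPN-style head, together with an uncertainty-aware selection step, might look like the sketch below. This is a generic construction under my own assumptions (the module name, the extra 1x1 convolution, and the keep-ratio rule are placeholders), not the authors' Unc-SOD code.

```python
import torch
import torch.nn as nn

class UncertainRPNHead(nn.Module):
    """Standard RPN head plus an auxiliary per-anchor uncertainty logit."""
    def __init__(self, in_ch: int, num_anchors: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.cls = nn.Conv2d(in_ch, num_anchors, 1)      # objectness scores
        self.reg = nn.Conv2d(in_ch, num_anchors * 4, 1)  # box deltas
        self.unc = nn.Conv2d(in_ch, num_anchors, 1)      # uncertainty estimates

    def forward(self, x):
        h = self.conv(x).relu()
        return self.cls(h), self.reg(h), self.unc(h)

def select_by_uncertainty(scores, uncertainty, keep_ratio=0.5):
    """Keep the lower-uncertainty fraction of candidates (surrogate sampling rule)."""
    k = max(1, int(scores.numel() * keep_ratio))
    idx = uncertainty.flatten().topk(k, largest=False).indices
    return scores.flatten()[idx], idx
```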
{"title":"Unc-SOD: An Uncertainty Learning Framework for Small Object Detection.","authors":"Xiang Yuan,Gong Cheng,Jiacheng Cheng,Ruixiang Yao,Junwei Han","doi":"10.1109/tip.2026.3654892","DOIUrl":"https://doi.org/10.1109/tip.2026.3654892","url":null,"abstract":"Small object detection (SOD) constitutes a notable yet immensely arduous task, stemming from the restricted informative regions inherent in size-limited instances, which further sparks off heightened uncertainty beyond the capacity of current two-stage detectors. Specifically, the intrinsic ambiguity in small objects undermines the prevailing sampling paradigms and may mislead the model to devote futile effort to those unrecognizable targets, while the inconsistency of features utilized for the detection at two stages further exposes the hierarchical uncertainty. In this paper, we develop an Uncertainty learning framework for Small Object Detection, dubbed as Unc-SOD. By incorporating an auxiliary uncertainty branch to conventional Region Proposal Network (RPN), we model the indeterminacy at instance-level which later on serves as a surrogate criterion for sampling, thereby unearthing adequate candidates dynamically based on the varying degrees of uncertainty and facilitating the learning of proposal networks. In parallel, a Perception-and-Interaction strategy is devised to capture rich and discriminative representations, through optimizing the intrinsic properties from the regional features at the original pyramid and the assigned one, in which the perceptual process unfolds in a mutual paradigm. As the seminal attempt to model uncertainty in SOD task, our Unc-SOD yields state-of-the-art performance on two large-scale small object detection benchmarks, SODA-D and SODA-A, and the results on several SOD-oriented datasets including COCO, VisDrone, and Tsinghua-Tencent 100K also exhibit the promotion to baseline detector. This underscores the efficacy of our approach and its superiority over prevailing detectors when dealing with small instances.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"1 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video Decoupling Networks for Accurate, Efficient, Generalizable, and Robust Video Object Segmentation
Pub Date: 2026-01-21 | DOI: 10.1109/tip.2025.3649360
Jisheng Dang, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu, Tat-Seng Chua
Video object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and in capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatial-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.
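The per-clip memory update, as opposed to per-frame storage, can be summarized in a few lines. The sketch below is my own simplification: a mean-pooled clip summary stands in for the paper's decoupled scene/motion/instance elements, and only the update schedule is shown.

```python
import torch

class ClipMemory:
    """Store one memory entry per clip of `clip_len` frames instead of one per frame."""
    def __init__(self, clip_len: int = 5, max_entries: int = 8):
        self.clip_len, self.max_entries = clip_len, max_entries
        self.buffer, self.entries = [], []

    def add_frame(self, feat: torch.Tensor) -> None:
        self.buffer.append(feat)
        if len(self.buffer) == self.clip_len:
            # Placeholder clip summary; VDN instead decouples scene/motion/instance cues.
            summary = torch.stack(self.buffer).mean(dim=0)
            self.entries.append(summary)
            self.entries = self.entries[-self.max_entries:]  # bounded memory
            self.buffer.clear()

mem = ClipMemory()
for _ in range(20):
    mem.add_frame(torch.randn(256, 32, 32))
assert len(mem.entries) <= mem.max_entries
```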
{"title":"Video Decoupling Networks for Accurate, Efficient, Generalizable, and Robust Video Object Segmentation.","authors":"Jisheng Dang,Huicheng Zheng,Yulan Guo,Jianhuang Lai,Bin Hu,Tat-Seng Chua","doi":"10.1109/tip.2025.3649360","DOIUrl":"https://doi.org/10.1109/tip.2025.3649360","url":null,"abstract":"object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatial-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"66 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual Domain Optimization Algorithm for CBCT Ring Artifact Correction
Pub Date: 2026-01-15 | DOI: 10.1109/tip.2026.3652008
Yanwei Qin, Xiaohui Su, Xin Lu, Baodi Yu, Yunsong Zhao, Fanyong Meng
Compared to traditional computed tomography (CT), photon-counting detector (PCD)-based CT provides significant advantages, including enhanced CT image contrast and reduced radiation dose. However, owing to the current immaturity of PCD technology, scanned PCD data often contain stripe artifacts resulting from non-functional or defective detector units, which subsequently introduce ring artifacts in reconstructed CT images. The presence of ring artifacts may compromise the accuracy of CT values and even introduce pseudo-structures, thereby reducing the application value of CT images. In this paper, we propose a dual-domain optimization model that exploits the distribution characteristics of stripe artifacts in 3D projection data and the prior features of reconstructed 3D CT images. Specifically, we demonstrate that stripe artifacts in 3D projection data exhibit both group sparsity and low-rank properties. Building on this observation, we propose a TLT (TV-l2,1-Tucker) model to eliminate ring artifacts in PCD-based cone-beam CT (CBCT). In addition, an efficient iterative algorithm is designed to solve the proposed model. The effectiveness of both the model and the algorithm is evaluated through simulated and real-data experiments. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches.
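The abstract names the priors (total variation on the volume, group sparsity and Tucker low-rankness on the stripe artifacts) but not the exact objective. Purely for orientation, a generic dual-domain formulation consistent with that description could take the following form; the regularization weights, constraint structure, and Tucker-rank bounds below are my own placeholders, not the paper's TLT model.

```latex
\min_{\mathcal{X},\,\mathcal{S}}\;
  \mathrm{TV}(\mathcal{X}) \;+\; \lambda\,\lVert \mathcal{S} \rVert_{2,1}
\quad \text{s.t.} \quad
  \mathcal{A}(\mathcal{X}) + \mathcal{S} = \mathcal{P},
\qquad
  \operatorname{rank}\!\big(\mathcal{S}_{(n)}\big) \le r_{n},\; n = 1,2,3,
```

where $\mathcal{P}$ is the measured 3D projection data, $\mathcal{S}$ the stripe-artifact component, $\mathcal{X}$ the reconstructed CBCT volume, $\mathcal{A}$ the forward projection operator, and the mode-$n$ unfolding rank bounds encode the Tucker low-rank prior; the actual TLT objective and its iterative solver are specified in the paper.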
{"title":"Dual Domain Optimization Algorithm for CBCT Ring Artifact Correction.","authors":"Yanwei Qin,Xiaohui Su,Xin Lu,Baodi Yu,Yunsong Zhao,Fanyong Meng","doi":"10.1109/tip.2026.3652008","DOIUrl":"https://doi.org/10.1109/tip.2026.3652008","url":null,"abstract":"Compared to traditional computed tomography (CT), photon-counting detector (PCD)-based CT provides significant advantages, including enhanced CT image contrast and reduced radiation dose. However, owing to the current immaturity of PCD technology, scanned PCD data often contain stripe artifacts resulting from non-functional or defective detector units, which subsequently introduce ring artifacts in reconstructed CT images. The presence of ring artifact may compromise the accuracy of CT values and even introduce pseudo-structures, thereby reducing the application value of CT images. In this paper, we propose a dual-domain optimization model that takes advantage of the distribution characteristics of stripe artifact in 3D projection data and the prior features of reconstructed 3D CT images. Specifically, we demonstrate that stripe artifact in 3D projection data exhibit both group sparsity and low-rank properties. Building on this observation, we propose a TLT (TV-l2,1- Tucker) model to eliminate ring artifact in PCD-based cone beam CT (CBCT). In addition, an efficient iterative algorithm is designed to solve the proposed model. The effectiveness of both the model and the algorithm is evaluated through simulated and real data experiments. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"81 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BELE: Blur Equivalent Linearized Estimator
Pub Date: 2026-01-13 | DOI: 10.1109/tip.2026.3651959
Paolo Giannitrapani, Elio D. Di Claudio, Giovanni Jacovitti
{"title":"BELE: Blur Equivalent Linearized Estimator","authors":"Paolo Giannitrapani, Elio D. Di Claudio, Giovanni Jacovitti","doi":"10.1109/tip.2026.3651959","DOIUrl":"https://doi.org/10.1109/tip.2026.3651959","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"37 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145961757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}