Intermediate flow estimation is an important part of video frame interpolation (VFI). Most previous works derive the intermediate flow by interpolating the inter-frame flows under a localized linear-motion assumption. However, this approach is not effective when dealing with extreme motion. In this work, we assume that the motion trajectory of an object is determined by its appearance characteristics. Based on this assumption, we propose a new intermediate flow estimation method that obtains the motion features of intermediate frames from image appearance and inter-frame motion features. In addition, to fully extract inter-frame features, we rethink how VFI differs from previous uses of the Swin Transformer and compute appearance and motion features within an adaptive neighborhood by cyclically shifting the window. Experimental results show that our method achieves state-of-the-art performance on different datasets for both fixed-time and arbitrary-time interpolation. Moreover, it outperforms models that require a sequence of four input frames when handling videos with extremely large motion. The source code is available at https://github.com/chen12304/IFE-VFI
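For context, the localized linear-motion baseline that the abstract argues breaks down under extreme motion amounts to rescaling the inter-frame flow by the target timestep. Below is a minimal NumPy sketch of that conventional assumption (illustrative names; this is the baseline, not the paper's appearance-based estimator):

import numpy as np

def linear_intermediate_flow(flow_0to1, t):
    # flow_0to1: (H, W, 2) optical flow from frame 0 to frame 1.
    # Under the localized linear-motion assumption, the flows from the
    # unknown intermediate frame at time t in (0, 1) back to frames 0 and 1
    # are simple rescalings of the inter-frame flow.
    flow_t_to_0 = -t * flow_0to1
    flow_t_to_1 = (1.0 - t) * flow_0to1
    return flow_t_to_0, flow_t_to_1

# Example: midpoint frame (t = 0.5) on a dummy 240x320 flow field.
f_t0, f_t1 = linear_intermediate_flow(np.zeros((240, 320, 2), np.float32), 0.5)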
{"title":"Video Frame Interpolation via Appearance-Based Intermediate Flow Estimation","authors":"Keyi Chen;Jingwei Xin;Nannan Wang;Jie Li;Xinbo Gao","doi":"10.1109/TIP.2026.3666772","DOIUrl":"10.1109/TIP.2026.3666772","url":null,"abstract":"Intermediate flow estimation is an important part of video frame interpolation (VFI). Most previous works use interpolation to derive the intermediate flow assuming localized linear motion. However, this method is not effective when dealing with extreme motions. In this work, we assume that the motion trajectory of an object is determined by the appearance characteristics of this object. Based on this assumption, we propose a new intermediate flow estimation method, which obtains the motion features of intermediate frames from image appearance and inter-frame motion features. In addition, in order to fully extract the inter-frame features, we rethink the difference of VFI and previous works on using Swin-Transformer and compute the appearance features and motion features within the adaptive neighborhood by cyclically shifting the window. Experimental results show that our method achieves state-of-the-art performance on different datasets for both fixed-time and arbitrary-time interpolation. Moreover, our proposed method outperforms models that require inputting a sequence of four frames when handling videos with extremely large motion. The source code is available from <uri>https://github.com/chen12304/IFE-VFI</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2335-2349"},"PeriodicalIF":13.7,"publicationDate":"2026-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147313731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To efficiently assist humans in various tasks, it is crucial to accurately decode and understand the rich information embedded in the brain's visual cognition. Existing brain-driven research often fails to overcome the challenge of small target data domains, and the lack of explicit semantic, spatial, and other constraints on feature extractors prevents brain decoding models from learning uniform cross-domain representations, degrading their performance in unseen domains. To overcome these limitations, we propose DAMind, a multimodal EEG-based model for robust visual cross-domain alignment and decoding. Our approach integrates a vision-language model (VLM) with brain-inspired cognitive mechanisms, leveraging its strong image-text representation abilities to learn both fine-grained primary visual features and high-level semantic concepts from neural signals, and provides effective visual fine-tuning through a visual guidance mechanism. DAMind introduces a stepwise EEG encoding process aligned with visual processing and employs an instruction-based learning strategy for effective cross-domain zero-shot transfer. Its robust architecture efficiently achieves good generalization, enabling EEG signals from multiple domains to be mapped to a unified learning domain. We construct a comprehensive EEG decoding benchmark, EBench, on which DAMind achieves state-of-the-art results on several visual tasks and outperforms the baseline in the zero-shot setting.
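The abstract does not spell out the alignment objective; one common way to exploit a VLM's image-text space is a CLIP-style contrastive loss between EEG embeddings and the VLM's visual embeddings. The PyTorch sketch below shows only that generic idea (an assumed illustration, not DAMind's actual objective):

import torch
import torch.nn.functional as F

def clip_style_alignment_loss(eeg_emb, vis_emb, temperature=0.07):
    # eeg_emb, vis_emb: (B, D) paired embeddings from an EEG encoder and a
    # (frozen) vision-language model; row i of each tensor describes the
    # same visual stimulus.
    eeg = F.normalize(eeg_emb, dim=-1)
    vis = F.normalize(vis_emb, dim=-1)
    logits = eeg @ vis.t() / temperature                    # (B, B) similarities
    targets = torch.arange(eeg.size(0), device=eeg.device)
    # Symmetric InfoNCE: match each EEG segment to its image and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))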
{"title":"DAMind: Zero-shot Visual Cross-Domain Alignment and Representation for EEG Decoding.","authors":"Haodong Jing, Yongqiang Ma, Panqi Yang, Haoyu Li, Shuai Huang, Badong Chen, Nanning Zheng","doi":"10.1109/TIP.2026.3666730","DOIUrl":"https://doi.org/10.1109/TIP.2026.3666730","url":null,"abstract":"<p><p>To efficiently assist humans in various tasks, it is crucial to accurately decode and understand the rich information embedded in brain's visual cognition. Existing brain-driven research often fails to overcome the challenge of small target data domains, and the lack of explicit semantic, spatial, and other information constraints on feature extractors prevents brain decoding models from learning uniform cross-domain representations, leading to degradation of their performance in unseen domains. To overcome these limitations, we propose DAMind, a multimodal EEG-based model for robust visual cross-domain alignment and decoding. Our approach integrates VLM with brain-inspired cognitive mechanisms, leveraging the strong image-text representation abilities to learn both fine-grained primary visual features and high-level semantic concepts from neural signals, provide effective visual fine-tuning using the visual guidance mechanism. DAMind introduces a stepwise EEG encoding process aligned with visual processing, and employs an instruction-based learning strategy for effective cross-domain zero-shot transfer. Its robust architecture efficiently achieves good generalization performance, enabling the mapping of EEG signals from multiple domains to a unified learning domain. We construct a comprehensive EEG decoding benchmark EBench, DAMind achieves state-of-the-art results on several visual tasks, and outperforms the baseline in zero-shot setting.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147313714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging, especially in dynamic scenarios. Thus, numerous methods have been proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking that builds long-range temporal dependencies without customized modules or substantial computational costs. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.
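The hidden-state propagation described above follows the general state space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t. A minimal diagonal, non-selective PyTorch sketch of that recurrence follows (SMTrack's selective, state-wise parameterization is not reproduced here):

import torch

def ssm_scan(x, A, B, C):
    # x: (T, D) per-frame features; A, B, C: (D,) diagonal parameters.
    # Each frame interacts with all previously tracked frames only through
    # the propagated hidden state h, giving linear cost in sequence length.
    h = torch.zeros_like(x[0])
    ys = []
    for x_t in x:
        h = A * h + B * x_t          # propagate and update the hidden state
        ys.append(C * h)             # read out a temporally contextualized feature
    return torch.stack(ys)           # (T, D)

# Example: 8 frames of 4-dim features with simple decay dynamics.
y = ssm_scan(torch.randn(8, 4), torch.full((4,), 0.9), torch.ones(4), torch.ones(4))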
{"title":"SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking","authors":"Yinchao Ma;Dengqing Yang;Zhangyu He;Wenfei Yang;Tianzhu Zhang","doi":"10.1109/TIP.2026.3661393","DOIUrl":"10.1109/TIP.2026.3661393","url":null,"abstract":"Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which releases computational costs of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2249-2261"},"PeriodicalIF":13.7,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147277979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-20. DOI: 10.1109/TIP.2026.3661392
Yao Xiao;Pengxu Wei;Guangrun Wang;Cong Liu;Liang Lin
A few recent works attempt to train an adversarially robust Unsupervised Domain Adaptation (UDA) model, transferring the robustness from a robust source model or other robust pre-trained models to an unlabeled target domain. However, it is usually impractical to assume the availability of robust source models or robust pre-training, and meanwhile, source data are not always accessible or efficient for adaptation training in many real-world scenarios. In this paper, we dive into a more practical and challenging problem of robust source-free domain adaptation: can we train a robust model on an unlabeled target domain given only a non-robust source model (without source data)? Empirically, we find that applying adversarial training (AT) to the self-supervised adaptation process leads to severe model degradation, as it tends to amplify the inevitable errors of UDA models. To tackle this issue, we propose a novel approach called Source-Free Alternating Optimization (SFAO), which employs a non-robust target model to provide better guidance for the AT of the desired robust target model. The two models are trained in an alternating manner to minimize the discrepancy between the clean source domain and the adversarial target domain. Moreover, we propose Softly-Constrained Adversarial Training (SCAT) to further mitigate the adverse effects of incorrect pseudo-labels in AT. Extensive experimental results demonstrate that the proposed method significantly improves the model performance on both clean and adversarial data. Source code is available at: https://github.com/Coxy7/robust-SFDA.
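As a rough illustration of letting a non-robust target model guide adversarial training (AT) of the robust one, the sketch below crafts PGD examples against the robust model using pseudo-labels supplied by the clean model. It is a generic AT step under assumed hyperparameters, not the paper's SFAO/SCAT formulation:

import torch
import torch.nn.functional as F

def guided_at_loss(robust_model, clean_model, x, eps=8/255, alpha=2/255, steps=5):
    # Pseudo-labels come from the clean, non-robust target model; no source
    # data or robust pre-training is assumed.
    with torch.no_grad():
        pseudo = clean_model(x).argmax(dim=1)
    # PGD inner maximization against the robust model.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(robust_model(x_adv), pseudo)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    # Outer minimization: the robust model is trained on the adversarial batch.
    return F.cross_entropy(robust_model(x_adv), pseudo)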
{"title":"Robust Source-Free Domain Adaptation From Non-Robust Source Models","authors":"Yao Xiao;Pengxu Wei;Guangrun Wang;Cong Liu;Liang Lin","doi":"10.1109/TIP.2026.3661392","DOIUrl":"10.1109/TIP.2026.3661392","url":null,"abstract":"A few recent works attempt to train an adversarially robust Unsupervised Domain Adaptation (UDA) model, transferring the robustness from a robust source model or other robust pre-trained models to an unlabeled target domain. However, it is usually impractical to assume the availability of robust source models or robust pre-training, and meanwhile, source data are not always accessible or efficient for adaptation training in many real-world scenarios. In this paper, we dive into a more practical and challenging problem of robust source-free domain adaptation: can we train a robust model on an unlabeled target domain given only a non-robust source model (without source data)? Empirically, we find that applying adversarial training (AT) to the self-supervised adaptation process leads to severe model degradation, as it tends to amplify the inevitable errors of UDA models. To tackle this issue, we propose a novel approach called Source-Free Alternating Optimization (SFAO), which employs a non-robust target model to provide better guidance for the AT of the desired robust target model. The two models are trained in an alternating manner to minimize the discrepancy between the clean source domain and the adversarial target domain. Moreover, we propose Softly-Constrained Adversarial Training (<monospace>SCAT</monospace>) to further mitigate the adverse effects of incorrect pseudo-labels in AT. Extensive experimental results demonstrate that the proposed method significantly improves the model performance on both clean and adversarial data. Source code is available at: <uri>https://github.com/Coxy7/robust-SFDA</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2350-2363"},"PeriodicalIF":13.7,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-20. DOI: 10.1109/TIP.2026.3664762
Aditya Panda;Dipti Prasad Mukherjee
Partially supervised Compositional Zero-Shot Learning (pCZSL) recognizes new compositions of states and objects, where for every image in the training set either the state or the object annotation is available. In pCZSL, the features of a state vary depending on the object in the composition (e.g., the features of the state ripe differ between ripe banana and ripe apple). Understanding how features vary across object scales is another key challenge. In the proposed architecture, a Swin Transformer-based Hierarchical Feature Extractor (HFE) captures the large range of semantic interactions between state and object features. A Discriminative Context Aggregation module utilizes features from the intermediate layers of the HFE to understand object features at their corresponding scales. To leverage the partially labeled data in pCZSL, we pass strongly and weakly augmented versions of the input image through the proposed architecture. The predicted class probabilities for the strongly and weakly augmented images are encouraged to be similar by minimizing a distribution alignment loss. This loss incorporates a class-specific re-weighting approach to alleviate the effect of data imbalance in pCZSL. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed approach.
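The distribution alignment loss with class re-weighting can be sketched generically as a weighted cross-entropy between the strong-view prediction and the detached weak-view prediction; the PyTorch snippet below is an illustration under assumed conventions, not the paper's exact loss:

import torch
import torch.nn.functional as F

def class_balanced_alignment_loss(logits_weak, logits_strong, class_weights):
    # logits_weak / logits_strong: (B, K) predictions for the weakly and
    # strongly augmented views; class_weights: (K,) re-weighting factors,
    # e.g. inversely proportional to class frequency.
    target = F.softmax(logits_weak, dim=1).detach()    # soft target from the weak view
    log_pred = F.log_softmax(logits_strong, dim=1)
    per_sample = -(target * log_pred).sum(dim=1)       # cross-entropy to the soft target
    w = class_weights[target.argmax(dim=1)]            # weight by the (pseudo) class
    return (w * per_sample).mean()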
{"title":"Partially Supervised Compositional Zero-Shot Learning by Class-Balanced Distribution Alignment","authors":"Aditya Panda;Dipti Prasad Mukherjee","doi":"10.1109/TIP.2026.3664762","DOIUrl":"10.1109/TIP.2026.3664762","url":null,"abstract":"The partially supervised Compositional Zero-Shot Learning (pCZSL) recognizes <italic>new</i> compositions of states and objects, where for every image in the training set either the state or the object annotation is available. In pCZSL, features of a state vary depending on the object in the composition (e.g. the features of state <italic>ripe</i> are different for <italic>ripe banana</i> and <italic>ripe apple</i>). Understanding the variation in features across scales of objects is also a key challenge. In the proposed architecture, a <italic>swin</i> transformer based Hierarchical Feature Extractor (HFE) captures the large range of semantic interactions between state and object features. The Discriminative Context Aggregation module utilizes features from the intermediate layers of the HFE to understand the features of object at their corresponding scales. To leverage the partially labeled data in pCZSL, we pass <italic>strongly</i> and <italic>weakly</i> augmented versions of the input image to the proposed architecture. The predicted class probabilities for <italic>strongly</i> and <italic>weakly</i> augmented images are encouraged to be similar, minimizing a <italic>distribution alignment</i> loss. This loss incorporates class specific re-weighting approach to alleviate the effect of data imbalance for pCZSL. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed approach.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2484-2498"},"PeriodicalIF":13.7,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised domain adaptive object detection methods enhance model robustness in the target domain without requiring target-domain annotations. Despite notable progress, existing methods face two major challenges: 1) insufficient and inefficient learning of holistic feature consistency due to cumbersome pixel-level style matching and semantic discrepancy elimination between domains as well as the overlooking of their collaborative effect; and 2) unreliable learning of category feature compactness caused by poor-quality target-domain samples, inaccurate pseudo-labels and noisy cross-domain contrast paradigms. To address these challenges, we propose a novel Semantic Consistency and Compactness Learning (SCCL) network. For consistency learning, we introduce a Visual Adaptation-guided Semantic Alignment (VSA) module that achieves style matching through simple feature adaptation and incorporates a novel adversarial-free self-supervised method for feature disentanglement. The collaboration between these two aspects enables sufficient and efficient consistency learning. For reliable compactness learning, we develop a plug-and-play Instance Center-Contrastive (ICC) head that, for the first time, comprehensively addresses all three potential causes of unreliable learning through three integrated innovations, concerning sample pseudo-label quality enhancement, reliable sample storage and updating, and a robust sample contrast paradigm. Besides, the mutual reinforcement effect of VSA and ICC simultaneously enhances feature transferability and discriminability. Extensive experiments across four UDA object detection benchmarks with two baselines show that SCCL achieves superior adaptability and robustness. Code will be available at https://github.com/TooZE23/SCCL.
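A generic instance center-contrastive objective, the kind of term the ICC head builds on, can be sketched as below (PyTorch); the pseudo-label quality enhancement, reliable center storage and updating, and robust contrast paradigm described above are omitted:

import torch
import torch.nn.functional as F

def center_contrastive_loss(inst_feats, labels, centers, tau=0.1):
    # inst_feats: (N, D) instance/RoI features; labels: (N,) class indices
    # (pseudo-labels on the target domain); centers: (K, D) per-class centers.
    f = F.normalize(inst_feats, dim=1)
    c = F.normalize(centers, dim=1)
    logits = f @ c.t() / tau                 # similarity of each instance to every center
    return F.cross_entropy(logits, labels)   # own-class center acts as the positive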
{"title":"Unsupervised Domain Adaptive Object Detection via Semantic Consistency and Compactness Learning","authors":"Yajing Liu;Zhen Zhang;Yiming Su;Chunhui Hao;Xiyao Liu;Jiandong Tian","doi":"10.1109/TIP.2026.3663935","DOIUrl":"10.1109/TIP.2026.3663935","url":null,"abstract":"Unsupervised domain adaptive object detection methods enhance model robustness in the target domain without requiring target-domain annotations. Despite notable progress, existing methods face two major challenges: 1) insufficient and inefficient learning of holistic feature consistency due to cumbersome pixel-level style matching and semantic discrepancy elimination between domains as well as the overlooking of their collaborative effect; and 2) unreliable learning of category feature compactness caused by poor-quality target-domain samples, inaccurate pseudo-labels and noisy cross-domain contrast paradigms. To address these challenges, we propose a novel Semantic Consistency and Compactness Learning (SCCL) network. For consistency learning, we introduce a Visual Adaptation-guided Semantic Alignment (VSA) module that achieves style matching through simple feature adaptation and incorporates a novel adversarial-free self-supervised method for feature disentanglement. The collaboration between these two aspects enables sufficient and efficient consistency learning. For reliable compactness learning, we develop a plug-and-play Instance Center-Contrastive (ICC) head that, for the first time, comprehensively addresses all three potential causes of unreliable learning through three integrated innovations, concerning sample pseudo-label quality enhancement, reliable sample storage and updating, and a robust sample contrast paradigm. Besides, the mutual reinforcement effect of VSA and ICC simultaneously enhances feature transferability and discriminability. Extensive experiments across four UDA object detection benchmarks with two baselines show that SCCL achieves superior adaptability and robustness. Code will be available at <uri>https://github.com/TooZE23/SCCL</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2276-2291"},"PeriodicalIF":13.7,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction from multiple modality images, including visible, depth, and thermal images, with only a few annotated samples. However, some existing efforts treat the three modalities equally and do not account for their inherent differences. Besides, objects vary greatly in size, and cutting-edge matching paradigms fail to establish an effective support-query connection. Therefore, we propose a novel scale-invariant feature matching network (i.e., SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, to conduct visible-depth-thermal (V-D-T) few-shot semantic segmentation. Firstly, in the encoder, after extracting multi-level initial features, we fuse each level's RGB feature and thermal feature, yielding the support features and the query features. Secondly, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module is deployed to explore the correlation between each level's support feature and query feature, where pixel-to-patch pooling (PTP-pool) units are designed to build scale-invariant relationships, generating a coarse mask for the query image. Thirdly, in the feature elevation block, we employ a prior-related fusion (PF) module to integrate the depth image with the coarse mask via a cross-attention mechanism, yielding an enhanced coarse prediction that is further aggregated in a bottom-up way. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object interiors and spatial details, and generate the final segmentation results via conventional convolution layers. Extensive experiments are conducted on the VDT-2048-$5^{i}$ dataset, and the results show that our model outperforms state-of-the-art methods by a large margin.
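The pixel-to-patch idea can be made concrete with a bare-bones cross-attention in which every query pixel attends to average-pooled support patches; this sketch assumes H and W are divisible by the patch size and omits the learned projections of the actual PTPCA module:

import torch
import torch.nn.functional as F

def pixel_to_patch_attention(query_feat, support_feat, patch=4):
    # query_feat, support_feat: (B, C, H, W) feature maps.
    B, C, H, W = query_feat.shape
    patches = F.avg_pool2d(support_feat, patch)           # (B, C, H/p, W/p) patch tokens
    k = patches.flatten(2)                                 # (B, C, M)
    q = query_feat.flatten(2)                              # (B, C, H*W) pixel tokens
    attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)   # (B, HW, M)
    out = attn @ k.transpose(1, 2)                         # (B, HW, C) aggregated support
    return out.transpose(1, 2).reshape(B, C, H, W)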
{"title":"Scale-Invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation","authors":"Xiaofei Zhou;Jia Lin;Dongmei Chen;Deyang Liu;Jiyong Zhang;Runmin Cong","doi":"10.1109/TIP.2026.3663882","DOIUrl":"10.1109/TIP.2026.3663882","url":null,"abstract":"Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction from multiple modality images including visible image, depth image, and thermal image with a few annotated samples. However, some efforts treat the three modality information equally, where they don’t incorporate the inherent differences among multiple modalities. Besides, the objects vary in size greatly, and the cutting-edge matching paradigms fail to establish an effective support-query connection. Therefore, we propose a novel scale-invariant feature matching network (i.e., SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, to conduct visible-depth-thermal (V-D-T) few-shot semantic segmentation. Firstly, in the encoder part, after the extraction of multi-level initial features, we fuse each level’s RGB feature and thermal feature, yielding the support features and the query features. Secondly, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module is deployed to explore the correlation between each level’s support feature and the query feature, where the pixel-to-patch pooling (PTP-pool) units are designed to build scale-invariant relationships, generating the coarse mask for the query image. Thirdly, in the feature elevation block, we employ the prior-related fusion (PF) module to integrate the depth image with a coarse mask via the cross-attention mechanism, yielding the enhanced coarse prediction result, which is further aggregated in a bottom-up way. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object internal regions and spatial details, and further generate the final segmentation results via conventional convolution layers. Extensive experiments are conducted on the VDT-2048-<inline-formula> <tex-math>$5^{i}$ </tex-math></inline-formula> dataset, and the results show that our model outperforms the state-of-the-art methods with a large margin.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2198-2209"},"PeriodicalIF":13.7,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-19. DOI: 10.1109/TIP.2026.3663857
Zheng Xing;Weibing Zhao
Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and incorporating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.
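In its generic form, subspace embedding with a temporal regularizer combines a self-expressive reconstruction term with a penalty that pulls the codes of temporally neighboring frames together. A NumPy sketch of that generic objective (not the paper's exact formulation or solver):

import numpy as np

def temporal_self_expressive_objective(X, C, lam=1.0, gamma=1.0):
    # X: (D, T) motion features with frames as columns; C: (T, T) matrix of
    # self-expression codes, one column per frame.
    recon = np.linalg.norm(X - X @ C, 'fro') ** 2              # self-expressiveness
    reg = lam * np.linalg.norm(C, 'fro') ** 2                  # code regularization
    temporal = gamma * np.sum((C[:, 1:] - C[:, :-1]) ** 2)     # neighbors share codes
    return recon + reg + temporal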
{"title":"Temporal Visual Semantics-Induced Human Motion Understanding With Large Language Models","authors":"Zheng Xing;Weibing Zhao","doi":"10.1109/TIP.2026.3663857","DOIUrl":"10.1109/TIP.2026.3663857","url":null,"abstract":"Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and incorporating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2182-2197"},"PeriodicalIF":13.7,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146230245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-18. DOI: 10.1109/TIP.2026.3663930
Avinab Saha;Yu-Chih Chen;Christian Häne;Jean-Charles Bazin;Ioannis Katsavounidis;Alexandre Chapiro;Alan C. Bovik
We present HoloQA, a new state-of-the-art Full Reference Video Quality Assessment (VQA) model that was designed using principles of visual neuroscience, information theory, and self-supervised deep learning to accurately predict the quality of rendered digital human avatars in Virtual Reality (VR) and Augmented Reality (AR) systems. The growing adoption of VR/AR applications that aim to transmit digital human avatars over bandwidth-limited video networks has driven the need for VQA algorithms that better account for the kinds of distortions that reduce the quality of rendered and viewed avatars. As we will show, standard VQA models often fail to capture distortions unique to the rendering, transmission, and compression of videos containing human avatars. Towards solving this difficult problem, we adopt a multi-level Mixture-of-Experts approach. This involves computing distortion-aware perceptual features and high-level content-aware deep features that capture semantic attributes of human body avatars. The high-level features are computed using a self-supervised, pre-trained deep learning network. We show that HoloQA is able to achieve state-of-the-art performance on the recently introduced LIVE-Meta Rendered Human Avatar VQA database, demonstrating its efficacy in predicting the quality of rendered human avatars in VR. Furthermore, we demonstrate the competitive performance of HoloQA on other digital human avatar databases and on another synthetically generated video quality use case: cloud gaming. The code associated with this work will be made available on https://github.com/avinabsaha/HologramQAGitHub
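A schematic two-expert fusion of the kind described above (distortion-aware features and content-aware deep features combined by a learned gate before quality regression) could look like the following; dimensions and structure are assumptions for illustration, not HoloQA's architecture:

import torch
import torch.nn as nn

class TwoExpertFusion(nn.Module):
    # A gate weighs a distortion-aware expert against a content-aware expert.
    def __init__(self, d_distortion, d_content, hidden=128):
        super().__init__()
        self.expert_d = nn.Linear(d_distortion, hidden)
        self.expert_c = nn.Linear(d_content, hidden)
        self.gate = nn.Linear(d_distortion + d_content, 2)
        self.head = nn.Linear(hidden, 1)

    def forward(self, f_distortion, f_content):
        w = torch.softmax(self.gate(torch.cat([f_distortion, f_content], dim=1)), dim=1)
        mix = w[:, :1] * self.expert_d(f_distortion) + w[:, 1:] * self.expert_c(f_content)
        return self.head(torch.relu(mix)).squeeze(1)   # per-video quality score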
{"title":"HoloQA: Full Reference Video Quality Assessor of Rendered Human Avatars in Virtual Reality","authors":"Avinab Saha;Yu-Chih Chen;Christian Häne;Jean-Charles Bazin;Ioannis Katsavounidis;Alexandre Chapiro;Alan C. Bovik","doi":"10.1109/TIP.2026.3663930","DOIUrl":"10.1109/TIP.2026.3663930","url":null,"abstract":"We present HoloQA, a new state-of-the-art Full Reference Video Quality Assessment (VQA) model that was designed using principles of visual neuroscience, information theory, and self-supervised deep learning to accurately predict the quality of rendered digital human avatars in Virtual Reality (VR) and Augmented Reality (AR) systems. The growing adoption of VR/AR applications that aim to transmit digital human avatars over bandwidth-limited video networks has driven the need for VQA algorithms that better account for the kinds of distortions that reduce the quality of rendered and viewed avatars. As we will show, standard VQA models often fail to capture distortions unique to the rendering, transmission, and compression of videos containing human avatars. Towards solving this difficult problem, we adopt a multi-level Mixture-of-Experts approach. This involves computing distortion-aware perceptual features and high-level content-aware deep features that capture semantic attributes of human body avatars. The high-level features are computed using a self-supervised, pre-trained deep learning network. We show that HoloQA is able to achieve state-of-the-art performance on the recently introduced LIVE-Meta Rendered Human Avatar VQA database, demonstrating its efficacy in predicting the quality of rendered human avatars in VR. Furthermore, we demonstrate the competitive performance of HoloQA on other digital human avatar databases and on another synthetically generated video quality use case: cloud gaming. The code associated with this work will be made available on <uri>https://github.com/avinabsaha/HologramQAGitHub</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2210-2223"},"PeriodicalIF":13.7,"publicationDate":"2026-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146222656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-18. DOI: 10.1109/TIP.2026.3663898
Chao Huang;Jingxuan Zhang;Ye Zhang;Hao Wu;Peibei Cao;Zhihua Wang;Yang Yu;Xiaochun Cao
Fingerprint biometrics plays a crucial role in biometric identification, especially in applications such as criminal investigations. Although recent progress in recognition methodology has significantly enhanced automated fingerprint recognition, these systems still rely heavily on the quality of the input fingerprints. In criminal investigations, fingerprints are often of low quality due to their incidental deposition from natural oils and sweat, rather than being deliberately captured under controlled conditions. This degradation can significantly impact usability and identification accuracy, underscoring the need for effective Fingerprint Quality Assessment (FQA) methods. In this paper, we establish the Crime Scene Fingerprints quality assessment Dataset (CSFD-10k), the largest dataset of its kind, containing 11,500 fingerprint images from real criminal investigations. Of these, 10,000 samples are assigned Mean Opinion Scores (MOSs) for correlation testing, while the remaining 1,500 are labeled based on matching performance for generalizability testing. All labels are provided by frontline criminal police officers. Using this dataset, we propose a deep neural network-based Dual-Branch FQA (DB-FQA) framework that integrates image-level and edge-level features. The DB-FQA enhances ridge details by transforming raw grayscale fingerprints into edge maps using the Logical/Linear operator. A dual-branch network processes both the raw fingerprint and the edge map, and the Multi-scale Adaptive Cross feature Fusion (MACF) module fuses these features, guided by the edge map to highlight quality-related regions of interest. Extensive experiments demonstrate the robustness and superiority of our proposed method, offering substantial support for forensic fingerprint biometrics. The code and dataset are available at https://github.com/wzhsysu/FIQA.
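To make the dual-branch input pairing concrete, the snippet below derives an edge map from the grayscale fingerprint with a Sobel operator; the paper uses the Logical/Linear operator instead, so this is only a stand-in showing how the image branch and the edge branch would be fed:

import torch
import torch.nn.functional as F

def sobel_edge_map(gray):
    # gray: (B, 1, H, W) grayscale fingerprint in [0, 1].
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)   # ridge-emphasizing edge map

# The two branches then receive paired inputs, e.g.:
# raw, edge = gray, sobel_edge_map(gray)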
{"title":"Latent Fingerprint Quality Assessment for Criminal Investigations: A Benchmark Dataset and Method","authors":"Chao Huang;Jingxuan Zhang;Ye Zhang;Hao Wu;Peibei Cao;Zhihua Wang;Yang Yu;Xiaochun Cao","doi":"10.1109/TIP.2026.3663898","DOIUrl":"10.1109/TIP.2026.3663898","url":null,"abstract":"Fingerprint biometrics plays a crucial role in biometric identification, especially in applications such as criminal investigations. Although recent progress in recognition methodology has significantly enhanced automated fingerprint recognition, these systems still rely heavily on the quality of the input fingerprints. In criminal investigations, fingerprints are often of low quality due to their incidental deposition from natural oils and sweat, rather than being deliberately captured under controlled conditions. This degradation can significantly impact usability and identification accuracy, underscoring the need for effective Fingerprint Quality Assessment (FQA) methods. In this paper, we establish the Crime Scene Fingerprints quality assessment Dataset (CSFD-10k), the largest dataset of its kind, containing 11,500 fingerprint images from real criminal investigations. Of these, 10,000 samples are assigned Mean Opinion Scores (MOSs) for correlation testing, while the remaining 1,500 are labeled based on matching performance for generalizability testing. All labels are provided by frontline criminal police officers. Using this dataset, we propose a deep neural network-based Dual-Branch FQA (DB-FQA) framework that integrates image-level and edge-level features. The DB-FQA enhances ridge details by transforming raw grayscale fingerprints into edge maps using the Logical/Linear operator. A dual-branch network processes both the raw fingerprint and the edge map, and the Multi-scale Adaptive Cross feature Fusion (MACF) module fuses these features, guided by the edge map to highlight quality-related regions of interest. Extensive experiments demonstrate the robustness and superiority of our proposed method, offering substantial support for forensic fingerprint biometrics. The code and dataset are available at <uri>https://github.com/wzhsysu/FIQA</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2262-2275"},"PeriodicalIF":13.7,"publicationDate":"2026-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146222643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}