Pub Date: 2026-01-12 | DOI: 10.1109/TMM.2025.3632629
Wensheng Li;Jing Zhang;Li Zhuo;Qi Tian
Livestreaming platforms attract countless daily active users, making online content regulation imperative. The complex and diverse multimodal content elements in dynamic livestreaming scenes pose a great challenge to video content understanding. Building on the success of contrastive language-image pre-training (CLIP) for dynamic scene classification, one of the basic tasks of video content understanding, we propose a heterogeneous multimodal state space network (HMS2Net) for dynamic scene classification in livestreaming. (1) To fully and efficiently mine the dynamic scene elements in livestreaming, we design a heterogeneous teacher-student Transformer (HT-SFormer) with CLIP to extract multimodal features in an energy-efficient unified pipeline; (2) to cope with possible information conflicts in heterogeneous feature fusion, we introduce a cross-modal adaptive feature filter and fusion (CMAF) module that adjusts the multimodal feature composition to achieve more complete information complementarity; (3) for temporal context-awareness of dynamic scenes, we establish a dynamic state space memory (DSSM) structure to capture the correlation of multimodal data between neighboring video frames. A series of comparative experiments is conducted on the publicly available DAVIS, Mini-kinetics, and HMDB51 datasets and the self-built BJUT-LCD dataset. HMS2Net produces competitive results of 71.09%, 95.40%, 53.64%, and 82.36%, respectively, demonstrating its effectiveness and superiority for dynamic scene classification in livestreaming.
{"title":"HMS2Net: Heterogeneous Multimodal State Space Network via CLIP for Dynamic Scene Classification in Livestreaming","authors":"Wensheng Li;Jing Zhang;Li Zhuo;Qi Tian","doi":"10.1109/TMM.2025.3632629","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632629","url":null,"abstract":"Livestreaming platforms attract countless daily active users, making online content regulation imperative. The complex and diverse multimodal content elements in dynamic livestreaming scene pose a great challenge to video content understanding. Thanks to the success of contrastive language-image pre-training (CLIP) for dynamic scene classification, which is one of the basic tasks of video content understanding. We propose a heterogeneous multimodal state space network (HMS<sup>2</sup>Net) for dynamic scene classification in livestreaming via CLIP. (1) To fully and efficiently mine the dynamic scene elements in livestreaming, we design a heterogeneous teacher-student Transformer (HT-SFormer) with CLIP to extract multimodal features in an energy-efficient unified pipeline; (2) To cope with the possible information conflicts in heterogeneous feature fusion, we introduce a cross-modal adaptive feature filter and fusion (CMAF) module to generate more complete information complementarity by adjusting multimodal feature composition; (3) For temporal context-awareness of dynamic scene, we establish a dynamic state space memory (DSSM) structure for capturing the correlation of multimodal data between neighboring video frames. A series of comparative experiments are conducted on the publicly available datasets DAVIS, Mini-kinetics, HMDB51, and the self-built BJUT-LCD. Our HMS<sup>2</sup>Net produce competitive results of 71.09%, 95.40%, 53.64%, and 82.36%, respectively, demonstrating the effectiveness and superiority of dynamic scene classification in livestreaming.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"772-785"},"PeriodicalIF":9.7,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-03 | DOI: 10.1109/TMM.2025.3632640
Jiangpeng He;Xiaoyan Zhang;Luotao Lin;Jack Ma;Heather A. Eicher-Miller;Fengqing Zhu
Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, in which a few classes are common and many more are rare. To address these challenges, food recognition methods should focus on long-tailed continual learning. In this work, we introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN, and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers, and individuals with type 2 diabetes who do not take insulin. We propose a novel end-to-end framework that improves generalization to instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representations during continual learning. Additionally, we introduce an augmentation technique that integrates class activation maps (CAM) with CutMix to further improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2D, shows significant improvements over existing methods. An ablation study highlights further performance enhancements, demonstrating its potential for real-world food recognition applications.
{"title":"Long-Tailed Continual Learning for Visual Food Recognition","authors":"Jiangpeng He;Xiaoyan Zhang;Luotao Lin;Jack Ma;Heather A. Eicher-Miller;Fengqing Zhu","doi":"10.1109/TMM.2025.3632640","DOIUrl":"10.1109/TMM.2025.3632640","url":null,"abstract":"Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, where a few common classes and many more rare classes. To address these, food recognition methods should focus on long-tailed continual learning. In this work, We introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers and individuals with type 2 diabetes without taking insulin. We propose a novel end-to-end framework that improves the generalization ability for instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representation during continual learning. Additionally, we introduce an augmentation technique by integrating class-activation-map (CAM) and CutMix to improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2DM, shows significant improvements over existing methods. An ablation study highlights further performance enhancements, demonstrating its potential for real-world food recognition applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"865-877"},"PeriodicalIF":9.7,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145700829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-03 | DOI: 10.1109/TMM.2025.3638016
Lizhu Liu;Yaonan Wang;Yurong Chen;Jiwen Lu;Hui Zhang
Coded aperture snapshot spectral imaging (CASSI) captures 3D hyperspectral images (HSIs) in a single shot by encoding incident light into 2D measurements. However, recovering the original hyperspectral data from these measurements is a severely ill-posed inverse problem due to significant information loss during compression. Recent deep learning methods, especially deep unfolding networks, have demonstrated promising reconstruction results by embedding learnable priors into iterative optimization frameworks. However, most existing approaches use a single network to jointly estimate spatial and spectral priors, limiting their ability to handle the distinct properties of HSIs. To overcome this limitation, we propose the Spatial-Spectral Prior Decoupling Model (SSPD), which reformulates HSI reconstruction as a prior absorption problem, enabling independent modeling of spatial and spectral priors with specialized network architectures. To achieve this, we design two attention mechanisms tailored for hyperspectral data: one for capturing spatial correlations and another for preserving spectral signatures. Additionally, we develop a hybrid loss function that combines convergence constraints and cross-prior interactions, ensuring accurate prior fusion and stable reconstruction. Experiments on synthetic and real-world datasets confirm that SSPD outperforms existing methods in spectral snapshot compressive imaging.
{"title":"SSPD: Spatial-Spectral Prior Decoupling Model for Spectral Snapshot Compressive Imaging","authors":"Lizhu Liu;Yaonan Wang;Yurong Chen;Jiwen Lu;Hui Zhang","doi":"10.1109/TMM.2025.3638016","DOIUrl":"https://doi.org/10.1109/TMM.2025.3638016","url":null,"abstract":"Coded aperture snapshot spectral imaging (CASSI) captures 3D hyperspectral images (HSIs) in a single shot by encoding incident light into 2D measurements. However, recovering the original hyperspectral data from these measurements is a severely ill-posed inverse problem due to significant information loss during compression. Recent deep learning methods, especially deep unfolding networks, have demonstrated promising reconstruction results by embedding learnable priors into iterative optimization frameworks. However, most existing approaches use a single network to jointly estimate spatial and spectral priors, limiting their ability to handle the distinct properties of HSIs. To overcome this limitation, we propose the Spatial-Spectral Prior Decoupling Model (SSPD), which reformulates HSI reconstruction as a prior absorption problem, enabling independent modeling of spatial and spectral priors with specialized network architectures. To achieve this, we design two attention mechanisms tailored for hyperspectral data: one for capturing spatial correlations and another for preserving spectral signatures. Additionally, we develop a hybrid loss function that combines convergence constraints and cross-prior interactions, ensuring accurate prior fusion and stable reconstruction. Experiments on synthetic and real-world datasets confirm that SSPD outperforms existing methods in spectral snapshot compressive imaging.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9847-9860"},"PeriodicalIF":9.7,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-18 | DOI: 10.1109/TMM.2025.3632642
Zizhuang Zou;Mao Ye;Luping Ji;Lihua Zhou;Song Tang;Yan Gan;Shuai Li
Multi-Object Tracking (MOT) in Uncrewed Aerial Vehicles (UAVs) aims to continuously and stably detect and track objects in videos captured by UAVs. Existing tracking-by-detection MOT schemes typically employ a tracker with a fixed step size and feed it a fixed length of past tracking information to guide position prediction. However, the limited prediction range of a single-scale tracker leads to frequent tracking losses, and the limited historical information also reduces tracking accuracy. To address these limitations, we propose a novel Long-Short Match (LSMTrack) tracking method. The key idea is to use long and short trackers and maintain a long-term motion state to improve tracking performance, thus reducing the likelihood of tracks entering the lost status. To this end, a new Mamba-based tracker and a long-short match strategy are proposed. The long and short trackers share the same Mamba-based architecture. Unlike previous Mamba-based approaches, the proposed tracker maintains a long-term state while updating the state and making position predictions at each time step, so we call it a step Mamba tracker. Meanwhile, we devise a long-short match strategy at the inference stage to integrate the long and short trackers, and design a lost-control operation that updates the long-term states using historical state values. In this way, both the matching probability and the inference efficiency are guaranteed. Experimental results on two UAV MOT datasets confirm state-of-the-art performance; in particular, the best results are achieved in terms of the two popular tracking evaluation metrics, MOTA and IDF1.
{"title":"Long-Short Match for Lost Control in UAV Multi-Object Tracking","authors":"Zizhuang Zou;Mao Ye;Luping Ji;Lihua Zhou;Song Tang;Yan Gan;Shuai Li","doi":"10.1109/TMM.2025.3632642","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632642","url":null,"abstract":"Multi-Object Tracking (MOT) in Uncrewed Aerial Vehicles (UAV) aims to continuously and stably detect and track objects in videos captured by UAVs. In existing MOT tracking-by-detection schemes, the tracker with a fixed step size is always employed, and a fixed length of past tracking information is input to the tracker to guide position prediction. However, the limited prediction range of a single-scale tracker leads to frequent tracking losses, and limited historical information also reduces tracking accuracy. To address these limitations, we propose a novel Long-Short Match (LSMTrack) tracking method. The key idea is to use long and short trackers and maintain a long-term motion state to improve tracking performance, thus reducing the likelihood of entering the lost status. To this end, a new Mamba-based tracker and a long-short match strategy are proposed. For long and short trackers, the same architecture is used based on Mamba. Unlike the previous Mamba-based approach, the proposed tracker maintains a long-term state while updating the state and making position predictions in each time step, so we call it a step Mamba tracker. Meanwhile, we devise a long-short match strategy at the inference stage to integrate long and short trackers, and design a lost control operation which updates the long-term states using historical state values. In this way, the matching probability and the inference efficiency are guaranteed. Experimental results on two UAV MOT datasets confirm the state-of-the-art performance. Specifically, the best results are achieved in terms of two popular MOTA and IDF1 tracking evaluation metrics.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"786-800"},"PeriodicalIF":9.7,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17 | DOI: 10.1109/TMM.2025.3632692
Wuxuan Shi;Mang Ye;Wei Yu;Bo Du
Catastrophic forgetting, the degradation of knowledge about previously seen classes when learning new concepts from a shifting data stream, is a pitfall faced by neural networks learning in open environments. Recent research on continual image classification usually relies on storing samples or prototypes to resist this forgetting. We find that while the model acquires knowledge of new classes, the features of old classes gradually disperse, which leads to confusion between class features and makes them difficult to discriminate. Coping with feature dispersion is therefore a key consideration in resisting catastrophic forgetting, one that has been neglected in previous works. We address this issue from two perspectives. First, we propose a dispersing feature generation mechanism, which generates pseudo-features based on the pre-pooling prototypes of the old classes to simulate feature dispersion and remind the classifier to adjust the decision boundary. Second, we design a consistent alignment constraint that alleviates the severity of feature dispersion by maintaining consistency among the hidden states of different depths when aligning the current model with the previous model. Extensive experimental results on various benchmarks show the superiority of our proposed method.
{"title":"Feature Dispersion Adaptation With Pre-Pooling Prototype for Continual Image Classification","authors":"Wuxuan Shi;Mang Ye;Wei Yu;Bo Du","doi":"10.1109/TMM.2025.3632692","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632692","url":null,"abstract":"Catastrophic forgetting, the degradation of knowledge about previously seen classes when learning new concepts from a shifting data stream, is a pitfall faced by neural network learning in open environments. Recent research on continual image classification usually relies on storing samples or prototypes to resist this forgetting. We find that during acquiring knowledge of the new classes, the features of old classes gradually disperse, which leads to confusion of features between classes and makes them difficult to discriminate. Coping with feature dispersion would be a key consideration in resisting catastrophic forgetting, which has been neglected in previous works. To this end, we try to address this issue from two perspectives. First, we propose a dispersing feature generation mechanism, which generates pseudo-features based on the pre-pooling prototypes of the old classes to simulate feature dispersion and remind the classifier to adjust the decision boundary. Second, we design a consistent alignment constraint to alleviate the severity of feature dispersion by maintaining consistency in the hidden states of different depths when aligning the current model with the previous model. Extensive experimental results on various benchmarks show the superiority of our proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"801-812"},"PeriodicalIF":9.7,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632670
Yihan Wang;Baoli Sun;Haojie Li;Xinzhu Ma;Zhihui Wang;Zhiyong Wang
The key to fine-grained video action recognition is identifying subtle differences between action categories. Relying solely on visual features supervised by action labels makes it challenging to characterize robust and discriminative action dynamics from videos. With significant advancements in human pose estimation and the powerful capabilities of Vision-Language Models (VLMs), obtaining reliable and cost-free human pose data and textual semantics has become increasingly feasible, enabling their effective use in fine-grained action recognition. However, the inherent disparities in feature representations across modalities necessitate a robust alignment strategy to achieve optimal fusion. To address this, we propose a universal cross-modality knowledge alignment framework, namely UniAlign, to transfer knowledge from such pre-trained multi-modal models into action recognition models. Specifically, UniAlign introduces two additional branches to extract pose features and textual semantics with the pre-trained pose encoder and VLM. To align the action-relevant cues among video features, pose features, and textual semantics, we propose a Cross-Modality Similarity Aggregation (CMSA) module that exploits the importance of different modal cues while aggregating cross-modal similarities. Additionally, we adopt a fine-tuning mechanism similar to Exponential Moving Average (EMA) to refine the textual semantics, ensuring that the semantic representations encoded by VLMs are preserved while being optimized toward the specific task preferences. Extensive experiments on widely used fine-grained action recognition benchmarks (e.g., FineGym, NTU RGB+D, Diving48) and the coarse-grained K400 dataset demonstrate the effectiveness of the proposed UniAlign method.
{"title":"UniAlign: A Universal Cross-Modality Knowledge Alignment Framework for Fine-Grained Action Recognition","authors":"Yihan Wang;Baoli Sun;Haojie Li;Xinzhu Ma;Zhihui Wang;Zhiyong Wang","doi":"10.1109/TMM.2025.3632670","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632670","url":null,"abstract":"The key to fine-grained video action recognition is identifying subtle differences between action categories. Relying solely on visual features supervised by action labels makes it challenging to characterize robust and discriminative action dynamics from videos. With significant advancements in human pose estimation and the powerful capabilities of Vision-Language Models (VLMs), obtaining reliable and cost-free human pose data and textual semantics has become increasingly feasible, enabling their effective use in fine-grained action recognition. However, the inherent disparities in feature representations across different modalities necessitate a robust alignment strategy to achieve optimal fusion. To address this, we propose a universal cross-modality knowledge alignment framework, namely UniAlign, to transfer the knowledge from such pre-trained multi-modal models into action recognition models. Specifically, UniAlign introduces two additional branches to extract pose features and textual semantics with the pre-trained pose encoder and VLM. To align the action-relevant cues among video features, pose features, and textual semantics, we propose a Cross-Modality Similarity Aggregation module (CMSA) that utilizes the importance of different modal cues while aggregating cross-modal similarities. Additionally, we adopt a fine-tuning mechanism similar to Exponential Moving Average (EMA) to refine the textual semantics, ensuring that the semantic representations encoded by VLMs are preserved while being optimized towards the specific task preferences. Extensive experiments on widely used fine-grained action recognition benchmarks (e.g., FineGym, NTURGB-D, Diving48) and coarse-grained K400 dataset demonstrate the effectiveness of the proposed UniAlign method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"891-901"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632687
Zheng Liu;Jianjun Zhang;Ming Zhang;Runze Ke;Chengcheng Yu;Ligang Liu
Point cloud reconstruction is a key ingredient in geometry modeling, computer graphics, and 3D vision. In this paper, we propose a novel unsupervised learning method, the Recurrent Multi-Step Moving Strategy, which progressively moves query points toward the underlying surface to accurately learn unsigned distance fields (UDFs) for point cloud reconstruction. Specifically, we design a recurrent network for UDF estimation that integrates a multi-step strategy for query movement. The model treats query movement as a trajectory prediction process, establishing dependencies between the current query-move decision and the previous path, thus utilizing temporal information to improve UDF estimation accuracy. Further, we design distance and gradient regularization losses to ensure the precision, consistency, and continuity of the estimated UDFs. Extensive evaluations, comparisons, and ablation studies show the superiority of our method over competing approaches in terms of reconstruction accuracy and generality. Our unsupervised reconstruction method outperforms many supervised techniques and demonstrates efficacy across diverse scenarios, including single-object, indoor, and outdoor benchmarks.
{"title":"Unsupervised Point Cloud Reconstruction via Recurrent Multi-Step Moving Strategy","authors":"Zheng Liu;Jianjun Zhang;Ming Zhang;Runze Ke;Chengcheng Yu;Ligang Liu","doi":"10.1109/TMM.2025.3632687","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632687","url":null,"abstract":"Point cloud reconstruction is an ingredient in geometry modeling, computer graphics, and 3D vision. In this paper, we propose a novel unsupervised learning method called the Recurrent Multi-Step Moving Strategy, which progressively moves query points toward the underlying surface to accurately learn unsigned distance fields (UDFs) for point cloud reconstruction. Specifically, we design a recurrent network for UDF estimation that integrates a multi-step strategy for query movement. This model treats query movement as a trajectory prediction process, establishing dependencies between the current query move decision and the previous path, thus utilizing temporal information to improve UDF estimation accuracy. Further, we design distance and gradient regularization losses to ensure the precision, consistency, and continuity of the estimated UDFs. Extensive evaluations, comparisons, and ablation studies are conducted to show the superiority of our method over the competing approaches in terms of reconstruction accuracy and generality. Our unsupervised reconstruction method outperforms many supervised techniques and demonstrates efficacy across diverse scenarios, including single-object, indoor, and outdoor benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"972-984"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632682
Yang Zhou;Jingru Yang;Jin Wang;Kaixiang Huang;Guodong Lu;Shengfeng He
Sketches, an emerging alternative to natural language in multimedia systems, are characterized by sparse visual cues such as simple strokes, which differ significantly from natural images containing complex elements such as background, foreground, and texture. This misalignment poses substantial challenges for zero-shot sketch-based image retrieval (ZS-SBIR). Prior approaches match sketches to full images and tend to overlook the redundant elements in natural images, leading to model distraction and semantic ambiguity. To address this issue, we introduce a distraction-agnostic framework, purified cross-domain matching (PuXIM), which operates on a straightforward principle: masking and matching. We devise a visual-cross-linguistic (VxL) sampler that generates linguistic masks based on semantic labels to obscure semantically irrelevant image features. Our novel contribution is the concept of purified masked matching (PMM), which comprises two processes: (1) reconstruction, which compels the image encoder to reconstruct the masked image feature, and (2) interaction, in which a transformer decoder processes both sketch and masked image features to investigate cross-domain relationships for effective matching. Evaluated on the TU-Berlin, Sketchy, and QuickDraw datasets, PuXIM sets new performance benchmarks. Importantly, the distraction-agnostic nature of the matching process makes PuXIM easier to train, enabling efficient adaptation to zero-shot scenarios with reduced data requirements and low data quality.
{"title":"Purified Zero-Shot Sketch-Based Image Retrieval","authors":"Yang Zhou;Jingru Yang;Jin Wang;Kaixiang Huang;Guodong Lu;Shengfeng He","doi":"10.1109/TMM.2025.3632682","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632682","url":null,"abstract":"Sketches, as a new solution in multimedia systems that can replace natural language, are characterized by sparse visual cues such as simple strokes that differ significantly from natural images containing complex elements such as background, foreground, and texture. This misalignment poses substantial challenges for zero-shot sketch-based image retrieval (ZS-SBIR). Prior approaches match sketches to full images and tend to overlook redundant elements in natural images, leading to model distraction and semantic ambiguity. To address this issue, we introduce a distraction-agnostic framework, purified cross-domain matching (PuXIM), which operates on a straightforward principle: masking and matching. We devise a visual-cross-linguistic (VxL) sampler that generates linguistic masks based on semantic labels to obscure semantically irrelevant image features. Our novel contribution is the concept of purified masked matching (PMM), which comprises two processes: (1) <italic>reconstruction</i>, which compels the image encoder to reconstruct the masked image feature, and (2) <italic>interaction</i>, which involves a transformer decoder that processes both sketch and masked image features to investigate cross-domain relationships for effective matching. Evaluated on the TU-Berlin, Sketchy, and QuickDraw datasets, PuXIM sets new benchmarks in terms of performance. Importantly, the distraction-agnostic nature of the matching process renders PuXIM more conducive to training, enabling efficient adaptation to zero-shot scenarios with reduced data requirements and low data quality.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"929-943"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632694
Yongzhen Wang;Jie Sun;Heng Liu;Xiao-Ping Zhang;Mingqiang Wei
Recent diffusion models have demonstrated exceptional efficacy across various image restoration tasks, but they still suffer from long inference times and substantial computational resource consumption. To address these challenges, we present LPCDiff, a novel Laplacian Pyramid-based Conditional Diffusion model designed for real-scene image dehazing. LPCDiff leverages Laplacian pyramid decomposition to decouple the input image into two components: a low-resolution low-pass image and high-frequency residuals. These components are subsequently reconstructed through a diffusion model and a well-designed high-frequency residual recovery module. With such a strategy, LPCDiff substantially accelerates inference and reduces computational costs without sacrificing image fidelity. In addition, the framework empowers the model to capture the intrinsic high-frequency details and low-frequency structural information within the image, resulting in sharper and more realistic haze-free outputs. Moreover, to extract more value from the limited training data, we introduce a low-frequency refinement module to further enhance the intricate details of the final dehazed images. Through extensive experimentation, our method significantly outperforms 12 state-of-the-art approaches on three real-world image dehazing benchmarks and one synthetic benchmark.
{"title":"Real-Scene Image Dehazing via Laplacian Pyramid-Based Conditional Diffusion Model","authors":"Yongzhen Wang;Jie Sun;Heng Liu;Xiao-Ping Zhang;Mingqiang Wei","doi":"10.1109/TMM.2025.3632694","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632694","url":null,"abstract":"Recent diffusion models have demonstrated exceptional efficacy across various image restoration tasks, but still suffer from time-consuming and substantial computational resource consumption. To address these challenges, we present LPCDiff, a novel Laplacian Pyramid-based Conditional Diffusion model designed for real-scene image dehazing. LPCDiff leverages the Laplacian pyramid decomposition to decouple the input image into two components: the low-resolution low-pass image and the high-frequency residuals. These components are subsequently reconstructed through a diffusion model and a well-designed high-frequency residual recovery module. With such a strategy, LPCDiff can substantially accelerate inference speed and reduce computational costs without sacrificing image fidelity. In addition, the framework empowers the model to capture intrinsic high-frequency details and low-frequency structural information within the image, resulting in sharper and more realistic haze-free outputs. Moreover, to extract more valuable information from the limited training data, we introduce a low-frequency refinement module to further enhance the intricate details of the final dehazed images. Through extensive experimentation, our method significantly outperforms 12 state-of-the-art approaches on three real-world and one synthetic image dehazing benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"944-957"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}