Deep reinforcement learning (DRL) has demonstrated promising potential in adaptive video streaming and has recently received increasing attention. However, existing DRL-based methods for adaptive video streaming mainly use application (APP) layer information, adopt heuristic training methods, and are not robust against continuous network fluctuations. This paper aims to boost the quality of experience (QoE) of adaptive wireless video streaming by using cross-layer information, deriving a rigorous training method, and adopting effective online tuning methods with real-time data. First, we formulate a more comprehensive and accurate adaptive wireless video streaming problem as an infinite-stage discounted Markov decision process (MDP) by additionally incorporating past and lower-layer information. This formulation allows a flexible tradeoff between QoE and the computational and memory costs of solving the problem. In the offline scenario (only with pre-collected data), we propose an enhanced asynchronous advantage actor-critic (eA3C) method that jointly optimizes the parameters of the parameterized policy and value function. Specifically, we build an eA3C network consisting of a policy network and a value network that can utilize cross-layer, past, and current information, and jointly train the eA3C network using pre-collected samples. In the online scenario (with additional real-time data), we propose two continual learning-based online tuning methods for designing better policies for a specific user, offering different QoE-training time tradeoffs. The proposed online tuning methods are robust against continuous network fluctuations and are more general and flexible than existing online tuning methods. Finally, experimental results show that the proposed offline policy improves QoE by 6.8% to 14.4% over state-of-the-art methods in the offline scenario, and the proposed online policies achieve 6.3% to 55.8% QoE gains over state-of-the-art methods in the online scenario.
{"title":"Enhancing Neural Adaptive Wireless Video Streaming via Cross-Layer Information Exposure and Online Tuning","authors":"Lingzhi Zhao;Ying Cui;Yuhang Jia;Yunfei Zhang;Klara Nahrstedt","doi":"10.1109/TMM.2024.3521820","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521820","url":null,"abstract":"Deep reinforcement learning (DRL) demonstrates its promising potential in adaptive video streaming and has recently received increasing attention. However, existing DRL-based methods for adaptive video streaming mainly use application (APP) layer information, adopt heuristic training methods, and are not robust against continuous network fluctuations. This paper aims to boost the quality of experience (QoE) of adaptive wireless video streaming by using cross-layer information, deriving a rigorous training method, and adopting effective online tuning methods with real-time data. First, we formulate a more comprehensive and accurate adaptive wireless video streaming problem as an infinite stage discounted Markov decision process (MDP) problem by additionally incorporating past and lower-layer information. This formulation allows a flexible tradeoff between QoE and computational and memory costs for solving the problem. In the offline scenario (only with pre-collected data), we propose an enhanced asynchronous advantage actor-critic (eA3C) method by jointly optimizing the parameters of parameterized policy and value function. Specifically, we build an eA3C network consisting of a policy network and a value network that can utilize cross-layer, past, and current information and jointly train the eA3C network using pre-collected samples. In the online scenario (with additional real-time data), we propose two continual learning-based online tuning methods for designing better policies for a specific user with different QoE and training time tradeoffs. The proposed online tuning methods are robust against continuous network fluctuations and more general and flexible than the existing online tuning methods. Finally, experimental results show that the proposed offline policy can improve the QoE by 6.8% to 14.4% compared to the state-of-the-arts in the offline scenario, and the proposed online policies can achieve <inline-formula><tex-math>$6.3%$</tex-math></inline-formula> to 55.8% gains in QoE over the state-of-the-arts in the online scenario.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1289-1304"},"PeriodicalIF":8.4,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-28  DOI: 10.1109/TMM.2024.3521815
Gaosheng Liu;Huanjing Yue;Bihan Wen;Jingyu Yang
The dense light field sampling of focused plenoptic images (FPIs) yields substantial amounts of redundant data, necessitating efficient compression in practical applications. However, the presence of discontinuous structures and long-distance properties in FPIs poses a challenge. In this paper, we propose a novel end-to-end approach for learned focused plenoptic image compression (LFPIC). Specifically, we introduce a local-global correlation learning strategy to build the nonlinear transforms. This strategy can effectively handle the discontinuous structures and leverage long-distance correlations in FPIs for high compression efficiency. Additionally, we propose a spatial-wise context model tailored for LFPIC to emphasize the most relevant symbols during coding and further enhance the rate-distortion performance. Experimental results demonstrate the effectiveness of the proposed method, which achieves a 22.16% BD-rate reduction (measured in PSNR) on the public dataset compared to the recent state-of-the-art LFPIC method. This improvement holds significant promise for applications of focused plenoptic cameras.
{"title":"Learned Focused Plenoptic Image Compression With Local-Global Correlation Learning","authors":"Gaosheng Liu;Huanjing Yue;Bihan Wen;Jingyu Yang","doi":"10.1109/TMM.2024.3521815","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521815","url":null,"abstract":"The dense light field sampling of focused plenoptic images (FPIs) yields substantial amounts of redundant data, necessitating efficient compression in practical applications. However, the presence of discontinuous structures and long-distance properties in FPIs poses a challenge. In this paper, we propose a novel end-to-end approach for learned focused plenoptic image compression (LFPIC). Specifically, we introduce a local-global correlation learning strategy to build the nonlinear transforms. This strategy can effectively handle the discontinuous structures and leverage long-distance correlations in FPI for high compression efficiency. Additionally, we propose a spatial-wise context model tailored for LFPIC to help emphasize the most related symbols during coding and further enhance the rate-distortion performance. Experimental results demonstrate the effectiveness of our proposed method, achieving a 22.16% BD-rate reduction (measured in PSNR) on the public dataset compared to the recent state-of-the-art LFPIC method. This improvement holds significant promise for benefiting the applications of focused plenoptic cameras.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1216-1227"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camouflaged object detection (COD) aims to segment camouflaged objects that exhibit patterns very similar to their surrounding environment. Recent research has shown that enhancing the feature representation with frequency information can greatly alleviate the ambiguity between foreground objects and the background. With the emergence of vision foundation models such as InternImage and the Segment Anything Model, adapting a pretrained model to COD tasks with a lightweight adapter module is a novel and promising research direction. Existing adapter modules, however, mainly address feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for the COD task. Specifically, we transform the input features of the adapter into the frequency domain. By grouping and interacting with frequency components located within non-overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, so that the intensity of image details and contour features is adaptively adjusted. At the same time, features that help distinguish object from background are highlighted, indirectly indicating the position and shape of the camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets, and the proposed method outperforms 26 state-of-the-art methods by large margins. Code will be released.
{"title":"Frequency-Guided Spatial Adaptation for Camouflaged Object Detection","authors":"Shizhou Zhang;Dexuan Kong;Yinghui Xing;Yue Lu;Lingyan Ran;Guoqiang Liang;Hexu Wang;Yanning Zhang","doi":"10.1109/TMM.2024.3521681","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521681","url":null,"abstract":"Camouflaged object detection (COD) aims to segment camouflaged objects which exhibit very similar patterns with the surrounding environment. Recent research works have shown that enhancing the feature representation via the frequency information can greatly alleviate the ambiguity problem between the foreground objects and the background. With the emergence of vision foundation models, like InternImage, Segment Anything Model etc, adapting the pretrained model on COD tasks with a lightweight adapter module shows a novel and promising research direction. Existing adapter modules mainly care about the feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for COD task. Specifically, we transform the input features of the adapter into frequency domain. By grouping and interacting with frequency components located within non overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, making the intensity of image details and contour features adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets and the proposed method outperforms 26 state-of-the-art methods with large margins. Code will be released.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"72-83"},"PeriodicalIF":8.4,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-13  DOI: 10.1109/TMM.2024.3521731
Lin Jiang;Jigang Wu;Shuping Zhao;Jiaxing Li
In cross-domain recognition tasks, the divergent distributions of data acquired from various domains degrade the effectiveness of knowledge transfer. Additionally, in practice, cross-domain data contain a massive amount of redundant information, which often disturbs the training of cross-domain classifiers. Seeking to address these issues and obtain efficient domain-invariant knowledge, this paper proposes a novel cross-domain classification method, named cross-scatter sparse dictionary pair learning (CSSDL). First, a pair of dictionaries is learned in a common subspace, in which the marginal distribution divergence between the cross-domain data is mitigated and domain-invariant information can be efficiently extracted. Then, a cross-scatter discriminant term is proposed to decrease the distance between cross-domain data belonging to the same class. This term guarantees that data derived from the same class can be aligned and that the conditional distribution divergence is mitigated. In addition, a flexible label regression method is introduced to match the feature representation and label information in the label space. Thereafter, a discriminative and transferable feature representation can be obtained. Moreover, two sparse constraints are introduced to maintain the sparse characteristics of the feature representation. Extensive experimental results obtained on public datasets demonstrate the effectiveness of the proposed CSSDL approach.
{"title":"Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification","authors":"Lin Jiang;Jigang Wu;Shuping Zhao;Jiaxing Li","doi":"10.1109/TMM.2024.3521731","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521731","url":null,"abstract":"In cross-domain recognition tasks, the divergent distributions of data acquired from various domains degrade the effectiveness of knowledge transfer. Additionally, in practice, cross-domain data also contain a massive amount of redundant information, usually disturbing the training processes of cross-domain classifiers. Seeking to address these issues and obtain efficient domain-invariant knowledge, this paper proposes a novel cross-domain classification method, named cross-scatter sparse dictionary pair learning (CSSDL). Firstly, a pair of dictionaries is learned in a common subspace, in which the marginal distribution divergence between the cross-domain data is mitigated, and domain-invariant information can be efficiently extracted. Then, a cross-scatter discriminant term is proposed to decrease the distance between cross-domain data belonging to the same class. As such, this term guarantees that the data derived from same class can be aligned and that the conditional distribution divergence is mitigated. In addition, a flexible label regression method is introduced to match the feature representation and label information in the label space. Thereafter, a discriminative and transferable feature representation can be obtained. Moreover, two sparse constraints are introduced to maintain the sparse characteristics of the feature representation. Extensive experimental results obtained on public datasets demonstrate the effectiveness of the proposed CSSDL approach.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"371-384"},"PeriodicalIF":8.4,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-10  DOI: 10.1109/TMM.2024.3521818
Yi Jin;Xiaoxiao Ma;Rui Zhang;Huaian Chen;Yuxuan Gu;Pengyang Ling;Enhong Chen
Learning-based video denoisers have attained state-of-the-art (SOTA) performance on public evaluation benchmarks. Nevertheless, they typically suffer significant performance drops when applied to unseen real-world data, owing to inherent data discrepancies. To address this problem, this work delves into model pretraining techniques and proposes masked central frame modeling (MCFM), a new video pretraining approach that significantly improves the generalization ability of the denoiser. This proposal stems from a key observation: pretraining the denoiser by reconstructing intact videos from corrupted sequences, in which the central frames are masked with a suitable probability, contributes to superior performance on real-world data. Building upon MCFM, we introduce a robust video denoiser, named MVDenoiser, which is first pretrained on massive amounts of readily available ordinary videos for general video modeling and then finetuned on costly real-world noisy/clean video pairs for noisy-to-clean mapping. Additionally, beyond the denoising model, we establish a new paired real-world noisy video dataset (RNVD) to facilitate cross-dataset evaluation of generalization ability. Extensive experiments conducted across different datasets demonstrate that the proposed method achieves superior performance compared to existing methods.
{"title":"Masked Video Pretraining Advances Real-World Video Denoising","authors":"Yi Jin;Xiaoxiao Ma;Rui Zhang;Huaian Chen;Yuxuan Gu;Pengyang Ling;Enhong Chen","doi":"10.1109/TMM.2024.3521818","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521818","url":null,"abstract":"Learning-based video denoisers have attained state-of-the-art (SOTA) performances on public evaluation benchmarks. Nevertheless, they typically encounter significant performance drops when applied to unseen real-world data, owing to inherent data discrepancies. To address this problem, this work delves into the model pretraining techniques and proposes masked central frame modeling (MCFM), a new video pretraining approach that significantly improves the generalization ability of the denoiser. This proposal stems from a key observation: pretraining denoiser by reconstructing intact videos from the corrupted sequences, where the central frames are masked at a suitable probability, contributes to achieving superior performance on real-world data. Building upon MCFM, we introduce a robust video denoiser, named MVDenoiser, which is firstly pretrained on massive available ordinary videos for general video modeling, and then finetuned on costful real-world noisy/clean video pairs for noisy-to-clean mapping. Additionally, beyond the denoising model, we further establish a new paired real-world noisy video dataset (RNVD) to facilitate cross-dataset evaluation of generalization ability. Extensive experiments conducted across different datasets demonstrate that the proposed method achieves superior performance compared to existing methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"622-636"},"PeriodicalIF":8.4,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-10  DOI: 10.1109/TMM.2024.3521846
Guanghui Wu;Lili Chen;Zengping Chen
Self-supervised monocular depth estimation has been widely studied for 3D perception, as it can infer depth, pose, and object motion from monocular videos. However, existing single-view and multi-view methods employ separate networks to learn specific representations for these different tasks. This not only results in a cumbersome model architecture but also limits the representation capacity. In this paper, we revisit previous methods and arrive at two insights: (1) the three tasks are reciprocal and all depend on matching information, and (2) different representations carry complementary information. Based on these insights, we propose Uni-DPM, a compact self-supervised framework that completes these three tasks with a shared representation. Specifically, we introduce a U-Net-like model that synchronously completes multiple tasks by leveraging their common dependence on matching information, and iteratively refines the predictions by utilizing the reciprocity among tasks. Furthermore, we design a shared Appearance-Matching-Temporal (AMT) representation for the three tasks by exploiting the complementarity among different types of information. In addition, our Uni-DPM is scalable to downstream tasks, including scene flow, optical flow, and motion segmentation. Comparative experiments demonstrate the competitiveness of Uni-DPM on these tasks, while ablation experiments verify our insights.
{"title":"Uni-DPM: Unifying Self-Supervised Monocular Depth, Pose, and Object Motion Estimation With a Shared Representation","authors":"Guanghui Wu;Lili Chen;Zengping Chen","doi":"10.1109/TMM.2024.3521846","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521846","url":null,"abstract":"Self-supervised monocular depth estimation has been widely studied for 3D perception, as it can infer depth, pose, and object motion from monocular videos. However, existing single-view and multi-view methods employ separate networks to learn specific representations for these different tasks. This not only results in a cumbersome model architecture but also limits the representation capacity. In this paper, we revisit previous methods and have the following insights: (1) these three tasks are reciprocal and all depend on matching information and (2) different representations carry complementary information. Based on these insights, we propose Uni-DPM, a compact self-supervised framework to complete these three tasks with a shared representation. Specifically, we introduce an U-net-like model to synchronously complete multiple tasks by leveraging their common dependence on matching information, and iteratively refine the predictions by utilizing the reciprocity among tasks. Furthermore, we design a shared Appearance-Matching-Temporal (AMT) representation for these three tasks by exploiting the complementarity among different types of information. In addition, our Uni-DPM is scalable to downstream tasks, including scene flow, optical flow, and motion segmentation. Comparative experiments demonstrate the competitiveness of our Uni-DPM on these tasks, while ablation experiments also verify our insights.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1498-1511"},"PeriodicalIF":8.4,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-07  DOI: 10.1109/TMM.2024.3521854
Ben Fei;Liwen Liu;Tianyue Luo;Weidong Yang;Lipeng Ma;Zhijun Li;Wen-Ming Chen
In partial-to-complete point cloud completion, it is imperative that every patch in the output point cloud faithfully represent the corresponding patch in the partial input, ensuring similarity in terms of geometric content. To achieve this objective, we propose a straightforward method dubbed PPCL that aims to maximize the mutual information between two point patches from the encoder and decoder by leveraging a contrastive learning framework. Contrastive learning facilitates the mapping of two similar point patches to corresponding points in a learned feature space. Notably, we explore multi-layer point patches contrastive learning (MPPCL) instead of operating on the whole point cloud. The negatives are drawn from within the input point cloud itself rather than from the rest of the dataset. To fully leverage the local geometries present in the partial inputs and enhance the quality of point patches in the encoder, we introduce Multi-level Feature Learning (MFL) and Hierarchical Feature Fusion (HFF) modules, which also facilitate the learning of features at various levels. Moreover, Spatial-Channel Transformer Point Up-sampling (SCT) is devised to guide the decoder to construct a complete and fine-grained point cloud by leveraging the enhanced point patches from our point patches contrastive learning. Extensive experiments demonstrate that PPCL achieves better quantitative and qualitative performance than off-the-shelf methods across various datasets.
{"title":"Point Patches Contrastive Learning for Enhanced Point Cloud Completion","authors":"Ben Fei;Liwen Liu;Tianyue Luo;Weidong Yang;Lipeng Ma;Zhijun Li;Wen-Ming Chen","doi":"10.1109/TMM.2024.3521854","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521854","url":null,"abstract":"In partial-to-complete point cloud completion, it is imperative that enabling every patch in the output point cloud faithfully represents the corresponding patch in partial input, ensuring similarity in terms of geometric content. To achieve this objective, we propose a straightforward method dubbed PPCL that aims to maximize the mutual information between two point patches from the encoder and decoder by leveraging a contrastive learning framework. Contrastive learning facilitates the mapping of two similar point patches to corresponding points in a learned feature space. Notably, we explore multi-layer point patches contrastive learning (MPPCL) instead of operating on the whole point cloud. The negatives are exploited within the input point cloud itself rather than the rest of the datasets. To fully leverage the local geometries present in the partial inputs and enhance the quality of point patches in the encoder, we introduce Multi-level Feature Learning (MFL) and Hierarchical Feature Fusion (HFF) modules. These modules are also able to facilitate the learning of various levels of features. Moreover, Spatial-Channel Transformer Point Up-sampling (SCT) is devised to guide the decoder to construct a complete and fine-grained point cloud by leveraging enhanced point patches from our point patches contrastive learning. Extensive experiments demonstrate that our PPCL can achieve better quantitive and qualitative performance over off-the-shelf methods across various datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"581-596"},"PeriodicalIF":8.4,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06  DOI: 10.1109/TMM.2024.3521671
Yunlong Tang;Yuxuan Wan;Lei Qi;Xin Geng
Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily builds upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from post-encoder features that contain both semantic and style information, remain directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, which generates style word vectors via random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.
{"title":"DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization","authors":"Yunlong Tang;Yuxuan Wan;Lei Qi;Xin Geng","doi":"10.1109/TMM.2024.3521671","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521671","url":null,"abstract":"Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily bulids upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from features that contain both semantic and style information after the encoder, are directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, responsible for generating style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"120-132"},"PeriodicalIF":8.4,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03  DOI: 10.1109/TMM.2024.3501532
"List of Reviewers," IEEE Transactions on Multimedia, vol. 26, pp. 11428-11439.
Pub Date: 2025-01-01  DOI: 10.1109/TMM.2024.3521676
Kefan Tang;Lihuo He;Nannan Wang;Xinbo Gao
Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.
{"title":"Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding","authors":"Kefan Tang;Lihuo He;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521676","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521676","url":null,"abstract":"Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"95-107"},"PeriodicalIF":8.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}