
Latest Publications in IEEE Transactions on Multimedia

Phase-shifted tACS can modulate cortical alpha waves in human subjects.
IF 3.1 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-01 | Epub Date: 2023-08-29 | DOI: 10.1007/s11571-023-09997-1
Alexandre Aksenov, Malo Renaud-D'Ambra, Vitaly Volpert, Anne Beuter

In the present study, we investigated traveling waves induced by transcranial alternating current stimulation in the alpha frequency band of healthy subjects. Electroencephalographic data were recorded in 12 healthy subjects before, during, and after phase-shifted stimulation with a device combining both electroencephalographic and stimulation capabilities. In addition, we analyzed the results of numerical simulations and compared them with the results of an identical analysis of real EEG data. The numerical simulations indicate that the imposed transcranial alternating current stimulation induces a rotating electric field. The wave direction induced by stimulation was observed more often for at least 30 s after the end of stimulation, demonstrating aftereffects of the stimulation. The results suggest that the proposed approach could be used to modulate the interaction between distant areas of the cortex. Non-invasive transcranial alternating current stimulation can thus be used to facilitate the propagation of circulating waves at a particular frequency and in a controlled direction. These results open new opportunities for developing innovative and personalized transcranial alternating current stimulation protocols to treat various neurological disorders.

Supplementary information: The online version contains supplementary material available at 10.1007/s11571-023-09997-1.
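As a rough illustration of what phase-shifted stimulation means in practice, the sketch below generates alpha-band sinusoids with evenly spaced phase offsets across a small electrode array, so the waveform peak sweeps around the array like a traveling pattern. The sampling rate, frequency, duration, and electrode count are assumed values for illustration, not parameters from the study.

```python
import numpy as np

# Assumed parameters for illustration only; the study's stimulation settings are not given here.
fs = 1000            # sampling rate, Hz
f_alpha = 10.0       # stimulation frequency in the alpha band, Hz
duration = 2.0       # seconds
n_electrodes = 5     # electrodes arranged around the target cortical region

t = np.arange(0, duration, 1.0 / fs)

# Evenly spaced phase offsets so the sinusoid's peak sweeps across the array,
# approximating a traveling (rotating) stimulation pattern.
phases = np.linspace(0.0, 2.0 * np.pi, n_electrodes, endpoint=False)
currents = np.stack([np.sin(2.0 * np.pi * f_alpha * t + p) for p in phases])

print(currents.shape)  # (5, 2000): one phase-shifted waveform per electrode
```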

{"title":"Phase-shifted tACS can modulate cortical alpha waves in human subjects.","authors":"Alexandre Aksenov, Malo Renaud-D'Ambra, Vitaly Volpert, Anne Beuter","doi":"10.1007/s11571-023-09997-1","DOIUrl":"10.1007/s11571-023-09997-1","url":null,"abstract":"<p><p>In the present study, we investigated traveling waves induced by transcranial alternating current stimulation in the alpha frequency band of healthy subjects. Electroencephalographic data were recorded in 12 healthy subjects before, during, and after phase-shifted stimulation with a device combining both electroencephalographic and stimulation capacities. In addition, we analyzed the results of numerical simulations and compared them to the results of identical analysis on real EEG data. The results of numerical simulations indicate that imposed transcranial alternating current stimulation induces a rotating electric field. The direction of waves induced by stimulation was observed more often during at least 30 s after the end of stimulation, demonstrating the presence of aftereffects of the stimulation. Results suggest that the proposed approach could be used to modulate the interaction between distant areas of the cortex. Non-invasive transcranial alternating current stimulation can be used to facilitate the propagation of circulating waves at a particular frequency and in a controlled direction. The results presented open new opportunities for developing innovative and personalized transcranial alternating current stimulation protocols to treat various neurological disorders.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11571-023-09997-1.</p>","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"24 1","pages":"1575-1592"},"PeriodicalIF":3.1,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11297852/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"52867081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Guest Editorial Introduction to the Issue on Pre-Trained Models for Multi-Modality Understanding
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-31 | DOI: 10.1109/TMM.2024.3384680
Wengang Zhou;Jiajun Deng;Niculae Sebe;Qi Tian;Alan L. Yuille;Concetto Spampinato;Zakia Hammal
In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be paramount in pushing the boundaries of what technology can achieve in understanding and interacting with the world around us. Accordingly, multi-modality understanding has attracted a tremendous amount of research, establishing itself as an emerging topic. Pre-trained models, in particular, have revolutionized this field, providing a way to leverage vast amounts of data without task-specific annotation to facilitate various downstream tasks.
{"title":"Guest Editorial Introduction to the Issue on Pre-Trained Models for Multi-Modality Understanding","authors":"Wengang Zhou;Jiajun Deng;Niculae Sebe;Qi Tian;Alan L. Yuille;Concetto Spampinato;Zakia Hammal","doi":"10.1109/TMM.2024.3384680","DOIUrl":"10.1109/TMM.2024.3384680","url":null,"abstract":"In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be paramount in pushing the boundaries of what technology can achieve in understanding and interacting with the world around us. Accordingly, multi-modality understanding has attracted a tremendous amount of research, establishing itself as an emerging topic. Pre-trained models, in particular, have revolutionized this field, providing a way to leverage vast amounts of data without task-specific annotation to facilitate various downstream tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"8291-8296"},"PeriodicalIF":8.4,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10616245","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141862636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-19 | DOI: 10.1109/TMM.2024.3396272
Xun Jiang;Xing Xu;Zailei Zhou;Yang Yang;Fumin Shen;Heng Tao Shen
Given an untrimmed video and a text query, Video Moment Retrieval (VMR) aims at retrieving a specific moment where the video content is semantically related to the text query. Conventional VMR methods rely on video-text paired data or specific temporal annotations for each target event. However, the subjectivity and time-consuming nature of the labeling process limit their practicality in multimedia applications. To address this issue, researchers recently proposed a Zero-Shot Learning setting for VMR (ZS-VMR) that trains VMR models without manual supervision signals, thereby reducing the data cost. In this paper, we tackle the challenging ZS-VMR problem with Angular Reconstructive Text embeddings (ART), generalizing the image-text matching pre-trained model CLIP to the VMR task. Specifically, assuming that visual embeddings are close to their semantically related text embeddings in angular space, our ART method generates pseudo-text embeddings of video event proposals through the hypersphere of CLIP. Moreover, to address the temporal nature of videos, we also design local multimodal fusion learning to narrow the gaps between image-text matching and video-text matching. Our experimental results on two widely used VMR benchmarks, Charades-STA and ActivityNet-Captions, show that our method outperforms current state-of-the-art ZS-VMR methods. It also achieves competitive performance compared to recent weakly-supervised VMR methods.
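As a minimal sketch of the angular-space assumption (not the authors' ART implementation), the snippet below samples a pseudo-text embedding within a small angle of a visual embedding on the unit hypersphere. The function name, embedding dimension, and maximum angle are arbitrary, and a random vector stands in for a CLIP image feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_text_embedding(v, max_angle_rad=0.2):
    """Sample a unit vector within a small angle of the visual embedding v."""
    v = v / np.linalg.norm(v)
    r = rng.normal(size=v.shape)
    u = r - (r @ v) * v              # remove the component of r along v
    u /= np.linalg.norm(u)           # unit vector orthogonal to v
    theta = rng.uniform(0.0, max_angle_rad)
    return np.cos(theta) * v + np.sin(theta) * u   # unit norm, close to v in angle

visual = rng.normal(size=512)        # stand-in for a CLIP image embedding of an event proposal
pseudo = pseudo_text_embedding(visual)
angle = np.degrees(np.arccos(np.clip(pseudo @ (visual / np.linalg.norm(visual)), -1.0, 1.0)))
print(f"angular distance: {angle:.2f} degrees")
```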
{"title":"Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings","authors":"Xun Jiang;Xing Xu;Zailei Zhou;Yang Yang;Fumin Shen;Heng Tao Shen","doi":"10.1109/TMM.2024.3396272","DOIUrl":"10.1109/TMM.2024.3396272","url":null,"abstract":"Given an untrimmed video and a text query, Video Moment Retrieval (VMR) aims at retrieving a specific moment where the video content is semantically related to the text query. Conventional VMR methods rely on video-text paired data or specific temporal annotations for each target event. However, the subjectivity and time-consuming nature of the labeling process limit their practicality in multimedia applications. To address this issue, recently researchers proposed a Zero-Shot Learning setting for VMR (ZS-VMR) that trains VMR models without manual supervision signals, thereby reducing the data cost. In this paper, we tackle the challenging ZS-VMR problem with \u0000<italic>Angular Reconstructive Text embeddings (ART)</i>\u0000, generalizing the image-text matching pre-trained model CLIP to the VMR task. Specifically, assuming that visual embeddings are close to their semantically related text embeddings in angular space, our ART method generates pseudo-text embeddings of video event proposals through the hypersphere of CLIP. Moreover, to address the temporal nature of videos, we also design local multimodal fusion learning to narrow the gaps between image-text matching and video-text matching. Our experimental results on two widely used VMR benchmarks, Charades-STA and ActivityNet-Captions, show that our method outperforms current state-of-the-art ZS-VMR methods. It also achieves competitive performance compared to recent weakly-supervised VMR methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9657-9670"},"PeriodicalIF":8.4,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-03 | DOI: 10.1109/TMM.2024.3414549
Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao
Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and localization of PEAs are crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing, and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels and obtain the final PEA labels with fine-grained locations. In addition, we provide Mean Opinion Score (MOS) values for all compressed video sequences. Experimental results show the effectiveness of the FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that the FPEA database will benefit the future development of VCAR, VQA, and perception-aware video encoders. The FPEA database has been made publicly available.
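A toy version of the label-voting step might look like the following; the function name, vote threshold, and mask resolution are assumptions, and the paper's additional feature-matching stage is omitted.

```python
import numpy as np

def fuse_pea_labels(subject_masks, min_votes=3):
    """Keep a pixel in the final PEA mask only if enough subjects marked it."""
    votes = np.sum(subject_masks, axis=0)           # per-pixel vote count
    return (votes >= min_votes).astype(np.uint8)    # fused fine-grained PEA mask

# Toy data: five subjects annotating a 4x4 region of one frame.
rng = np.random.default_rng(1)
masks = rng.integers(0, 2, size=(5, 4, 4))
print(fuse_pea_labels(masks))
```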
{"title":"Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset","authors":"Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao","doi":"10.1109/TMM.2024.3414549","DOIUrl":"10.1109/TMM.2024.3414549","url":null,"abstract":"Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most of existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and location of PEAs is crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs, VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels to obtain the final PEA labels with fine-grained locations. Besides, we also provide Mean Opinion Score (MOS) values of all compressed video sequences. Experimental results show the effectiveness of FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that FPEA database will benefit the future development of VCAR, VQA and perception-aware video encoders. The FPEA database has been made publicly available.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10816-10827"},"PeriodicalIF":8.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Human-Centric Behavior Description in Videos: New Benchmark and Model
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-02 | DOI: 10.1109/TMM.2024.3414263
Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang
In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment of and response to potential risks and ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions of each individual's specific behavior. Moreover, mere descriptions at the video level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail at the person level, achieving state-of-the-art results.
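To make the person-level annotation idea concrete, here is a hypothetical record layout; the class name, field names, and values are illustrative and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PersonAnnotation:
    """Hypothetical person-level record for a surveillance captioning dataset."""
    person_id: int
    video_id: str
    bbox: List[float]                                       # [x, y, w, h] location in the frame
    clothing: str                                           # e.g. "dark coat, blue jeans"
    interactions: List[str] = field(default_factory=list)   # e.g. ["walks beside person 2"]
    behavior_caption: str = ""                              # free-text behavior description

record = PersonAnnotation(
    person_id=7, video_id="surv_0001",
    bbox=[120.0, 60.0, 48.0, 130.0],
    clothing="dark coat",
    interactions=["walks beside person 2"],
    behavior_caption="A person in a dark coat walks across the lobby while talking.",
)
print(record)
```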
{"title":"Human-Centric Behavior Description in Videos: New Benchmark and Model","authors":"Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang","doi":"10.1109/TMM.2024.3414263","DOIUrl":"10.1109/TMM.2024.3414263","url":null,"abstract":"In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10867-10878"},"PeriodicalIF":8.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-26 | DOI: 10.1109/TMM.2024.3405724
Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su
Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods neglect to simultaneously explore modality discrepancy and preserve modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality by gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate a feature-domain disentangled modulation process and a category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former modulation process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter modulation process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms the state-of-the-art methods.
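As a loose analogue of instance-level modulation over parallel modality streams (not the MPMNet architecture), the sketch below gates and aggregates per-modality features; the class name, feature dimension, gating form, and modality count are assumptions.

```python
import torch
import torch.nn as nn

class GatedModalityAggregation(nn.Module):
    """Gate and sum per-modality features with instance-conditioned weights."""
    def __init__(self, dim=256, n_modalities=3):
        super().__init__()
        # One small gate per modality, conditioned on that modality's own features.
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_modalities)])

    def forward(self, feats):                       # feats: list of (B, dim) tensors
        scores = torch.cat([g(f) for g, f in zip(self.gates, feats)], dim=1)
        weights = torch.softmax(scores, dim=1)      # instance-level modality weights
        stacked = torch.stack(feats, dim=1)         # (B, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)

visual, acoustic, textual = (torch.randn(4, 256) for _ in range(3))
fused = GatedModalityAggregation()([visual, acoustic, textual])
print(fused.shape)                                  # torch.Size([4, 256])
```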
{"title":"Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification","authors":"Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su","doi":"10.1109/TMM.2024.3405724","DOIUrl":"10.1109/TMM.2024.3405724","url":null,"abstract":"Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods neglect the simultaneous consideration of exploring modality discrepancy and preserving modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality through gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate feature-domain disentangled modulation process and category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former modulation process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter modulation process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10134-10144"},"PeriodicalIF":8.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3400675
Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong
Owing to its capacity for full-time target search, cross-modality vehicle re-identification based on unmanned aerial vehicles (UAV) is gaining attention in both video surveillance and public security. However, this promising and innovative line of research has not been studied sufficiently due to data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with 16015 RGB and 13913 infrared images. Moreover, to address the cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn shared, discriminative, orientation-invariant features. For the first challenge, we propose a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality-shared information. For the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct the orientation-invariant feature separation task. Comprehensive experiments are carried out to validate the effectiveness of the proposed method.
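A minimal sketch of the hybrid-weights idea, assuming modality-specific stems feeding a shared trunk; the class name and layer sizes are arbitrary, and the actual HWDNet weight restrainer and objective are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridWeightsSiamese(nn.Module):
    """Two-stream network: separate RGB/IR stems, shared embedding trunk."""
    def __init__(self, feat_dim=128):
        super().__init__()
        def stem():   # modality-specific shallow layers
            return nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_stem, self.ir_stem = stem(), stem()
        self.shared = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(64, feat_dim))   # modality-shared layers

    def forward(self, x, modality):
        h = self.rgb_stem(x) if modality == "rgb" else self.ir_stem(x)
        return F.normalize(self.shared(h), dim=1)   # unit-norm Re-ID embedding

net = HybridWeightsSiamese()
rgb, ir = torch.randn(2, 3, 128, 64), torch.randn(2, 3, 128, 64)  # IR replicated to 3 channels
print(net(rgb, "rgb").shape, net(ir, "ir").shape)
```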
{"title":"Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification","authors":"Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong","doi":"10.1109/TMM.2024.3400675","DOIUrl":"10.1109/TMM.2024.3400675","url":null,"abstract":"Owing to the capacity of performing full-time target searches, cross-modality vehicle re-identification based on unmanned aerial vehicles (UAV) is gaining more attention in both video surveillance and public security. However, this promising and innovative research has not been studied sufficiently due to the issue of data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with \u0000<bold>16015</b>\u0000 RGB and \u0000<bold>13913</b>\u0000 infrared images. Moreover, to meet cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn the shared discriminative orientation-invariant features. For the first challenge, we proposed a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality shared information. In terms of the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct orientation-invariant feature separation task. Comprehensive experiments are carried out to validate the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9839-9853"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3417694
Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen
We study the task of image retrieval with text feedback, where a reference image and modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods typically obtain multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings more challenges, as it requires not only a synergistic understanding of the semantics of the heterogeneous multimodal inputs but also the ability to accurately build the underlying semantic correlation in each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, our method employs a contrastive loss in the feature encoders to learn meaningful multimodal representations while placing the subsequent correlation modeling in a more harmonious space. We then propose to learn the accurate correlation between the composed inputs and the target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between the joint representation and the target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image and each query element. The composition-and-decomposition paradigm forms a closed loop, in which the two networks can be optimized simultaneously and promote each other's performance. Extensive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.
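The composition side of such a pipeline can be sketched as below, assuming precomputed reference-image, modification-text, and target-image embeddings; the class name, fusion MLP, embedding size, and temperature are placeholders rather than the AlRet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionHead(nn.Module):
    """Fuse a reference-image embedding and a modification-text embedding into a joint query."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        return F.normalize(self.fuse(torch.cat([img_emb, txt_emb], dim=-1)), dim=-1)

def contrastive_loss(query, target, temperature=0.07):
    """InfoNCE-style loss matching each composed query to its own target image."""
    logits = query @ F.normalize(target, dim=-1).t() / temperature
    labels = torch.arange(query.size(0))          # diagonal pairs are positives
    return F.cross_entropy(logits, labels)

ref, txt, tgt = (torch.randn(8, 512) for _ in range(3))   # stand-in embeddings
query = CompositionHead()(ref, txt)
print(contrastive_loss(query, tgt).item())
```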
{"title":"Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback","authors":"Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen","doi":"10.1109/TMM.2024.3417694","DOIUrl":"10.1109/TMM.2024.3417694","url":null,"abstract":"We study the task of image retrieval with text feedback, where a reference image and modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods always get the multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings more challenges as it requires not only the synergistic understanding of the semantics from the heterogeneous multimodal inputs but also the ability to accurately build the underlying semantic correlation existing in each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, our proposed methods employ the contrastive loss in the feature encoders to learn meaningful multimodal representation while making the subsequent correlation modeling process in a more harmonious space. Then we propose to learn the accurate correlation between the composed inputs and target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between the joint representation and target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image with each query element. The composition-and-decomposition paradigm forms a closed loop, which can be optimized simultaneously to promote each other in the performance. Massive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9936-9948"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-20 | DOI: 10.1109/TMM.2024.3414660
Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang
Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.
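As a simplified stand-in for "learn from traffic history, anticipate future patterns" (DeepSpoof itself is formulated with deep reinforcement learning, which is not reproduced here), the sketch below predicts the next traffic feature vector from a history window with an LSTM; the class name, feature set, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TrafficPredictor(nn.Module):
    """Predict the next wireless-traffic feature vector from a history window."""
    def __init__(self, n_features=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, history):                 # history: (B, T, n_features)
        out, _ = self.lstm(history)
        return self.head(out[:, -1])            # predicted next-step features

# Assumed per-step features: packet interval, length, channel index, RSSI.
history = torch.randn(16, 50, 4)
print(TrafficPredictor()(history).shape)        # torch.Size([16, 4])
```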
{"title":"DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication","authors":"Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang","doi":"10.1109/TMM.2024.3414660","DOIUrl":"10.1109/TMM.2024.3414660","url":null,"abstract":"Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10879-10891"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-20 | DOI: 10.1109/TMM.2024.3405660
Xinran Li;Zichi Wang;Guorui Feng;Xinpeng Zhang;Chuan Qin
Due to the limited number of stable image feature descriptors and the simplistic concatenation approach to hash generation, existing hashing methods have not achieved a satisfactory balance between robustness and discrimination. To this end, a novel perceptual hashing method is proposed in this paper using feature fusion of fractional-order continuous orthogonal moments (FrCOMs). Specifically, two robust image descriptors, i.e., fractional-order Chebyshev Fourier moments (FrCHFMs) and fractional-order radial harmonic Fourier moments (FrRHFMs), are used to extract global structural features of a color image. Then, the canonical correlation analysis (CCA) strategy is employed to fuse these features during the final hash generation process. Compared to direct concatenation, CCA excels in eliminating redundancies between feature vectors, resulting in a shorter hash sequence and higher authentication performance. A series of experiments demonstrates that the proposed method achieves satisfactory robustness, discrimination, and security. In particular, the proposed method exhibits better tampering-detection ability and robustness against combined content-preserving manipulations in practical applications.
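A generic example of CCA-based feature fusion followed by sign quantization is shown below; random matrices stand in for the FrCHFM and FrRHFM features, and the function name, component count, and binarization rule are assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fused_hash(feats_a, feats_b, n_components=16):
    """Fuse two feature matrices with CCA and binarize by sign to get per-image hashes."""
    cca = CCA(n_components=n_components)
    a_c, b_c = cca.fit_transform(feats_a, feats_b)     # projected, maximally correlated views
    fused = np.concatenate([a_c, b_c], axis=1)         # much shorter than raw concatenation
    return (fused > 0).astype(np.uint8)                # one binary hash per image

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(100, 64))   # stand-in for FrCHFM features of 100 images
feats_b = rng.normal(size=(100, 64))   # stand-in for FrRHFM features of the same images
print(cca_fused_hash(feats_a, feats_b).shape)          # (100, 32)
```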
{"title":"Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments","authors":"Xinran Li;Zichi Wang;Guorui Feng;Xinpeng Zhang;Chuan Qin","doi":"10.1109/TMM.2024.3405660","DOIUrl":"10.1109/TMM.2024.3405660","url":null,"abstract":"Due to the limited number of stable image feature descriptors and the simplistic concatenation approach to hash generation, existing hashing methods have not achieved a satisfactory balance between robustness and discrimination. To this end, a novel perceptual hashing method is proposed in this paper using feature fusion of fractional-order continuous orthogonal moments (FrCOMs). Specifically, two robust image descriptors, i.e., fractional-order Chebyshev Fourier moments (FrCHFMs) and fractional-order radial harmonic Fourier moments (FrRHFMs), are used to extract global structural features of a color image. Then, the canonical correlation analysis (CCA) strategy is employed to fuse these features during the final hash generation process. Compared to direct concatenation, CCA excels in eliminating redundancies between feature vectors, resulting in a shorter hash sequence and higher authentication performance. A series of experiments demonstrate that the proposed method achieves satisfactory robustness, discrimination and security. Particularly, the proposed method exhibits better tampering detection ability and robustness against combined content-preserving manipulations in practical applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10041-10054"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0