Learning to build 3D scene graphs is essential for structured and rich perception of real-world scenes. However, previous 3D scene graph generation methods rely on fully supervised learning and require large amounts of entity-level annotations for objects and relations, which are extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, 3D-VLAP exploits the strong ability of current large-scale visual-linguistic models to align the semantics of texts and 2D images, together with the naturally existing correspondences between 2D images and 3D point clouds, to implicitly construct correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby aligning 3D point clouds with 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. Pseudo labels for objects and relations are then produced for training 3D-VLAP by computing the similarity between the visual embeddings and the textual category embeddings of objects and relations encoded by the visual-linguistic model. Finally, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that 3D-VLAP achieves results comparable to current fully supervised methods while greatly alleviating the data annotation burden.
{"title":"Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling","authors":"Xu Wang;Yifan Li;Qiudan Zhang;Wenhui Wu;Mark Junjie Li;Lin Ma;Jianmin Jiang","doi":"10.1109/TMM.2024.3443670","DOIUrl":"10.1109/TMM.2024.3443670","url":null,"abstract":"Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that our 3D-VLAP achieves comparable results with current fully supervised methods, meanwhile alleviating the data annotation pressure.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11164-11175"},"PeriodicalIF":8.4,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-15 | DOI: 10.1109/TMM.2024.3443664
Zhe Zhang;Yi Yu;Atsuhiro Takasu
Melody-to-lyrics generation at the syllable level is an intriguing and challenging topic at the intersection of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is less explored, yet it is important for helping users produce diverse, desired lyrics. In this work, we propose a controllable melody-to-lyrics model that generates syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and to predict lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verify that the proposed model generates higher-quality lyrics than previous methods and can interact with users for controllable and diverse lyrics generation. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source code for this work will be made publicly available for further reference and exploration.
{"title":"Controllable Syllable-Level Lyrics Generation From Melody With Prior Attention","authors":"Zhe Zhang;Yi Yu;Atsuhiro Takasu","doi":"10.1109/TMM.2024.3443664","DOIUrl":"10.1109/TMM.2024.3443664","url":null,"abstract":"Melody-to-lyrics generation, which is based on syllable-level generation, is an intriguing and challenging topic in the interdisciplinary field of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is also less explored but important for facilitating humans to generate diverse desired lyrics. In this work, we propose a controllable melody-to-lyrics model that is able to generate syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and predict the lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verified that our proposed model has the ability to generate higher-quality lyrics than previous methods and the feasibility of interacting with users for controllable and diverse lyrics generation. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source codes for this work will be made publicly available for further reference and exploration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11083-11094"},"PeriodicalIF":8.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637751","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.
{"title":"Anti-Collapse Loss for Deep Metric Learning","authors":"Xiruo Jiang;Yazhou Yao;Xili Dai;Fumin Shen;Liqiang Nie;Heng-Tao Shen","doi":"10.1109/TMM.2024.3443616","DOIUrl":"10.1109/TMM.2024.3443616","url":null,"abstract":"Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11139-11150"},"PeriodicalIF":8.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443672
Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He
Video moment retrieval (VMR) aims to locate the moment in an untrimmed video that corresponds to a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve VMR from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension and design a specific module for each behavior, yielding a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we identify behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module that establishes contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module that explores direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module that verifies the overall correspondence between candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments on three benchmarks, TACoS, ActivityNet Captions, and Charades-STA, demonstrate that the proposed framework outperforms state-of-the-art methods.
{"title":"Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval","authors":"Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He","doi":"10.1109/TMM.2024.3443672","DOIUrl":"10.1109/TMM.2024.3443672","url":null,"abstract":"Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module is used to establish contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module is used to explore direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module is used to verify the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions and Charades-STA demonstrate that the proposed framework outperforms State-of-the-Art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11044-11056"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443591
Yonghao Dong;Le Wang;Sanping Zhou;Gang Hua;Changyin Sun
Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects invalid and negative pedestrian character information, which harms the trajectory representation and thus degrades performance. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, comprising sparse category and sparse temporal character graphs, to learn the different effects of various characters in the category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate the different effects of various characters and show that TSNet outperforms approaches that do not eliminate negative characters.
{"title":"Sparse Pedestrian Character Learning for Trajectory Prediction","authors":"Yonghao Dong;Le Wang;Sanping Zhou;Gang Hua;Changyin Sun","doi":"10.1109/TMM.2024.3443591","DOIUrl":"10.1109/TMM.2024.3443591","url":null,"abstract":"Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches without eliminating negative characters.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11070-11082"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443609
Shuoyao Wang;Jiawei Lin;Yu Dai
With the advancement of wireless technology, the fifth-generation mobile communication network (5G) can provide exceptionally high bandwidth to support high-quality video streaming services. Nevertheless, 5G links exhibit substantial fluctuations, posing a significant challenge to the reliability of video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with a meta-learning framework to cope with gradient estimation noise under network fluctuation. To further improve robustness, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, enabling dedicated network structures to be defined for perceiving each type accurately. Experimental results on real-world network trace datasets show that MMVS delivers an additional 6% average QoE in mmWave 5G networks and outperforms representative benchmarks across six pairs of heterogeneous networks and user preferences.
{"title":"MMVS: Enabling Robust Adaptive Video Streaming for Wildly Fluctuating and Heterogeneous Networks","authors":"Shuoyao Wang;Jiawei Lin;Yu Dai","doi":"10.1109/TMM.2024.3443609","DOIUrl":"10.1109/TMM.2024.3443609","url":null,"abstract":"With the advancement of wireless technology, the fifth-generation mobile communication network (5G) has the capability to provide exceptionally high bandwidth for supporting high-quality video streaming services. Nevertheless, this network exhibits substantial fluctuations, posing a significant challenge in ensuring the reliability of video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with the meta-learning framework to cope with the gradient estimation noise in network fluctuation. To further improve the robustness of the algorithm, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, thus enabling the precise definition of distinct network structures for perceiving them accurately. The experimental results on network trace datasets in real-world scenarios illustrate that MMVS is capable of delivering an additional 6% average QoE in mmWave 5G network, and outperform the representative benchmarks in six pairs of heterogeneous networks and user preferences.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11018-11030"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443661
Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo
Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling object-level spatio-temporal reasoning via graph neural networks. However, existing graph network-based models still have deficiencies when constructing spatio-temporal relationships between objects: (1) spatio-temporal constraints between objects are not considered when defining the adjacency relationship; and (2) the semantic correlation between objects is not fully considered when generating edge weights. As a result, these models lack a representation of spatio-temporal interaction between objects, which directly limits object relation reasoning. To solve these problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and object consistency. Plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneous graph more accurately restores the spatio-temporal relationships between objects and strengthens the model's object-level spatio-temporal reasoning ability. Based on this graph, this paper proposes the Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on the benchmark MSVD-QA and FrameQA datasets and demonstrates competitive results on the benchmark MSRVTT-QA and ActivityNet-QA datasets. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of the hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.
{"title":"HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA","authors":"Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo","doi":"10.1109/TMM.2024.3443661","DOIUrl":"10.1109/TMM.2024.3443661","url":null,"abstract":"Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling objects spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) The lack of consideration of the spatio-temporal constraints between objects when defining the adjacency relationship; (2) The semantic correlation between objects is not fully considered when generating edge weights. These make the model lack representation of spatio-temporal interaction between objects, which directly affects the ability of object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneity graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on benchmark MSRVTT-QA and ActivityNet-QA dataset. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11176-11190"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443637
Yuqi Jiang;Jing Li;Haidong Qin;Yanran Dai;Jing Liu;Guodong Zhang;Canbin Zhang;Tao Yang
We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality reconstruction of multiple human meshes in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings from dense inputs, accurately depicting surfaces and generating detailed meshes remains challenging. Our approach combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multi-person mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. We then develop an improved SFS-based mesh reconstruction method, mainly by adding viewpoints rendered through 3DGS and obtaining a more accurate surface, to achieve higher-quality reconstructed models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. Experimental results from real basketball game scenarios demonstrate significant improvements of our approach for reconstructing multiple human body models in complex sports settings.
{"title":"GS-SFS: Joint Gaussian Splatting and Shape-From-Silhouette for Multiple Human Reconstruction in Large-Scale Sports Scenes","authors":"Yuqi Jiang;Jing Li;Haidong Qin;Yanran Dai;Jing Liu;Guodong Zhang;Canbin Zhang;Tao Yang","doi":"10.1109/TMM.2024.3443637","DOIUrl":"10.1109/TMM.2024.3443637","url":null,"abstract":"We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality multiple human mesh reconstruction in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings with dense inputs, accurate depiction of surfaces and generation of detailed meshes is still challenging. Our approach uniquely combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multiperson mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including the human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. Then, we develop an improved mesh reconstruction method based on SFS, mainly by adding additional viewpoints through 3DGS and obtaining a more accurate surface to achieve higher-quality reconstruction models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS's implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. The experimental results from real basketball game scenarios demonstrate the significant improvements of our approach for multiple human body model reconstruction in complex sports settings.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11095-11110"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-modal registration and fusion of infrared and visible imagery can generate more comprehensive representations of object and scene information. Previous frameworks primarily address the modality disparities between individual static image pairs and the impact of preserving diverse modality information on registration and fusion performance. However, they overlook practical deployment on real-world devices, particularly for video streams. Consequently, the resulting video streams often suffer from unstable registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and a spatial-temporal calibration module to achieve stable registration of video sequences. RCVS then combines a fast, lightweight fusion network to provide stable fused video streams for infrared and visible imaging. Additionally, we collect an infrared and visible video dataset, HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, the proposed framework and the HDO dataset offer the first effective and comprehensive benchmark in this field, addressing the stability and real-time challenges of infrared and visible video stream fusion while assessing the performance of different solutions to foster development in this area.
{"title":"RCVS: A Unified Registration and Fusion Framework for Video Streams","authors":"Housheng Xie;Meng Sang;Yukuan Zhang;Yang Yang;Shan Zhao;Jianbo Zhong","doi":"10.1109/TMM.2024.3443673","DOIUrl":"10.1109/TMM.2024.3443673","url":null,"abstract":"The infrared and visible cross-modal registration and fusion can generate more comprehensive representations of object and scene information. Previous frameworks primarily focus on addressing the modality disparities and the impact of preserving diverse modality information on the performance of registration and fusion tasks among different static image pairs. However, these frameworks overlook the practical deployment on real-world devices, particularly in the context of video streams. Consequently, the resulting video streams often suffer from instability in registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and spatial-temporal calibration module to achieve stable registration of video sequences. Subsequently, RCVS combines a fast lightweight fusion network to provide stable fusion video streams for infrared and visible imaging. Additionally, we collect a infrared and visible video dataset HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. Our RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, our proposed framework and HDO dataset offer the first effective and comprehensive benchmark in this field, solving stability and real-time challenges in infrared and visible video stream fusion while assessing different solution performances to foster development in this area.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11031-11043"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443623
Heng Huang;Lin Zhao;Haixing Dai;Lu Zhang;Xintao Hu;Dajiang Zhu;Tianming Liu
Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most visual attention studies have adopted eye-tracking data rather than direct measurements of brain activity to characterize human visual attention. In addition, the adversarial relationship between attention-related objects and the attention-neglected background in the human visual system has not been fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) that characterizes human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition between attention-related and attention-neglected objects to identify and locate, in an unsupervised manner, the visual objects in a movie frame that the human brain focuses on. We use independent eye-tracking data as ground truth for validation, and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. BI-AVAN contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide model design in artificial intelligence (AI), e.g., deep neural networks.
{"title":"BI-AVAN: A Brain-Inspired Adversarial Visual Attention Network for Characterizing Human Visual Attention From Neural Activity","authors":"Heng Huang;Lin Zhao;Haixing Dai;Lu Zhang;Xintao Hu;Dajiang Zhu;Tianming Liu","doi":"10.1109/TMM.2024.3443623","DOIUrl":"10.1109/TMM.2024.3443623","url":null,"abstract":"Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most of the visual attention studies adopted eye-tracking data rather than the direct measurement of brain activity to characterize human visual attention. In addition, the adversarial relationship between the attention-related objects and attention-neglected background in the human visual system was not fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner. We use independent eye-tracking data as ground truth for validation and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. Our BI-AVAN model contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide the model design in artificial intelligence (AI), e.g., deep neural networks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11191-11203"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}