Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object in the current unseen environment based on the pre-learned object category relations and its own trajectory. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects along the temporal, spatial, and regional dimensions of observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical movement or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the currently observed objects based on CRG and past observations. Finally, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and readily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The code is included in the supplementary material and will be made publicly available.
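As a rough illustration of how such a temporal → spatial → region attention stack could be wired, the PyTorch sketch below chains three attention modules over trajectory, graph-node, and region features. All module names, tensor shapes, and the use of `nn.MultiheadAttention` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Temporal -> Spatial -> Region attention stack (TSR).
# Shapes, module names, and the use of nn.MultiheadAttention are assumptions.
import torch
import torch.nn as nn


class TSRSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.region = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obs_seq, graph_nodes, regions, target):
        # obs_seq:     (B, T, D)  per-step observation features (trajectory)
        # graph_nodes: (B, N, D)  category-graph-enhanced object features
        # regions:     (B, R, D)  region features of the current frame
        # target:      (B, 1, D)  embedding of the goal object category
        hist, _ = self.temporal(obs_seq, obs_seq, obs_seq)              # temporal context
        cur = hist[:, -1:, :]                                           # current-step summary
        spatial_ctx, _ = self.spatial(cur, graph_nodes, graph_nodes)    # object layout context
        # shift attention to target-relevant regions of the current observation
        region_ctx, attn = self.region(target + spatial_ctx, regions, regions)
        return region_ctx.squeeze(1), attn                              # (B, D), (B, 1, R)


if __name__ == "__main__":
    m = TSRSketch()
    out, attn = m(torch.randn(2, 8, 256), torch.randn(2, 22, 256),
                  torch.randn(2, 49, 256), torch.randn(2, 1, 256))
    print(out.shape, attn.shape)  # torch.Size([2, 256]) torch.Size([2, 1, 49])
```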
{"title":"Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation","authors":"Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv","doi":"10.1145/3653714","DOIUrl":"https://doi.org/10.1145/3653714","url":null,"abstract":"<p>Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object based on the pre-learned object category relations and its trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects in temporal, spatial, and regions within observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical moving or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the current observation objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code and will be publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"16 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing prevalence of cloud computing platforms, ensuring data privacy during cloud-based image services such as classification has become crucial. In this study, we propose a novel privacy-preserving image classification scheme that enables classifiers trained in the plaintext domain to be applied directly to encrypted images, without the need to retrain a dedicated classifier. Moreover, encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key. Specifically, our scheme utilizes a feature extractor and an encoder to mask the plaintext image through a newly designed Noise-like Adversarial Example (NAE). Such an NAE not only gives the encrypted image a noise-like visual appearance but also compels the target classifier to predict the ciphertext as the same label as the original plaintext image. In the decoding phase, we adopt a Symmetric Residual Learning (SRL) framework to restore the plaintext image with minimal degradation. Extensive experiments demonstrate that 1) the classification accuracy of the classifier trained in the plaintext domain remains the same in both the ciphertext and plaintext domains; 2) the encrypted images can be recovered into their original form with an average PSNR of up to 51+ dB on the SVHN dataset and 48+ dB on the VGGFace2 dataset; 3) our system exhibits satisfactory generalization on the encryption, decryption, and classification tasks across datasets different from the training one; and 4) a high level of security is achieved against three potential threat models. The code is available at https://github.com/csjunjun/RIC.git.
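The core NAE constraint described above (a noise-looking ciphertext that the frozen classifier still maps to the plaintext label) can be illustrated with a simple iterative optimization. The sketch below is only a stand-in for the paper's trained feature extractor and encoder; `make_nae`, the step count, and the learning rate are hypothetical.

```python
# Hypothetical sketch of the core Noise-like Adversarial Example (NAE) constraint:
# start from random noise and nudge it until a frozen classifier assigns it the
# plaintext image's label. The paper trains an encoder/decoder pair; this simple
# iterative optimization only illustrates the objective, not the actual scheme.
import torch
import torch.nn.functional as F


def make_nae(classifier, plain_img, steps=200, lr=0.05):
    classifier.eval()
    with torch.no_grad():
        label = classifier(plain_img).argmax(dim=1)           # label of the plaintext image
    cipher = torch.rand_like(plain_img, requires_grad=True)   # noise-like starting point
    opt = torch.optim.Adam([cipher], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = classifier(cipher)
        loss = F.cross_entropy(logits, label)                 # keep the original prediction
        loss.backward()
        opt.step()
        with torch.no_grad():
            cipher.clamp_(0.0, 1.0)                           # stay a valid image tensor
    return cipher.detach()
```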
{"title":"Recoverable Privacy-Preserving Image Classification through Noise-like Adversarial Examples","authors":"Jun Liu, Jiantao Zhou, Jinyu Tian, Weiwei Sun","doi":"10.1145/3653676","DOIUrl":"https://doi.org/10.1145/3653676","url":null,"abstract":"<p>With the increasing prevalence of cloud computing platforms, ensuring data privacy during the cloud-based image-related services such as classification has become crucial. In this study, we propose a novel privacy-preserving image classification scheme that enables the direct application of classifiers trained in the plaintext domain to classify encrypted images, without the need of retraining a dedicated classifier. Moreover, encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key. Specifically, our proposed scheme involves utilizing a feature extractor and an encoder to mask the plaintext image through a newly designed Noise-like Adversarial Example (NAE). Such an NAE not only introduces a noise-like visual appearance to the encrypted image but also compels the target classifier to predict the ciphertext as the same label as the original plaintext image. At the decoding phase, we adopt a Symmetric Residual Learning (SRL) framework for restoring the plaintext image with minimal degradation. Extensive experiments demonstrate that 1) the classification accuracy of the classifier trained in the plaintext domain remains the same in both the ciphertext and plaintext domains; 2) the encrypted images can be recovered into their original form with an average PSNR of up to 51+ dB for the SVHN dataset and 48+ dB for the VGGFace2 dataset; 3) our system exhibits satisfactory generalization capability on the encryption, decryption and classification tasks across datasets that are different from the training one; and 4) a high-level of security is achieved against three potential threat models. The code is available at https://github.com/csjunjun/RIC.git.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Long Tang, Dengpan Ye, Zhenhao Lu, Yunming Zhang, Chuanxi Chen
Face manipulation can modify a victim’s facial attributes, e.g., age or hair color, in an image, and is an important component of DeepFakes. Adversarial examples are an emerging approach to combat the threat of visual misinformation to society. To efficiently protect facial images from being forged, designing a universal face anti-manipulation disruptor is essential. However, existing works treat deepfake disruption as an end-to-end process, ignoring the functional difference between feature extraction and image reconstruction. In this work, we propose a novel Feature-Output ensemble UNiversal Disruptor (FOUND) against face manipulation networks, which takes a new perspective that treats attacking the feature-extraction (encoding) modules as the critical task in deepfake disruption. We conduct an effective two-stage disruption process: we first perform ensemble disruption on multi-model encoders, maximizing the Wasserstein distance between features before and after the adversarial attack, and then develop a gradient-ensemble strategy to enhance the disruption effect by simplifying the complex optimization problem of disrupting ensemble end-to-end models. Extensive experiments indicate that a single FOUND generated with a few facial images can successfully disrupt multiple face manipulation models on cross-attribute and cross-face images, surpassing state-of-the-art universal disruptors in both success rate and efficiency.
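A minimal sketch of the feature-level ensemble disruption idea, assuming a signed-gradient ascent loop and substituting a plain L2 feature distance for the paper's Wasserstein objective; the function name and hyperparameters are placeholders.

```python
# Hypothetical sketch of ensemble feature-level disruption: a universal perturbation
# is optimized to push each encoder's features of the perturbed face away from the
# clean features. The paper maximizes a Wasserstein distance with a gradient-ensemble
# strategy; this sketch substitutes a simple L2 feature distance for illustration.
import torch


def found_sketch(encoders, faces, eps=8 / 255, steps=40, alpha=2 / 255):
    delta = torch.zeros_like(faces[:1], requires_grad=True)    # one universal disruptor
    for _ in range(steps):
        loss = 0.0
        for enc in encoders:                                   # ensemble over encoders
            clean = enc(faces).detach()
            perturbed = enc((faces + delta).clamp(0, 1))
            loss = loss + (perturbed - clean).pow(2).mean()    # push features apart
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():                                  # signed-gradient ascent step
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)                            # keep the disruptor imperceptible
    return delta.detach()
```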
{"title":"Feature Extraction Matters More: An Effective and Efficient Universal Deepfake Disruptor","authors":"Long Tang, Dengpan Ye, Zhenhao Lu, Yunming Zhang, Chuanxi Chen","doi":"10.1145/3653457","DOIUrl":"https://doi.org/10.1145/3653457","url":null,"abstract":"<p>Face manipulation can modify a victim’s facial attributes, e.g., age or hair color, in an image, which is an important component of DeepFakes. Adversarial examples are an emerging approach to combat the threat of visual misinformation to society. To efficiently protect facial images from being forged, designing a universal face anti-manipulation disruptor is essential. However, existing works treat deepfake disruption as an end-to-end process, ignoring the functional difference between feature extraction and image reconstruction. In this work, we propose a novel <b>F</b>eature-<b>O</b>utput ensemble <b>UN</b>iversal <b>D</b>isruptor (FOUND) against face manipulation networks, which explores a new opinion considering attacking feature-extraction (encoding) modules as the critical task in deepfake disruption. We conduct an effective two-stage disruption process. We first perform ensemble disruption on multi-model encoders, maximizing the Wasserstein distance between features before and after the adversarial attack. Then develop a gradient-ensemble strategy to enhance the disruption effect by simplifying the complex optimization problem of disrupting ensemble end-to-end models. Extensive experiments indicate that one FOUND generated with a few facial images can successfully disrupt multiple face manipulation models on cross-attribute and cross-face images, surpassing state-of-the-art universal disruptors in both success rate and efficiency.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"22 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140166015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Celard, E. L. Iglesias, J. M. Sorribes-Fdez, L. Borrajo, A. Seara Vieira
Image generative models have advanced in many areas to produce synthetic images of high resolution and detail. This success has enabled their use in the biomedical field, paving the way for the generation of videos showing the biological evolution of their content. Despite the power of generative video models, their use has not yet extended to time-based development, focusing almost exclusively on generating motion in space. This situation is largely due to the lack of specific datasets and metrics for measuring the individual quality of videos, particularly when no ground truth is available for comparison. We propose a new dataset, called GoldenDOT, which tracks the evolution of apples cut in parallel over 10 days, allowing their progress to be observed over time while they remain spatially static. In addition, four new metrics are proposed that provide different analyses of the generated videos, both as a whole and individually. In this paper, the proposed dataset and metrics are used to study three state-of-the-art video generative models and their feasibility for generating videos with biological development: TemporalGAN (TGANv2), Low Dimensional Video Discriminator GAN (LDVDGAN), and Video Diffusion Model (VDM). Among them, the TGANv2 model obtains the best results on the vast majority of metrics, including established state-of-the-art measures, demonstrating the viability of the newly proposed metrics and their congruence with these standard measures.
{"title":"New Metrics and Dataset for Biological Development Video Generation","authors":"P. Celard, E. L. Iglesias, J. M. Sorribes-Fdez, L. Borrajo, A. Seara Vieira","doi":"10.1145/3653456","DOIUrl":"https://doi.org/10.1145/3653456","url":null,"abstract":"<p>Image generative models have advanced in many areas to produce synthetic images of high resolution and detail. This success has enabled its use in the biomedical field, paving the way for the generation of videos showing the biological evolution of its content. Despite the power of generative video models, their use has not yet extended to time-based development, focusing almost exclusively on generating motion in space. This situation is largely due to the lack of specific data sets and metrics to measure the individual quality of videos, particularly when there is no ground truth available for comparison. We propose a new dataset, called GoldenDOT, which tracks the evolution of apples cut in parallel over 10 days, allowing to observe their progress over time while remaining static. In addition, four new metrics are proposed that provide different analyses of the generated videos as a whole and individually. In this paper, the proposed dataset and measures are used to study three state of the art video generative models and their feasibility for video generation with biological development: TemporalGAN (TGANv2), Low Dimensional Video Discriminator GAN (LDVDGAN), and Video Diffusion Model (VDM). Among them, the TGANv2 model has managed to obtain the best results in the vast majority of metrics, including those already known in the state of the art, demonstrating the viability of the new proposed metrics and their congruence with these standard measures.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"103 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140165844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang
Training generative adversarial networks (GANs) for noise-to-image synthesis is a challenging task, primarily due to the instability of the GAN training process. One of the key issues is the generator’s sensitivity to input data, which can cause sudden fluctuations in the generator’s loss value for certain inputs. This sensitivity suggests an inadequate ability of the generator to resist disturbances, causing the discriminator’s loss value to oscillate and negatively impacting the discriminator. In turn, the discriminator’s negative feedback is not conducive to updating the generator’s parameters, leading to suboptimal image generation quality. In response to this challenge, we present an innovative GAN model equipped with a learnable auxiliary module that processes auxiliary noise. The core objective of this module is to enhance the stability of both the generator and the discriminator throughout the training process. To achieve this, we incorporate a learnable auxiliary penalty and an augmented discriminator, designed to control the generator and reinforce the discriminator’s stability, respectively. We further apply our method to the Hinge and LSGANs loss functions, illustrating its efficacy in reducing the instability of both the generator and the discriminator. Tests conducted on the LSUN, CelebA, Market-1501, and Creative Senz3D datasets serve as proof of our method’s ability to improve the training stability and overall performance of the baseline methods.
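To illustrate where a learnable auxiliary penalty could attach to a hinge objective, the sketch below adds a small trainable penalty module to the generator loss. The penalty's form, its auxiliary-noise input, and the weighting `lam` are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of hinge GAN losses with an extra learnable auxiliary penalty
# attached to the generator objective. The penalty module, its input (auxiliary
# noise), and the weighting are placeholders; the paper's exact formulation differs.
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryPenalty(nn.Module):
    """Toy learnable penalty over auxiliary noise (placeholder for the paper's module)."""
    def __init__(self, noise_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, aux_noise):
        return F.softplus(self.net(aux_noise)).mean()


def d_hinge_loss(d_real, d_fake):
    # standard hinge loss for the discriminator
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()


def g_hinge_loss(d_fake, penalty_module, aux_noise, lam=0.1):
    # hinge generator loss plus the learnable auxiliary penalty term
    return -d_fake.mean() + lam * penalty_module(aux_noise)
```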
{"title":"Generative Adversarial Networks with Learnable Auxiliary Module for Image Synthesis","authors":"Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang","doi":"10.1145/3653021","DOIUrl":"https://doi.org/10.1145/3653021","url":null,"abstract":"<p>Training generative adversarial networks (GANs) for noise-to-image synthesis is a challenge task, primarily due to the instability of GANs’ training process. One of the key issues is the generator’s sensitivity to input data, which can cause sudden fluctuations in the generator’s loss value with certain inputs. This sensitivity suggests an inadequate ability to resist disturbances in the generator, causing the discriminator’s loss value to oscillate and negatively impacting the discriminator. Then, the negative feedback of discriminator is also not conducive to updating generator’s parameters, leading to suboptimal image generation quality. In response to this challenge, we present an innovative GANs model equipped with a learnable auxiliary module that processes auxiliary noise. The core objective of this module is to enhance the stability of both the generator and discriminator throughout the training process. To achieve this target, we incorporate a learnable auxiliary penalty and an augmented discriminator, designed to control the generator and reinforce the discriminator’s stability, respectively. We further apply our method to the Hinge and LSGANs loss functions, illustrating its efficacy in reducing the instability of both the generator and the discriminator. The tests we conducted on LSUN, CelebA, Market-1501 and Creative Senz3D datasets serve as proof of our method’s ability to improve the training stability and overall performance of the baseline methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingyu Deng, Wanyi Zhang, Jie Zhao, Zhu Wang, Mingliang Zhou, Jun Luo, Chao Chen
The proliferation of multimodal big data in cities provides unprecedented opportunities for modeling and forecasting urban problems, e.g., crime prediction and house price prediction, through data-driven approaches. A fundamental and critical issue in modeling and forecasting urban problems lies in identifying suitable spatial analysis units, also known as city region partition. Existing works rely on subjective domain knowledge to produce static partitions that are general and universal for all tasks. In fact, different tasks may need different city region partitions. To address this issue, we propose a task-oriented framework for Joint Learning of region Partition and Representation (JLPR for short hereafter). To make the partition fit the task, JLPR integrates the region partition into the representation model training and learns region partitions using the supervision signal from the downstream task. We evaluate the framework on two prediction tasks (i.e., crime prediction and house price prediction) in Chicago. Experiments show that JLPR consistently outperforms state-of-the-art partitioning methods in both tasks, achieving above 25% and 70% performance improvements in terms of Mean Absolute Error (MAE) for the crime prediction and house price prediction tasks, respectively. Additionally, we undertake three visualization case studies, which yield illuminating findings from diverse perspectives and demonstrate the effectiveness and superiority of our approach.
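The joint-learning idea can be illustrated with a toy model in which a soft cell-to-region assignment (the partition) and region representations are trained end to end from a downstream regression loss. Everything below, including the shapes, the soft-assignment parameterization, and the per-cell supervision, is a hypothetical sketch rather than the JLPR architecture.

```python
# Hypothetical sketch of task-driven joint partition/representation learning:
# grid cells are softly assigned to K regions, region embeddings are pooled from
# cell features, and the downstream prediction loss trains both the assignment
# (the partition) and the representations end to end.
import torch
import torch.nn as nn


class JointPartitionRepr(nn.Module):
    def __init__(self, n_cells=400, n_regions=20, feat_dim=32):
        super().__init__()
        self.assign_logits = nn.Parameter(torch.randn(n_cells, n_regions))  # learnable partition
        self.head = nn.Linear(feat_dim, 1)                                   # downstream predictor

    def forward(self, cell_feats):                      # cell_feats: (n_cells, feat_dim)
        assign = self.assign_logits.softmax(dim=1)      # soft cell-to-region assignment
        region_repr = assign.t() @ cell_feats           # (n_regions, feat_dim) region representations
        cell_ctx = assign @ region_repr                 # each cell sees its (soft) region's embedding
        return self.head(cell_ctx).squeeze(-1)          # per-cell downstream prediction


model = JointPartitionRepr()
cell_feats = torch.randn(400, 32)                       # e.g. POI / mobility features per grid cell
targets = torch.randn(400)                              # e.g. per-cell crime counts (toy values)
loss = nn.functional.mse_loss(model(cell_feats), targets)
loss.backward()                                         # gradients also reach the partition logits
```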
{"title":"Make Partition Fit Task: A Novel Framework for Joint Learning of City Region Partition and Representation","authors":"Mingyu Deng, Wanyi Zhang, Jie Zhao, Zhu Wang, Mingliang Zhou, Jun Luo, Chao Chen","doi":"10.1145/3652857","DOIUrl":"https://doi.org/10.1145/3652857","url":null,"abstract":"<p>The proliferation of multimodal big data in cities provides unprecedented opportunities for modeling and forecasting urban problems, e.g., crime prediction and house price prediction, through data-driven approaches. A fundamental and critical issue in modeling and forecasting urban problems lies in identifying suitable spatial analysis units, also known as city region partition. Existing works rely on subjective domain knowledge for static partitions, which is general and universal for all tasks. In fact, different tasks may need different city region partitions. To address this issue, we propose a task-oriented framework for <underline><b>J</b></underline>oint <underline><b>L</b></underline>earning of region <underline><b>P</b></underline>artition and <underline><b>R</b></underline>epresentation (<b>JLPR</b> for short hereafter). To make partition fit task, <b>JLPR</b> integrates the region partition into the representation model training and learns region partitions using the supervision signal from the downstream task. We evaluate the framework on two prediction tasks (i.e., crime prediction and housing price prediction) in Chicago. Experiments show that <b>JLPR</b> consistently outperforms state-of-the-art partitioning methods in both tasks, which achieves above 25% and 70% performance improvements in terms of Mean Absolute Error (MAE) for crime prediction and house price prediction tasks, respectively. Additionally, we meticulously undertake three visualization case studies, which yield profound and illuminating findings from diverse perspectives, demonstrating the remarkable effectiveness and superiority of our approach.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"33 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140165958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly
Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos at home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency, but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets consisting of millions of image-caption pairs), making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates are selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage on standard benchmark datasets show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution for on-device search and retrieval.
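The cascade can be sketched in a few lines: score every image with a cheap embedding similarity, keep the top-k, and re-rank only the shortlist with an expensive scorer. The function below assumes precomputed image embeddings and a black-box `fine_scorer`; both are stand-ins, not CrispSearch components.

```python
# Hypothetical sketch of cascaded retrieval: a cheap coarse model scores all images
# against the text query, and only the top-k survivors are re-scored by an expensive
# fine model. Both scoring functions here are stand-ins.
import torch


def cascaded_search(query_emb, image_embs, images, fine_scorer, k=20):
    # coarse stage: cosine similarity between precomputed, disentangled embeddings (fast)
    coarse = torch.nn.functional.cosine_similarity(query_emb[None, :], image_embs, dim=1)
    top_scores, top_idx = coarse.topk(min(k, len(images)))
    # fine stage: expensive cross-modal scorer applied to the shortlist only
    fine_scores = torch.stack([fine_scorer(query_emb, images[i]) for i in top_idx])
    order = fine_scores.argsort(descending=True)
    return [int(top_idx[j]) for j in order]       # indices of images, best match first
```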
{"title":"Realizing Efficient On-Device Language-based Image Retrieval","authors":"Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly","doi":"10.1145/3649896","DOIUrl":"https://doi.org/10.1145/3649896","url":null,"abstract":"<p>Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos in the home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy through learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of having a prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency, but requires a lot more computational resources, and an order of magnitude more training data (i.e. large web-scraped datasets consisting of millions of image-caption pairs) making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a light-weight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates will be selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage, on standard benchmark datasets show that CrispSearch results in a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"99 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The popularization of video streaming brings challenges in satisfying diverse Quality of Service (QoS) requirements. The multipath extension of the Quick UDP Internet Connection (QUIC) protocol, also called MPQUIC, has the potential to improve video streaming performance by transmitting over multiple paths simultaneously. The multipath scheduler of MPQUIC determines how packets are distributed onto different paths. However, when applying current multipath schedulers to MPQUIC, our experimental results show that they fail to adapt to the various receive buffer sizes of different devices and the comprehensive QoS requirements of video streaming. These problems are especially severe under heterogeneous and dynamic network environments. To tackle these problems, we propose MARS, a Multi-Agent deep Reinforcement learning (MADRL) based Multipath QUIC Scheduler that can promptly adapt to dynamic network environments. It exploits the MADRL method to learn a neural network for each path and generate the scheduling policy. In addition, it introduces a novel multi-objective reward function that takes the out-of-order (OFO) queue size and different QoS metrics into consideration to realize adaptive scheduling optimization. We implement MARS in an MPQUIC prototype and deploy it in a Dynamic Adaptive Streaming over HTTP (DASH) system. We then compare it with state-of-the-art multipath schedulers in both emulated and real-world networks. Experimental results show that MARS outperforms the other schedulers, with better adaptive capability with respect to receive buffer sizes and QoS.
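As a hedged illustration of a multi-objective reward that trades goodput against latency and receive-side reordering, the sketch below combines the three terms with placeholder weights and normalizers; these values are not the ones used by MARS.

```python
# Hypothetical sketch of a multi-objective per-step reward for a multipath scheduler:
# reward throughput, penalize delay and out-of-order (OFO) queue occupancy. The
# weights and normalizers are placeholders, not the values used by MARS.
def scheduler_reward(throughput_mbps, rtt_ms, ofo_queue_bytes,
                     w_tput=1.0, w_rtt=0.5, w_ofo=0.5,
                     rtt_norm=200.0, ofo_norm=2 ** 20):
    r_tput = throughput_mbps / 100.0                    # normalized goodput term
    r_rtt = -w_rtt * (rtt_ms / rtt_norm)                # latency penalty
    r_ofo = -w_ofo * (ofo_queue_bytes / ofo_norm)       # receive-side reordering penalty
    return w_tput * r_tput + r_rtt + r_ofo


# e.g. one path delivering 40 Mbps at 80 ms RTT with 256 KiB stuck in the OFO queue
print(scheduler_reward(40.0, 80.0, 256 * 1024))
```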
{"title":"Multi-Agent DRL-based Multipath Scheduling for Video Streaming with QUIC","authors":"Xueqiang Han, Biao Han, Jinrong Li, Congxi Song","doi":"10.1145/3649139","DOIUrl":"https://doi.org/10.1145/3649139","url":null,"abstract":"<p>The popularization of video streaming brings challenges in satisfying diverse Quality of Service (QoS) requirements. The multipath extension of the Quick UDP Internet Connection (QUIC) protocol, also called MPQUIC, has the potential to improve video streaming performance with multiple simultaneously transmitting paths. The multipath scheduler of MPQUIC determines how to distribute the packets onto different paths. However, while applying current multipath schedulers into MPQUIC, our experimental results show that they fail to adapt to various receive buffer sizes of different devices and comprehensive QoS requirements of video streaming. These problems are especially severe under heterogeneous and dynamic network environments. To tackle these problems, we propose MARS, a <underline>M</underline>ulti-<underline>A</underline>gent deep <underline>R</underline>einforcement learning (MADRL) based Multipath QUIC <underline>S</underline>cheduler, which is able to promptly adapt to dynamic network environments. It exploits the MADRL method to learn a neural network for each path and generate scheduling policy. Besides, it introduces a novel multi-objective reward function that takes out-of-order (OFO) queue size and different QoS metrics into consideration to realize adaptive scheduling optimization. We implement MARS in an MPQUIC prototype and deploy in Dynamic Adaptive Streaming over HTTP (DASH) system. Then we compare it with the state-of-the-art multipath schedulers in both emulated and real-world networks. Experimental results show that MARS outperforms the other schedulers with better adaptive capability regarding the receive buffer sizes and QoS.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"154 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma
Invisible watermarking can be used as an important tool for copyright certification in the Metaverse. However, with the advent of deep learning, Deep Neural Networks (DNNs) have posed new threats to this technique. For example, artificially trained DNNs can perform unauthorized content analysis and gain illegal access to protected images. Furthermore, some specially crafted DNNs may even erase invisible watermarks embedded within the protected images, which eventually leads to the collapse of this protection and certification mechanism. To address these issues, inspired by adversarial attacks, we introduce Invisible Adversarial Watermarking (IAW), a novel security mechanism to enhance the copyright protection efficacy of watermarks. Specifically, we design an Adversarial Watermarking Fusion Model (AWFM) to efficiently generate Invisible Adversarial Watermark Images (IAWIs). By modeling the embedding of watermarks and adversarial perturbations as a unified task, the generated IAWIs can effectively defend against unauthorized identification, access, and erasure via DNNs, while ownership can be identified by extracting the embedded watermark. Experimental results show that the proposed IAW achieves superior extraction accuracy, attack ability, and robustness on different DNNs, and the protected images maintain good visual quality, ensuring its effectiveness as an image protection mechanism.
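One way to picture the unified embedding task is a combined loss that balances invisibility, watermark extractability, and an adversarial term against an unauthorized analysis network. The sketch below is an assumed formulation; the weightings and the choice of losses are illustrative only.

```python
# Hypothetical sketch of a joint objective for fusing a watermark with an adversarial
# perturbation: keep the protected image close to the original, make an extractor
# recover the watermark bits, and degrade an unauthorized analysis network.
# The weights and the use of cross-entropy on the analyzer are assumptions.
import torch.nn.functional as F


def iaw_loss(protected, original, wm_pred, wm_bits, analyzer_logits, true_label,
             w_fid=1.0, w_wm=1.0, w_adv=0.1):
    fidelity = F.mse_loss(protected, original)                         # invisibility
    watermark = F.binary_cross_entropy_with_logits(wm_pred, wm_bits)   # extractability
    adversarial = -F.cross_entropy(analyzer_logits, true_label)        # mislead unauthorized DNNs
    return w_fid * fidelity + w_wm * watermark + w_adv * adversarial
```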
{"title":"Invisible Adversarial Watermarking: A Novel Security Mechanism for Enhancing Copyright Protection","authors":"Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma","doi":"10.1145/3652608","DOIUrl":"https://doi.org/10.1145/3652608","url":null,"abstract":"<p>Invisible watermarking can be used as an important tool for copyright certification in the Metaverse. However, with the advent of deep learning, Deep Neural Networks (DNNs) have posed new threats to this technique. For example, artificially trained DNNs can perform unauthorized content analysis and achieve illegal access to protected images. Furthermore, some specially crafted DNNs may even erase invisible watermarks embedded within the protected images, which eventually leads to the collapse of this protection and certification mechanism. To address these issues, inspired by the adversarial attack, we introduce Invisible Adversarial Watermarking (IAW), a novel security mechanism to enhance the copyright protection efficacy of watermarks. Specifically, we design an Adversarial Watermarking Fusion Model (AWFM) to efficiently generate Invisible Adversarial Watermark Images (IAWIs). By modeling the embedding of watermarks and adversarial perturbations as a unified task, the generated IAWIs can effectively defend against unauthorized identification, access, and erase via DNNs, and identify the ownership by extracting the embedded watermark. Experimental results show that the proposed IAW presents superior extraction accuracy, attack ability, and robustness on different DNNs, and the protected images maintain good visual quality, which ensures its effectiveness as an image protection mechanism.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"36 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly realistic avatars in the metaverse may lead to severe leakage of facial privacy. Malicious users can more easily obtain the 3D structure of faces and thus use Deepfake technology to create counterfeit videos with higher realism. To automatically discern facial videos forged with ever-advancing generation techniques, deepfake detectors need stronger generalization abilities. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks can provide fundamental features for deepfake detection. We propose a video-level deepfake detection method based on a temporal transformer, with a self-supervised audio-visual contrastive learning approach for pre-training the deepfake detector. The proposed method learns motion representations in the mouth region by encouraging paired video and audio representations to be close while pushing unpaired ones apart. The deepfake detector adopts the pre-trained weights and is partially fine-tuned on deepfake datasets. Extensive experiments show that our self-supervised pre-training method can effectively improve the accuracy and robustness of our deepfake detection model without extra human effort. Compared with existing deepfake detection methods, our proposed method achieves better generalization ability in cross-dataset evaluations.
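The paired-close / unpaired-apart objective is typically realized as a symmetric InfoNCE loss over a batch of video-audio pairs; the sketch below shows that standard formulation, with the encoders and temperature left as placeholders rather than the paper's exact setup.

```python
# Hypothetical sketch of the symmetric contrastive objective: paired mouth-region
# video clips and audio clips act as positives, all other pairs in the batch act as
# negatives. Encoders and temperature are placeholders.
import torch
import torch.nn.functional as F


def av_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    video_emb = F.normalize(video_emb, dim=1)           # (B, D)
    audio_emb = F.normalize(audio_emb, dim=1)           # (B, D)
    logits = video_emb @ audio_emb.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2a = F.cross_entropy(logits, targets)         # video -> matching audio
    loss_a2v = F.cross_entropy(logits.t(), targets)     # audio -> matching video
    return 0.5 * (loss_v2a + loss_a2v)
```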
{"title":"Audio-Visual Contrastive Pre-train for Face Forgery Detection","authors":"Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Ying Guo, Zhen Cheng, Pengfei Yan, Nenghai Yu","doi":"10.1145/3651311","DOIUrl":"https://doi.org/10.1145/3651311","url":null,"abstract":"<p>The highly realistic avatar in the metaverse may lead to severe leakage of facial privacy. Malicious users can more easily obtain the 3D structure of faces, thus using Deepfake technology to create counterfeit videos with higher realism. To automatically discern facial videos forged with the advancing generation techniques, deepfake detectors need to achieve stronger generalization abilities. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks would provide fundamental features for deepfake detection. We propose a video-level deepfake detection method based on a temporal transformer with a self-supervised audio-visual contrastive learning approach for pre-training the deepfake detector. The proposed method learns motion representations in the mouth region by encouraging the paired video and audio representations to be close while unpaired ones to be diverse. The deepfake detector adopts the pre-trained weights and partially fine-tunes on deepfake datasets. Extensive experiments show that our self-supervised pre-training method can effectively improve the accuracy and robustness of our deepfake detection model without extra human efforts. Compared with existing deepfake detection methods, our proposed method achieves better generalization ability in cross-dataset evaluations.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"11 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}