Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object in the current unseen environment based on the pre-learned object category relations and its own trajectory. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects along the temporal, spatial, and regional dimensions of observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical movement or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the currently observed objects based on CRG and past observations. Finally, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and readily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The code is included in the supplementary material and will be made publicly available.
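As a rough illustration of how such a temporal → spatial → region attention stack could be wired, the PyTorch sketch below chains three attention modules over trajectory, graph-node, and region features. All module names, tensor shapes, and the use of `nn.MultiheadAttention` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Temporal -> Spatial -> Region attention stack (TSR).
# Shapes, module names, and the use of nn.MultiheadAttention are assumptions.
import torch
import torch.nn as nn


class TSRSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.region = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obs_seq, graph_nodes, regions, target):
        # obs_seq:     (B, T, D)  per-step observation features (trajectory)
        # graph_nodes: (B, N, D)  category-graph-enhanced object features
        # regions:     (B, R, D)  region features of the current frame
        # target:      (B, 1, D)  embedding of the goal object category
        hist, _ = self.temporal(obs_seq, obs_seq, obs_seq)              # temporal context
        cur = hist[:, -1:, :]                                           # current-step summary
        spatial_ctx, _ = self.spatial(cur, graph_nodes, graph_nodes)    # object layout context
        # shift attention to target-relevant regions of the current observation
        region_ctx, attn = self.region(target + spatial_ctx, regions, regions)
        return region_ctx.squeeze(1), attn                              # (B, D), (B, 1, R)


if __name__ == "__main__":
    m = TSRSketch()
    out, attn = m(torch.randn(2, 8, 256), torch.randn(2, 22, 256),
                  torch.randn(2, 49, 256), torch.randn(2, 1, 256))
    print(out.shape, attn.shape)  # torch.Size([2, 256]) torch.Size([2, 1, 49])
```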
{"title":"Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation","authors":"Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv","doi":"10.1145/3653714","DOIUrl":"https://doi.org/10.1145/3653714","url":null,"abstract":"<p>Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object based on the pre-learned object category relations and its trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects in temporal, spatial, and regions within observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical moving or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the current observation objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code and will be publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"16 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing prevalence of cloud computing platforms, ensuring data privacy during cloud-based image services such as classification has become crucial. In this study, we propose a novel privacy-preserving image classification scheme that enables classifiers trained in the plaintext domain to be applied directly to encrypted images, without the need to retrain a dedicated classifier. Moreover, encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key. Specifically, our scheme utilizes a feature extractor and an encoder to mask the plaintext image through a newly designed Noise-like Adversarial Example (NAE). Such an NAE not only gives the encrypted image a noise-like visual appearance but also compels the target classifier to predict the ciphertext as the same label as the original plaintext image. In the decoding phase, we adopt a Symmetric Residual Learning (SRL) framework to restore the plaintext image with minimal degradation. Extensive experiments demonstrate that 1) the classification accuracy of the classifier trained in the plaintext domain remains the same in both the ciphertext and plaintext domains; 2) the encrypted images can be recovered into their original form with an average PSNR of up to 51+ dB on the SVHN dataset and 48+ dB on the VGGFace2 dataset; 3) our system exhibits satisfactory generalization on the encryption, decryption, and classification tasks across datasets different from the training one; and 4) a high level of security is achieved against three potential threat models. The code is available at https://github.com/csjunjun/RIC.git.
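The core NAE constraint described above (a noise-looking ciphertext that the frozen classifier still maps to the plaintext label) can be illustrated with a simple iterative optimization. The sketch below is only a stand-in for the paper's trained feature extractor and encoder; `make_nae`, the step count, and the learning rate are hypothetical.

```python
# Hypothetical sketch of the core Noise-like Adversarial Example (NAE) constraint:
# start from random noise and nudge it until a frozen classifier assigns it the
# plaintext image's label. The paper trains an encoder/decoder pair; this simple
# iterative optimization only illustrates the objective, not the actual scheme.
import torch
import torch.nn.functional as F


def make_nae(classifier, plain_img, steps=200, lr=0.05):
    classifier.eval()
    with torch.no_grad():
        label = classifier(plain_img).argmax(dim=1)           # label of the plaintext image
    cipher = torch.rand_like(plain_img, requires_grad=True)   # noise-like starting point
    opt = torch.optim.Adam([cipher], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = classifier(cipher)
        loss = F.cross_entropy(logits, label)                 # keep the original prediction
        loss.backward()
        opt.step()
        with torch.no_grad():
            cipher.clamp_(0.0, 1.0)                           # stay a valid image tensor
    return cipher.detach()
```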
{"title":"Recoverable Privacy-Preserving Image Classification through Noise-like Adversarial Examples","authors":"Jun Liu, Jiantao Zhou, Jinyu Tian, Weiwei Sun","doi":"10.1145/3653676","DOIUrl":"https://doi.org/10.1145/3653676","url":null,"abstract":"<p>With the increasing prevalence of cloud computing platforms, ensuring data privacy during the cloud-based image-related services such as classification has become crucial. In this study, we propose a novel privacy-preserving image classification scheme that enables the direct application of classifiers trained in the plaintext domain to classify encrypted images, without the need of retraining a dedicated classifier. Moreover, encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key. Specifically, our proposed scheme involves utilizing a feature extractor and an encoder to mask the plaintext image through a newly designed Noise-like Adversarial Example (NAE). Such an NAE not only introduces a noise-like visual appearance to the encrypted image but also compels the target classifier to predict the ciphertext as the same label as the original plaintext image. At the decoding phase, we adopt a Symmetric Residual Learning (SRL) framework for restoring the plaintext image with minimal degradation. Extensive experiments demonstrate that 1) the classification accuracy of the classifier trained in the plaintext domain remains the same in both the ciphertext and plaintext domains; 2) the encrypted images can be recovered into their original form with an average PSNR of up to 51+ dB for the SVHN dataset and 48+ dB for the VGGFace2 dataset; 3) our system exhibits satisfactory generalization capability on the encryption, decryption and classification tasks across datasets that are different from the training one; and 4) a high-level of security is achieved against three potential threat models. The code is available at https://github.com/csjunjun/RIC.git.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Long Tang, Dengpan Ye, Zhenhao Lu, Yunming Zhang, Chuanxi Chen
Face manipulation can modify a victim’s facial attributes, e.g., age or hair color, in an image, and is an important component of DeepFakes. Adversarial examples are an emerging approach to combat the threat of visual misinformation to society. To efficiently protect facial images from being forged, designing a universal face anti-manipulation disruptor is essential. However, existing works treat deepfake disruption as an end-to-end process, ignoring the functional difference between feature extraction and image reconstruction. In this work, we propose a novel Feature-Output ensemble UNiversal Disruptor (FOUND) against face manipulation networks, which takes a new perspective that treats attacking the feature-extraction (encoding) modules as the critical task in deepfake disruption. We conduct an effective two-stage disruption process: we first perform ensemble disruption on multi-model encoders, maximizing the Wasserstein distance between features before and after the adversarial attack, and then develop a gradient-ensemble strategy to enhance the disruption effect by simplifying the complex optimization problem of disrupting ensemble end-to-end models. Extensive experiments indicate that a single FOUND generated with a few facial images can successfully disrupt multiple face manipulation models on cross-attribute and cross-face images, surpassing state-of-the-art universal disruptors in both success rate and efficiency.
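A minimal sketch of the feature-level ensemble disruption idea, assuming a signed-gradient ascent loop and substituting a plain L2 feature distance for the paper's Wasserstein objective; the function name and hyperparameters are placeholders.

```python
# Hypothetical sketch of ensemble feature-level disruption: a universal perturbation
# is optimized to push each encoder's features of the perturbed face away from the
# clean features. The paper maximizes a Wasserstein distance with a gradient-ensemble
# strategy; this sketch substitutes a simple L2 feature distance for illustration.
import torch


def found_sketch(encoders, faces, eps=8 / 255, steps=40, alpha=2 / 255):
    delta = torch.zeros_like(faces[:1], requires_grad=True)    # one universal disruptor
    for _ in range(steps):
        loss = 0.0
        for enc in encoders:                                   # ensemble over encoders
            clean = enc(faces).detach()
            perturbed = enc((faces + delta).clamp(0, 1))
            loss = loss + (perturbed - clean).pow(2).mean()    # push features apart
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():                                  # signed-gradient ascent step
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)                            # keep the disruptor imperceptible
    return delta.detach()
```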
{"title":"Feature Extraction Matters More: An Effective and Efficient Universal Deepfake Disruptor","authors":"Long Tang, Dengpan Ye, Zhenhao Lu, Yunming Zhang, Chuanxi Chen","doi":"10.1145/3653457","DOIUrl":"https://doi.org/10.1145/3653457","url":null,"abstract":"<p>Face manipulation can modify a victim’s facial attributes, e.g., age or hair color, in an image, which is an important component of DeepFakes. Adversarial examples are an emerging approach to combat the threat of visual misinformation to society. To efficiently protect facial images from being forged, designing a universal face anti-manipulation disruptor is essential. However, existing works treat deepfake disruption as an end-to-end process, ignoring the functional difference between feature extraction and image reconstruction. In this work, we propose a novel <b>F</b>eature-<b>O</b>utput ensemble <b>UN</b>iversal <b>D</b>isruptor (FOUND) against face manipulation networks, which explores a new opinion considering attacking feature-extraction (encoding) modules as the critical task in deepfake disruption. We conduct an effective two-stage disruption process. We first perform ensemble disruption on multi-model encoders, maximizing the Wasserstein distance between features before and after the adversarial attack. Then develop a gradient-ensemble strategy to enhance the disruption effect by simplifying the complex optimization problem of disrupting ensemble end-to-end models. Extensive experiments indicate that one FOUND generated with a few facial images can successfully disrupt multiple face manipulation models on cross-attribute and cross-face images, surpassing state-of-the-art universal disruptors in both success rate and efficiency.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"22 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140166015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Celard, E. L. Iglesias, J. M. Sorribes-Fdez, L. Borrajo, A. Seara Vieira
Image generative models have advanced in many areas to produce synthetic images of high resolution and detail. This success has enabled their use in the biomedical field, paving the way for the generation of videos showing the biological evolution of their content. Despite the power of generative video models, their use has not yet extended to time-based development, focusing almost exclusively on generating motion in space. This situation is largely due to the lack of specific datasets and metrics for measuring the individual quality of videos, particularly when no ground truth is available for comparison. We propose a new dataset, called GoldenDOT, which tracks the evolution of apples cut in parallel over 10 days, allowing their progress to be observed over time while they remain spatially static. In addition, four new metrics are proposed that provide different analyses of the generated videos, both as a whole and individually. In this paper, the proposed dataset and metrics are used to study three state-of-the-art video generative models and their feasibility for generating videos with biological development: TemporalGAN (TGANv2), Low Dimensional Video Discriminator GAN (LDVDGAN), and Video Diffusion Model (VDM). Among them, the TGANv2 model obtains the best results on the vast majority of metrics, including established state-of-the-art measures, demonstrating the viability of the newly proposed metrics and their congruence with these standard measures.
{"title":"New Metrics and Dataset for Biological Development Video Generation","authors":"P. Celard, E. L. Iglesias, J. M. Sorribes-Fdez, L. Borrajo, A. Seara Vieira","doi":"10.1145/3653456","DOIUrl":"https://doi.org/10.1145/3653456","url":null,"abstract":"<p>Image generative models have advanced in many areas to produce synthetic images of high resolution and detail. This success has enabled its use in the biomedical field, paving the way for the generation of videos showing the biological evolution of its content. Despite the power of generative video models, their use has not yet extended to time-based development, focusing almost exclusively on generating motion in space. This situation is largely due to the lack of specific data sets and metrics to measure the individual quality of videos, particularly when there is no ground truth available for comparison. We propose a new dataset, called GoldenDOT, which tracks the evolution of apples cut in parallel over 10 days, allowing to observe their progress over time while remaining static. In addition, four new metrics are proposed that provide different analyses of the generated videos as a whole and individually. In this paper, the proposed dataset and measures are used to study three state of the art video generative models and their feasibility for video generation with biological development: TemporalGAN (TGANv2), Low Dimensional Video Discriminator GAN (LDVDGAN), and Video Diffusion Model (VDM). Among them, the TGANv2 model has managed to obtain the best results in the vast majority of metrics, including those already known in the state of the art, demonstrating the viability of the new proposed metrics and their congruence with these standard measures.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"103 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140165844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang
Training generative adversarial networks (GANs) for noise-to-image synthesis is a challenging task, primarily due to the instability of the GAN training process. One of the key issues is the generator’s sensitivity to input data, which can cause sudden fluctuations in the generator’s loss value for certain inputs. This sensitivity suggests an inadequate ability of the generator to resist disturbances, causing the discriminator’s loss value to oscillate and negatively impacting the discriminator. In turn, the discriminator’s negative feedback is not conducive to updating the generator’s parameters, leading to suboptimal image generation quality. In response to this challenge, we present an innovative GAN model equipped with a learnable auxiliary module that processes auxiliary noise. The core objective of this module is to enhance the stability of both the generator and the discriminator throughout the training process. To achieve this, we incorporate a learnable auxiliary penalty and an augmented discriminator, designed to control the generator and reinforce the discriminator’s stability, respectively. We further apply our method to the Hinge and LSGANs loss functions, illustrating its efficacy in reducing the instability of both the generator and the discriminator. Tests conducted on the LSUN, CelebA, Market-1501, and Creative Senz3D datasets serve as proof of our method’s ability to improve the training stability and overall performance of the baseline methods.
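To illustrate where a learnable auxiliary penalty could attach to a hinge objective, the sketch below adds a small trainable penalty module to the generator loss. The penalty's form, its auxiliary-noise input, and the weighting `lam` are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of hinge GAN losses with an extra learnable auxiliary penalty
# attached to the generator objective. The penalty module, its input (auxiliary
# noise), and the weighting are placeholders; the paper's exact formulation differs.
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryPenalty(nn.Module):
    """Toy learnable penalty over auxiliary noise (placeholder for the paper's module)."""
    def __init__(self, noise_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, aux_noise):
        return F.softplus(self.net(aux_noise)).mean()


def d_hinge_loss(d_real, d_fake):
    # standard hinge loss for the discriminator
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()


def g_hinge_loss(d_fake, penalty_module, aux_noise, lam=0.1):
    # hinge generator loss plus the learnable auxiliary penalty term
    return -d_fake.mean() + lam * penalty_module(aux_noise)
```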
{"title":"Generative Adversarial Networks with Learnable Auxiliary Module for Image Synthesis","authors":"Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang","doi":"10.1145/3653021","DOIUrl":"https://doi.org/10.1145/3653021","url":null,"abstract":"<p>Training generative adversarial networks (GANs) for noise-to-image synthesis is a challenge task, primarily due to the instability of GANs’ training process. One of the key issues is the generator’s sensitivity to input data, which can cause sudden fluctuations in the generator’s loss value with certain inputs. This sensitivity suggests an inadequate ability to resist disturbances in the generator, causing the discriminator’s loss value to oscillate and negatively impacting the discriminator. Then, the negative feedback of discriminator is also not conducive to updating generator’s parameters, leading to suboptimal image generation quality. In response to this challenge, we present an innovative GANs model equipped with a learnable auxiliary module that processes auxiliary noise. The core objective of this module is to enhance the stability of both the generator and discriminator throughout the training process. To achieve this target, we incorporate a learnable auxiliary penalty and an augmented discriminator, designed to control the generator and reinforce the discriminator’s stability, respectively. We further apply our method to the Hinge and LSGANs loss functions, illustrating its efficacy in reducing the instability of both the generator and the discriminator. The tests we conducted on LSUN, CelebA, Market-1501 and Creative Senz3D datasets serve as proof of our method’s ability to improve the training stability and overall performance of the baseline methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingyu Deng, Wanyi Zhang, Jie Zhao, Zhu Wang, Mingliang Zhou, Jun Luo, Chao Chen
The proliferation of multimodal big data in cities provides unprecedented opportunities for modeling and forecasting urban problems, e.g., crime prediction and house price prediction, through data-driven approaches. A fundamental and critical issue in modeling and forecasting urban problems lies in identifying suitable spatial analysis units, also known as city region partition. Existing works rely on subjective domain knowledge to produce static partitions that are general and universal for all tasks. In fact, different tasks may need different city region partitions. To address this issue, we propose a task-oriented framework for Joint Learning of region Partition and Representation (JLPR for short hereafter). To make the partition fit the task, JLPR integrates the region partition into the representation model training and learns region partitions using the supervision signal from the downstream task. We evaluate the framework on two prediction tasks (i.e., crime prediction and house price prediction) in Chicago. Experiments show that JLPR consistently outperforms state-of-the-art partitioning methods in both tasks, achieving above 25% and 70% performance improvements in terms of Mean Absolute Error (MAE) for the crime prediction and house price prediction tasks, respectively. Additionally, we undertake three visualization case studies, which yield illuminating findings from diverse perspectives and demonstrate the effectiveness and superiority of our approach.
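The joint-learning idea can be illustrated with a toy model in which a soft cell-to-region assignment (the partition) and region representations are trained end to end from a downstream regression loss. Everything below, including the shapes, the soft-assignment parameterization, and the per-cell supervision, is a hypothetical sketch rather than the JLPR architecture.

```python
# Hypothetical sketch of task-driven joint partition/representation learning:
# grid cells are softly assigned to K regions, region embeddings are pooled from
# cell features, and the downstream prediction loss trains both the assignment
# (the partition) and the representations end to end.
import torch
import torch.nn as nn


class JointPartitionRepr(nn.Module):
    def __init__(self, n_cells=400, n_regions=20, feat_dim=32):
        super().__init__()
        self.assign_logits = nn.Parameter(torch.randn(n_cells, n_regions))  # learnable partition
        self.head = nn.Linear(feat_dim, 1)                                   # downstream predictor

    def forward(self, cell_feats):                      # cell_feats: (n_cells, feat_dim)
        assign = self.assign_logits.softmax(dim=1)      # soft cell-to-region assignment
        region_repr = assign.t() @ cell_feats           # (n_regions, feat_dim) region representations
        cell_ctx = assign @ region_repr                 # each cell sees its (soft) region's embedding
        return self.head(cell_ctx).squeeze(-1)          # per-cell downstream prediction


model = JointPartitionRepr()
cell_feats = torch.randn(400, 32)                       # e.g. POI / mobility features per grid cell
targets = torch.randn(400)                              # e.g. per-cell crime counts (toy values)
loss = nn.functional.mse_loss(model(cell_feats), targets)
loss.backward()                                         # gradients also reach the partition logits
```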
{"title":"Make Partition Fit Task: A Novel Framework for Joint Learning of City Region Partition and Representation","authors":"Mingyu Deng, Wanyi Zhang, Jie Zhao, Zhu Wang, Mingliang Zhou, Jun Luo, Chao Chen","doi":"10.1145/3652857","DOIUrl":"https://doi.org/10.1145/3652857","url":null,"abstract":"<p>The proliferation of multimodal big data in cities provides unprecedented opportunities for modeling and forecasting urban problems, e.g., crime prediction and house price prediction, through data-driven approaches. A fundamental and critical issue in modeling and forecasting urban problems lies in identifying suitable spatial analysis units, also known as city region partition. Existing works rely on subjective domain knowledge for static partitions, which is general and universal for all tasks. In fact, different tasks may need different city region partitions. To address this issue, we propose a task-oriented framework for <underline><b>J</b></underline>oint <underline><b>L</b></underline>earning of region <underline><b>P</b></underline>artition and <underline><b>R</b></underline>epresentation (<b>JLPR</b> for short hereafter). To make partition fit task, <b>JLPR</b> integrates the region partition into the representation model training and learns region partitions using the supervision signal from the downstream task. We evaluate the framework on two prediction tasks (i.e., crime prediction and housing price prediction) in Chicago. Experiments show that <b>JLPR</b> consistently outperforms state-of-the-art partitioning methods in both tasks, which achieves above 25% and 70% performance improvements in terms of Mean Absolute Error (MAE) for crime prediction and house price prediction tasks, respectively. Additionally, we meticulously undertake three visualization case studies, which yield profound and illuminating findings from diverse perspectives, demonstrating the remarkable effectiveness and superiority of our approach.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"33 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140165958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly
Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos at home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency, but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets consisting of millions of image-caption pairs), making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates are selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage on standard benchmark datasets show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution for on-device search and retrieval.
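The cascade can be sketched in a few lines: score every image with a cheap embedding similarity, keep the top-k, and re-rank only the shortlist with an expensive scorer. The function below assumes precomputed image embeddings and a black-box `fine_scorer`; both are stand-ins, not CrispSearch components.

```python
# Hypothetical sketch of cascaded retrieval: a cheap coarse model scores all images
# against the text query, and only the top-k survivors are re-scored by an expensive
# fine model. Both scoring functions here are stand-ins.
import torch


def cascaded_search(query_emb, image_embs, images, fine_scorer, k=20):
    # coarse stage: cosine similarity between precomputed, disentangled embeddings (fast)
    coarse = torch.nn.functional.cosine_similarity(query_emb[None, :], image_embs, dim=1)
    top_scores, top_idx = coarse.topk(min(k, len(images)))
    # fine stage: expensive cross-modal scorer applied to the shortlist only
    fine_scores = torch.stack([fine_scorer(query_emb, images[i]) for i in top_idx])
    order = fine_scores.argsort(descending=True)
    return [int(top_idx[j]) for j in order]       # indices of images, best match first
```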
{"title":"Realizing Efficient On-Device Language-based Image Retrieval","authors":"Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly","doi":"10.1145/3649896","DOIUrl":"https://doi.org/10.1145/3649896","url":null,"abstract":"<p>Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos in the home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy through learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of having a prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency, but requires a lot more computational resources, and an order of magnitude more training data (i.e. large web-scraped datasets consisting of millions of image-caption pairs) making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a light-weight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates will be selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage, on standard benchmark datasets show that CrispSearch results in a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"99 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The popularization of video streaming brings challenges in satisfying diverse Quality of Service (QoS) requirements. The multipath extension of the Quick UDP Internet Connection (QUIC) protocol, also called MPQUIC, has the potential to improve video streaming performance by transmitting over multiple paths simultaneously. The multipath scheduler of MPQUIC determines how packets are distributed onto different paths. However, when applying current multipath schedulers to MPQUIC, our experimental results show that they fail to adapt to the various receive buffer sizes of different devices and the comprehensive QoS requirements of video streaming. These problems are especially severe under heterogeneous and dynamic network environments. To tackle these problems, we propose MARS, a Multi-Agent deep Reinforcement learning (MADRL) based Multipath QUIC Scheduler that can promptly adapt to dynamic network environments. It exploits the MADRL method to learn a neural network for each path and generate the scheduling policy. In addition, it introduces a novel multi-objective reward function that takes the out-of-order (OFO) queue size and different QoS metrics into consideration to realize adaptive scheduling optimization. We implement MARS in an MPQUIC prototype and deploy it in a Dynamic Adaptive Streaming over HTTP (DASH) system. We then compare it with state-of-the-art multipath schedulers in both emulated and real-world networks. Experimental results show that MARS outperforms the other schedulers, with better adaptive capability with respect to receive buffer sizes and QoS.
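As a hedged illustration of a multi-objective reward that trades goodput against latency and receive-side reordering, the sketch below combines the three terms with placeholder weights and normalizers; these values are not the ones used by MARS.

```python
# Hypothetical sketch of a multi-objective per-step reward for a multipath scheduler:
# reward throughput, penalize delay and out-of-order (OFO) queue occupancy. The
# weights and normalizers are placeholders, not the values used by MARS.
def scheduler_reward(throughput_mbps, rtt_ms, ofo_queue_bytes,
                     w_tput=1.0, w_rtt=0.5, w_ofo=0.5,
                     rtt_norm=200.0, ofo_norm=2 ** 20):
    r_tput = throughput_mbps / 100.0                    # normalized goodput term
    r_rtt = -w_rtt * (rtt_ms / rtt_norm)                # latency penalty
    r_ofo = -w_ofo * (ofo_queue_bytes / ofo_norm)       # receive-side reordering penalty
    return w_tput * r_tput + r_rtt + r_ofo


# e.g. one path delivering 40 Mbps at 80 ms RTT with 256 KiB stuck in the OFO queue
print(scheduler_reward(40.0, 80.0, 256 * 1024))
```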
{"title":"Multi-Agent DRL-based Multipath Scheduling for Video Streaming with QUIC","authors":"Xueqiang Han, Biao Han, Jinrong Li, Congxi Song","doi":"10.1145/3649139","DOIUrl":"https://doi.org/10.1145/3649139","url":null,"abstract":"<p>The popularization of video streaming brings challenges in satisfying diverse Quality of Service (QoS) requirements. The multipath extension of the Quick UDP Internet Connection (QUIC) protocol, also called MPQUIC, has the potential to improve video streaming performance with multiple simultaneously transmitting paths. The multipath scheduler of MPQUIC determines how to distribute the packets onto different paths. However, while applying current multipath schedulers into MPQUIC, our experimental results show that they fail to adapt to various receive buffer sizes of different devices and comprehensive QoS requirements of video streaming. These problems are especially severe under heterogeneous and dynamic network environments. To tackle these problems, we propose MARS, a <underline>M</underline>ulti-<underline>A</underline>gent deep <underline>R</underline>einforcement learning (MADRL) based Multipath QUIC <underline>S</underline>cheduler, which is able to promptly adapt to dynamic network environments. It exploits the MADRL method to learn a neural network for each path and generate scheduling policy. Besides, it introduces a novel multi-objective reward function that takes out-of-order (OFO) queue size and different QoS metrics into consideration to realize adaptive scheduling optimization. We implement MARS in an MPQUIC prototype and deploy in Dynamic Adaptive Streaming over HTTP (DASH) system. Then we compare it with the state-of-the-art multipath schedulers in both emulated and real-world networks. Experimental results show that MARS outperforms the other schedulers with better adaptive capability regarding the receive buffer sizes and QoS.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"154 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma
Invisible watermarking can be used as an important tool for copyright certification in the Metaverse. However, with the advent of deep learning, Deep Neural Networks (DNNs) have posed new threats to this technique. For example, artificially trained DNNs can perform unauthorized content analysis and gain illegal access to protected images. Furthermore, some specially crafted DNNs may even erase invisible watermarks embedded within the protected images, which eventually leads to the collapse of this protection and certification mechanism. To address these issues, inspired by adversarial attacks, we introduce Invisible Adversarial Watermarking (IAW), a novel security mechanism to enhance the copyright protection efficacy of watermarks. Specifically, we design an Adversarial Watermarking Fusion Model (AWFM) to efficiently generate Invisible Adversarial Watermark Images (IAWIs). By modeling the embedding of watermarks and adversarial perturbations as a unified task, the generated IAWIs can effectively defend against unauthorized identification, access, and erasure via DNNs, while ownership can be identified by extracting the embedded watermark. Experimental results show that the proposed IAW achieves superior extraction accuracy, attack ability, and robustness on different DNNs, and the protected images maintain good visual quality, ensuring its effectiveness as an image protection mechanism.
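One way to picture the unified embedding task is a combined loss that balances invisibility, watermark extractability, and an adversarial term against an unauthorized analysis network. The sketch below is an assumed formulation; the weightings and the choice of losses are illustrative only.

```python
# Hypothetical sketch of a joint objective for fusing a watermark with an adversarial
# perturbation: keep the protected image close to the original, make an extractor
# recover the watermark bits, and degrade an unauthorized analysis network.
# The weights and the use of cross-entropy on the analyzer are assumptions.
import torch.nn.functional as F


def iaw_loss(protected, original, wm_pred, wm_bits, analyzer_logits, true_label,
             w_fid=1.0, w_wm=1.0, w_adv=0.1):
    fidelity = F.mse_loss(protected, original)                         # invisibility
    watermark = F.binary_cross_entropy_with_logits(wm_pred, wm_bits)   # extractability
    adversarial = -F.cross_entropy(analyzer_logits, true_label)        # mislead unauthorized DNNs
    return w_fid * fidelity + w_wm * watermark + w_adv * adversarial
```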
{"title":"Invisible Adversarial Watermarking: A Novel Security Mechanism for Enhancing Copyright Protection","authors":"Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma","doi":"10.1145/3652608","DOIUrl":"https://doi.org/10.1145/3652608","url":null,"abstract":"<p>Invisible watermarking can be used as an important tool for copyright certification in the Metaverse. However, with the advent of deep learning, Deep Neural Networks (DNNs) have posed new threats to this technique. For example, artificially trained DNNs can perform unauthorized content analysis and achieve illegal access to protected images. Furthermore, some specially crafted DNNs may even erase invisible watermarks embedded within the protected images, which eventually leads to the collapse of this protection and certification mechanism. To address these issues, inspired by the adversarial attack, we introduce Invisible Adversarial Watermarking (IAW), a novel security mechanism to enhance the copyright protection efficacy of watermarks. Specifically, we design an Adversarial Watermarking Fusion Model (AWFM) to efficiently generate Invisible Adversarial Watermark Images (IAWIs). By modeling the embedding of watermarks and adversarial perturbations as a unified task, the generated IAWIs can effectively defend against unauthorized identification, access, and erase via DNNs, and identify the ownership by extracting the embedded watermark. Experimental results show that the proposed IAW presents superior extraction accuracy, attack ability, and robustness on different DNNs, and the protected images maintain good visual quality, which ensures its effectiveness as an image protection mechanism.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"36 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly realistic avatars in the metaverse may lead to severe leakage of facial privacy. Malicious users can more easily obtain the 3D structure of faces and thus use Deepfake technology to create counterfeit videos with higher realism. To automatically discern facial videos forged with ever-advancing generation techniques, deepfake detectors need stronger generalization abilities. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks can provide fundamental features for deepfake detection. We propose a video-level deepfake detection method based on a temporal transformer, with a self-supervised audio-visual contrastive learning approach for pre-training the deepfake detector. The proposed method learns motion representations in the mouth region by encouraging paired video and audio representations to be close while pushing unpaired ones apart. The deepfake detector adopts the pre-trained weights and is partially fine-tuned on deepfake datasets. Extensive experiments show that our self-supervised pre-training method can effectively improve the accuracy and robustness of our deepfake detection model without extra human effort. Compared with existing deepfake detection methods, our proposed method achieves better generalization ability in cross-dataset evaluations.
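The paired-close / unpaired-apart objective is typically realized as a symmetric InfoNCE loss over a batch of video-audio pairs; the sketch below shows that standard formulation, with the encoders and temperature left as placeholders rather than the paper's exact setup.

```python
# Hypothetical sketch of the symmetric contrastive objective: paired mouth-region
# video clips and audio clips act as positives, all other pairs in the batch act as
# negatives. Encoders and temperature are placeholders.
import torch
import torch.nn.functional as F


def av_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    video_emb = F.normalize(video_emb, dim=1)           # (B, D)
    audio_emb = F.normalize(audio_emb, dim=1)           # (B, D)
    logits = video_emb @ audio_emb.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2a = F.cross_entropy(logits, targets)         # video -> matching audio
    loss_a2v = F.cross_entropy(logits.t(), targets)     # audio -> matching video
    return 0.5 * (loss_v2a + loss_a2v)
```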
{"title":"Audio-Visual Contrastive Pre-train for Face Forgery Detection","authors":"Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Ying Guo, Zhen Cheng, Pengfei Yan, Nenghai Yu","doi":"10.1145/3651311","DOIUrl":"https://doi.org/10.1145/3651311","url":null,"abstract":"<p>The highly realistic avatar in the metaverse may lead to severe leakage of facial privacy. Malicious users can more easily obtain the 3D structure of faces, thus using Deepfake technology to create counterfeit videos with higher realism. To automatically discern facial videos forged with the advancing generation techniques, deepfake detectors need to achieve stronger generalization abilities. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks would provide fundamental features for deepfake detection. We propose a video-level deepfake detection method based on a temporal transformer with a self-supervised audio-visual contrastive learning approach for pre-training the deepfake detector. The proposed method learns motion representations in the mouth region by encouraging the paired video and audio representations to be close while unpaired ones to be diverse. The deepfake detector adopts the pre-trained weights and partially fine-tunes on deepfake datasets. Extensive experiments show that our self-supervised pre-training method can effectively improve the accuracy and robustness of our deepfake detection model without extra human efforts. Compared with existing deepfake detection methods, our proposed method achieves better generalization ability in cross-dataset evaluations.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"11 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}