Image-based virtual try-on aims to fit a target garment to a reference person. Most existing methods are limited to the Garment-To-Person (G2P) try-on task, which transfers a garment from a clean product image to the reference person, and do not consider the Person-To-Person (P2P) try-on task, which transfers a garment from a clothed person image to the reference person; this limits their practical applicability. The P2P try-on task is more challenging due to spatial discrepancies caused by differences in pose, body shape, and view between the reference person and the target person. To address this issue, we propose a novel Keypoints-Driven Flow Based Virtual Try-On Network (KF-VTON) that handles both the G2P and P2P try-on tasks. Our KF-VTON has two key innovations: 1) We propose a new keypoints-driven flow based deformation model to warp the garment. This model establishes spatial correspondences between the target garment and the reference person by combining the robustness of Thin-plate Spline (TPS) based deformation with the flexibility of appearance flow based deformation. 2) We investigate a powerful Context-aware Spatially Adaptive Normalization (CSAN) generative module to synthesize the final try-on image. In particular, CSAN integrates rich contextual information with semantic parsing guidance to properly infer unobserved garment appearances. Extensive experiments demonstrate that KF-VTON produces photo-realistic and high-fidelity try-on results for both the G2P and P2P try-on tasks and surpasses previous state-of-the-art methods both quantitatively and qualitatively. Our code is available at https://github.com/OIUIU/KF-VTON.
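For readers unfamiliar with appearance-flow warping, the following minimal PyTorch sketch shows how a dense flow field (here simply assumed to be predicted from keypoint correspondences) deforms a garment image via bilinear sampling; it illustrates flow-based warping in general, not the authors' KF-VTON implementation, and all tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def warp_with_flow(garment, flow):
    # garment: (B, 3, H, W) image tensor; flow: (B, 2, H, W) pixel offsets (dx, dy).
    B, _, H, W = garment.shape
    # Build a base sampling grid of absolute pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(garment.device)  # (1, 2, H, W)
    coords = base + flow                                     # absolute sampling positions
    # Normalize to [-1, 1] as required by grid_sample, then reorder to (B, H, W, 2).
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(garment, grid, mode="bilinear", align_corners=True)

# Usage: a zero flow returns the garment unchanged.
garment = torch.rand(1, 3, 256, 192)
flow = torch.zeros(1, 2, 256, 192)
warped = warp_with_flow(garment, flow)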
{"title":"KF-VTON: Keypoints-Driven Flow Based Virtual Try-On Network","authors":"Zizhao Wu, Siyu Liu, Peioyan Lu, Ping Yang, Yongkang Wong, Xiaoling Gu, Mohan S. Kankanhalli","doi":"10.1145/3673903","DOIUrl":"https://doi.org/10.1145/3673903","url":null,"abstract":"<p>Image-based virtual try-on aims to fit a target garment to a reference person. Most existing methods are limited to solving the Garment-To-Person (G2P) try-on task that transfers a garment from a clean product image to the reference person and do not consider the Person-To-Person (P2P) try-on task that transfers a garment from a clothed person image to the reference person, which limits the practical applicability. The P2P try-on task is more challenging due to spatial discrepancies caused by different poses, body shapes, and views between the reference person and the target person. To address this issue, we propose a novel Keypoints-Driven Flow Based Virtual Try-On Network (KF-VTON) for handling both the G2P and P2P try-on tasks. Our KF-VTON has two key innovations: 1) We propose a new <i>keypoints-driven flow based deformation model</i> to warp the garment. This model establishes spatial correspondences between the target garment and reference person by combining the robustness of Thin-plate Spline (TPS) based deformation and the flexibility of appearance flow based deformation. 2) We investigate a powerful <i>Context-aware Spatially Adaptive Normalization (CSAN) generative module</i> to synthesize the final try-on image. Particularly, CSAN integrates rich contextual information with semantic parsing guidance to properly infer unobserved garment appearances. Extensive experiments demonstrate that our KF-VTON is capable of producing photo-realistic and high-fidelity try-on results for the G2P as well as P2P try-on tasks and surpasses previous state-of-the-art methods both quantitatively and qualitatively. Our code is available at https://github.com/OIUIU/KF-VTON.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"34 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The challenge of cross-domain few-shot learning (CD-FSL) stems from the substantial distribution disparities between target and source domain images, necessitating a model with robust generalization capabilities. In this work, we posit that large-scale pretrained models are pivotal in addressing the cross-domain few-shot learning task owing to their exceptional representational and generalization prowess. To our knowledge, no existing research comprehensively investigates the utility of large-scale pretrained models in the cross-domain few-shot learning context. Addressing this gap, our study presents an exhaustive empirical assessment of the CLIP model within the cross-domain few-shot learning task. We undertake a comparison spanning six dimensions: base model, transfer module, classifier, loss, data augmentation, and training schedule. Furthermore, we establish a straightforward baseline model, E-base, based on our empirical analysis, underscoring the importance of our investigation. Experimental results substantiate the efficacy of our model, yielding a mean gain of 1.2% in 5-way 5-shot evaluations on the BSCD dataset.
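As a concrete illustration of the 5-way 5-shot protocol referred to above, here is a minimal prototype-classifier sketch over frozen features from any pretrained backbone (e.g., a CLIP image encoder); the feature dimension and random inputs are assumptions for illustration, and this is not the paper's E-base model.

import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats, n_way=5):
    # support_feats: (n_way * k_shot, D) frozen features; support_labels: (n_way * k_shot,)
    # query_feats: (Q, D). Features are L2-normalized before comparison.
    support_feats = F.normalize(support_feats, dim=-1)
    query_feats = F.normalize(query_feats, dim=-1)
    # Class prototype = mean of that class's support embeddings.
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_way)])
    protos = F.normalize(protos, dim=-1)
    # Cosine similarity to each prototype acts as the logit.
    logits = query_feats @ protos.t()
    return logits.argmax(dim=-1)

# Toy usage with random 512-d features for one 5-way 5-shot episode.
support = torch.randn(25, 512)
labels = torch.arange(5).repeat_interleave(5)
query = torch.randn(75, 512)
preds = prototype_classify(support, labels, query)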
{"title":"Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning","authors":"Linhai Zhuo, Yuqian Fu, Jingjing Chen, Yixin Cao, Yu-Gang Jiang","doi":"10.1145/3673231","DOIUrl":"https://doi.org/10.1145/3673231","url":null,"abstract":"<p>The challenge of cross-domain few-shot learning (CD-FSL) stems from the substantial distribution disparities between target and source domain images, necessitating a model with robust generalization capabilities. In this work, we posit that large-scale pretrained models are pivotal in addressing the cross-domain few-shot learning task owing to their exceptional representational and generalization prowess. To our knowledge, no existing research comprehensively investigates the utility of large-scale pretrained models in the cross-domain few-shot learning context. Addressing this gap, our study presents an exhaustive empirical assessment of the CLIP model within the cross-domain few-shot learning task. We undertake a comparison spanning six dimensions: base model, transfer module, classifier, loss, data augmentation, and training schedule. Furthermore, we establish a straightforward baseline model, E-base, based on our empirical analysis, underscoring the importance of our investigation. Experimental results substantiate the efficacy of our model, yielding a mean gain of 1.2% in 5-way 5-shot evaluations on the BSCD dataset.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"27 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Wen, Nan Jiang, Lang Li, Jie Zhou, Yanpei Li, Hualin Zhan, Guang Kou, Weihao Gu, Jiahui Zhao
With the rise of the mobile Internet and AI, social media that integrates short messages, images, and videos has developed rapidly. As a safeguard for the stable operation of social media, information security, especially graph anomaly detection (GAD), has attracted extensive attention from researchers. Most GAD methods are limited to enhancing homophily or to jointly considering homophilic and heterophilic connections. Nevertheless, due to the deceptive nature of homophilic connections among anomalies, the discriminative information of the anomalies can be eliminated. To alleviate this issue, we propose a novel GAD method, TA-Detector, which introduces the concept of trust into the classification of connections. In particular, the proposed approach adopts a designed trust classifier to distinguish trust and distrust connections under the supervision of labeled nodes. Then, we capture the latent factors related to GAD with graph neural networks (GNNs), which integrate node interaction-type information and node representations. Finally, to identify anomalies in the graph, we use a residual network mechanism to extract deep semantic embedding information related to GAD. Experimental results on two real benchmark datasets verify that our proposed approach boosts overall GAD performance in comparison to benchmark baselines.
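To illustrate the idea of treating trust and distrust connections differently, the following PyTorch sketch aggregates neighbor features separately over the two edge sets and mixes them with each node's own representation; the layer name, mean aggregation, and dimensions are illustrative assumptions, not the TA-Detector architecture.

import torch
import torch.nn as nn

class TrustAwareLayer(nn.Module):
    # Aggregates neighbor features separately over trust and distrust edges,
    # then mixes them with the node's own representation.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(3 * in_dim, out_dim)

    @staticmethod
    def aggregate(x, edge_index):
        # edge_index: (2, E) with rows (source, target); mean aggregation onto targets.
        src, dst = edge_index
        out = torch.zeros_like(x)
        out.index_add_(0, dst, x[src])
        deg = torch.zeros(x.size(0), device=x.device).index_add_(
            0, dst, torch.ones(dst.size(0), device=x.device))
        return out / deg.clamp(min=1).unsqueeze(-1)

    def forward(self, x, trust_edges, distrust_edges):
        h_trust = self.aggregate(x, trust_edges)
        h_distrust = self.aggregate(x, distrust_edges)
        return torch.relu(self.lin(torch.cat([x, h_trust, h_distrust], dim=-1)))

# Toy usage: 4 nodes with 8-d features and a few trust/distrust edges.
x = torch.randn(4, 8)
trust = torch.tensor([[0, 1], [1, 2]])      # edges 0->1, 1->2
distrust = torch.tensor([[3], [0]])         # edge 3->0
layer = TrustAwareLayer(8, 16)
out = layer(x, trust, distrust)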
{"title":"TA-Detector: A GNN-based Anomaly Detector via Trust Relationship","authors":"Jie Wen, Nan Jiang, Lang Li, Jie Zhou, Yanpei Li, Hualin Zhan, Guang Kou, Weihao Gu, Jiahui Zhao","doi":"10.1145/3672401","DOIUrl":"https://doi.org/10.1145/3672401","url":null,"abstract":"<p>With the rise of mobile Internet and AI, social media integrating short messages, images, and videos has developed rapidly. As a guarantee for the stable operation of social media, information security, especially graph anomaly detection (GAD), has become a hot issue inspired by the extensive attention of researchers. Most GAD methods are mainly limited to enhancing the homophily or considering homophily and heterophilic connections. Nevertheless, due to the deceptive nature of homophily connections among anomalies, the discriminative information of the anomalies can be eliminated. To alleviate the issue, we explore a novel method <i>TA-Detector</i> in GAD by introducing the concept of trust into the classification of connections. In particular, the proposed approach adopts a designed trust classier to distinguish trust and distrust connections with the supervision of labeled nodes. Then, we capture the latent factors related to GAD by graph neural networks (GNN), which integrate node interaction type information and node representation. Finally, to identify anomalies in the graph, we use the residual network mechanism to extract the deep semantic embedding information related to GAD. Experimental results on two real benchmark datasets verify that our proposed approach boosts the overall GAD performance in comparison to benchmark baselines.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"76 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhilei Liu, Xiaoxing Liu, Sen Chen, Jiaxing Liu, Longbiao Wang, Chongke Bi
Talking face generation synthesizes a lip-synchronized talking face video from an arbitrary face image and corresponding audio clips. Current talking face models can be divided into four parts: visual feature extraction, audio feature processing, multimodal feature fusion, and a rendering module. For visual feature extraction, existing methods face a complex learning task with noisy features; this paper introduces an attention-based disentanglement module that disentangles the face into an Audio-face and an Identity-face using speech-related facial action unit (AU) information. For multimodal feature fusion, existing methods ignore not only the interaction and relationship of cross-modal information but also the local driving information of the mouth muscles. This study proposes a novel generative framework that incorporates a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to enhance the learning of cross-modal features. The proposed method employs both audio and speech-related facial action units (AUs) as driving information. Speech-related AU information can facilitate more accurate mouth movements. Given the high correlation between speech and speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. Finally, we present a diffusion model for the synthesis of talking face images. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy. Code is available at https://mftfg-au.github.io/Multimodal_Fusion/.
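As an illustration of a dilated non-causal temporal convolution combined with self-attention for multimodal fusion, here is a minimal PyTorch sketch; the module name, the concatenation of the two modalities along time, and the feature sizes are assumptions for illustration rather than the authors' fusion module.

import torch
import torch.nn as nn

class DilatedAttentionFusion(nn.Module):
    # Fuses audio and visual feature sequences: a dilated, non-causal 1-D convolution
    # widens the temporal receptive field, then self-attention mixes the two modalities.
    def __init__(self, dim, heads=4, dilation=2, kernel_size=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # symmetric padding, hence non-causal
        self.temporal = nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_seq, visual_seq):
        # audio_seq, visual_seq: (B, T, dim)
        fused = torch.cat([audio_seq, visual_seq], dim=1)        # concatenate along time
        fused = self.temporal(fused.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.attn(fused, fused, fused)
        return self.norm(fused + attn_out)

# Toy usage: 25 audio frames and 25 video frames of 128-d features.
audio = torch.randn(2, 25, 128)
visual = torch.randn(2, 25, 128)
out = DilatedAttentionFusion(128)(audio, visual)   # (2, 50, 128)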
{"title":"Multimodal Fusion for Talking Face Generation Utilizing Speech-related Facial Action Units","authors":"Zhilei Liu, Xiaoxing Liu, Sen Chen, Jiaxing Liu, Longbiao Wang, Chongke Bi","doi":"10.1145/3672565","DOIUrl":"https://doi.org/10.1145/3672565","url":null,"abstract":"<p>Talking face generation is to synthesize a lip-synchronized talking face video by inputting an arbitrary face image and corresponding audio clips. The current talking face model can be divided into four parts: visual feature extraction, audio feature processing, multimodal feature fusion, and rendering module. For the visual feature extraction part, existing methods face the challenge of complex learning task with noisy features, this paper introduces an attention-based disentanglement module to disentangle the face into Audio-face and Identity-face using speech-related facial action unit (AU) information. For the multimodal feature fusion part, existing methods ignore not only the interaction and relationship of cross-modal information but also the local driving information of the mouth muscles. This study proposes a novel generative framework that incorporates a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to enhance the learning of cross-modal features. The proposed method employs both audio- and speech-related facial action units (AUs) as driving information. Speech-related AU information can facilitate more accurate mouth movements. Given the high correlation between speech and speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. Finally, we present a diffusion model for the synthesis of talking face images. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy. Code is available at https://mftfg-au.github.io/Multimodal_Fusion/.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yiling Xu, Yujie Zhang, Qi Yang, Xiaozhong Xu, Shan Liu
In recent years, many standardized algorithms for point cloud compression (PCC) have been developed and have achieved remarkable compression ratios. To provide guidance for rate-distortion optimization and codec evaluation, point cloud quality assessment (PCQA) has become a critical problem for PCC. Therefore, to correlate more consistently with human visual perception of compressed point clouds, we propose in this paper a full-reference PCQA algorithm tailored for static point clouds, which jointly measures geometry and attribute deformations. Specifically, we assume that the quality decision on compressed point clouds is determined by both global appearance (e.g., density, contrast, complexity) and local details (e.g., gradient, hole). Motivated by the nature of compression distortions and the properties of the human visual system, we derive perceptually effective features for these two categories, such as content complexity, luminance/geometry gradient, and hole probability. By systematically incorporating measurements of variations in the local and global characteristics, we derive an effective quality index for the input compressed point clouds. Extensive experiments and analyses conducted on popular PCQA databases show the superiority of the proposed method in evaluating compression distortions. Subsequent investigations validate the efficacy of the different components of the model design.
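To make the global-plus-local idea concrete, the following NumPy/SciPy sketch combines one crude global statistic (luminance contrast) with one crude local statistic (nearest-neighbor geometric and luminance differences) into a toy full-reference index; the features and weighting are illustrative assumptions, not the proposed quality model.

import numpy as np
from scipy.spatial import cKDTree

def toy_quality_index(ref_xyz, ref_lum, dist_xyz, dist_lum, alpha=0.5):
    # ref_xyz/dist_xyz: (N, 3) coordinates; ref_lum/dist_lum: (N,) luminance values.
    # Global term: difference in luminance contrast (standard deviation).
    global_term = abs(ref_lum.std() - dist_lum.std())
    # Local term: for each distorted point, distance to its nearest reference point
    # plus the luminance difference to that neighbor (a crude local-detail measure).
    tree = cKDTree(ref_xyz)
    d, idx = tree.query(dist_xyz, k=1)
    local_term = d.mean() + np.abs(dist_lum - ref_lum[idx]).mean()
    # Lower is better; alpha balances global appearance against local details.
    return alpha * global_term + (1 - alpha) * local_term

# Toy usage with a random reference cloud and a slightly perturbed copy.
ref = np.random.rand(1000, 3); ref_l = np.random.rand(1000)
dist = ref + 0.01 * np.random.randn(1000, 3); dist_l = ref_l + 0.05 * np.random.randn(1000)
score = toy_quality_index(ref, ref_l, dist, dist_l)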
{"title":"Compressed Point Cloud Quality Index by Combining Global Appearance and Local Details","authors":"Yiling Xu, Yujie Zhang, Qi Yang, Xiaozhong Xu, Shan Liu","doi":"10.1145/3672567","DOIUrl":"https://doi.org/10.1145/3672567","url":null,"abstract":"<p>In recent years, many standardized algorithms for point cloud compression (PCC) has been developed and achieved remarkable compression ratios. To provide guidance for rate-distortion optimization and codec evaluation, point cloud quality assessment (PCQA) has become a critical problem for PCC. Therefore, in order to achieve a more consistent correlation with human visual perception of a compressed point cloud, we propose a full-reference PCQA algorithm tailored for static point clouds in this paper, which can jointly measure geometry and attribute deformations. Specifically, we assume that the quality decision of compressed point clouds is determined by both global appearance (e.g., density, contrast, complexity) and local details (e.g., gradient, hole). Motivated by the nature of compression distortions and the properties of the human visual system, we derive perceptually effective features for the above two categories, such as content complexity, luminance/ geometry gradient, and hole probability. Through systematically incorporating measurements of variations in the local and global characteristics, we derive an effective quality index for the input compressed point clouds. Extensive experiments and analyses conducted on popular PCQA databases show the superiority of the proposed method in evaluating compression distortions. Subsequent investigations validate the efficacy of different components within the model design.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"167 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting depth maps from monocular images has achieved impressive performance in recent years. However, most depth estimation methods are trained with paired image-depth map data or multi-view images (e.g., stereo pairs and monocular sequences), which suffer from expensive annotation costs and poor transferability. Although unsupervised domain adaptation methods have been introduced to mitigate the reliance on annotated data, few works focus on unsupervised cross-scenario indoor monocular depth estimation. In this paper, we study the generalization of depth estimation models across different indoor scenarios in an adversarial domain adaptation paradigm. Concretely, a domain discriminator is designed to discriminate the representations from the source and target domains, while the feature extractor aims to confuse the domain discriminator by capturing domain-invariant features. Further, we reconstruct depth maps from latent representations with the supervision of labeled source data. As a result, the features learned by the feature extractor are both domain-invariant and of low source risk, and the depth estimator can deal with the domain shift between source and target domains. We conduct cross-scenario and cross-dataset experiments on the ScanNet and NYU-Depth-v2 datasets to verify the effectiveness of our method, which achieves impressive performance.
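The adversarial adaptation described here is commonly realized with a gradient reversal layer between the feature extractor and the domain discriminator; the sketch below shows that standard mechanism in PyTorch, with illustrative feature dimensions, and is not claimed to be the authors' exact architecture.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses (and scales) gradients in the backward pass,
    # so the feature extractor learns to confuse the domain discriminator.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# A small domain discriminator over pooled features (dimensions are illustrative).
domain_disc = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 2),                       # source vs. target domain
)

# Toy usage: adversarial domain loss on reversed features from some encoder.
feats = torch.randn(8, 256, requires_grad=True)
domain_labels = torch.randint(0, 2, (8,))
logits = domain_disc(grad_reverse(feats))
loss = nn.CrossEntropyLoss()(logits, domain_labels)
loss.backward()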
{"title":"Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation","authors":"Jiehua Zhang, Liang Li, Chenggang Yan, Zhan Wang, Changliang Xu, Jiyong Zhang, Chuqiao Chen","doi":"10.1145/3672397","DOIUrl":"https://doi.org/10.1145/3672397","url":null,"abstract":"<p>Predicting depth maps from monocular images has made an impressive performance in the past years. However, most depth estimation methods are trained with paired image-depth map data or multi-view images (e.g., stereo pair and monocular sequence), which suffer from expensive annotation costs and poor transferability. Although unsupervised domain adaptation methods are introduced to mitigate the reliance on annotated data, rare works focus on the unsupervised cross-scenario indoor monocular depth estimation. In this paper, we propose to study the generalization of depth estimation models across different indoor scenarios in an adversarial-based domain adaptation paradigm. Concretely, a domain discriminator is designed for discriminating the representation from source and target domains, while the feature extractor aims to confuse the domain discriminator by capturing domain-invariant features. Further, we reconstruct depth maps from latent representations with the supervision of labeled source data. As a result, the feature extractor learned features possess the merit of both domain-invariant and low source risk, and the depth estimator can deal with the domain shift between source and target domains. We conduct the cross-scenario and cross-dataset experiments on the ScanNet and NYU-Depth-v2 datasets to verify the effectiveness of our method and achieve impressive performance.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"36 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to effectively utilize unlabeled data for training is a key problem in Semi-Supervised Learning (SSL). Existing SSL methods often use only the unlabeled data whose predictions exceed a fixed threshold (e.g., 0.95) and discard those below 0.95. We argue that these discarded data constitute a large proportion of the unlabeled set, are hard samples, and will benefit model training if used properly. In this paper, we propose a novel method to take full advantage of the unlabeled data, termed DTS-SimL, which includes two core designs: Dual-Threshold Screening and Similarity Learning. In addition to the fixed threshold, DTS-SimL extracts a class-adaptive threshold from the labeled data. This class-adaptive threshold can admit for model training many unlabeled samples whose predictions are below 0.95 but above the extracted threshold. On the other hand, we design a new similarity loss to perform similarity learning for all highly similar unlabeled data, which further mines the valuable information in the unlabeled data. Finally, for more effective training of DTS-SimL, we construct an overall loss function by assigning four different losses to four different types of data. Extensive experiments are conducted on five benchmark datasets, including CIFAR-10, CIFAR-100, SVHN, Mini-ImageNet, and DomainNet-Real. Experimental results show that the proposed DTS-SimL achieves state-of-the-art classification accuracy. The code is publicly available at https://github.com/GZHU-DVL/DTS-SimL.
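A minimal sketch of the dual-threshold screening idea follows: predictions above the fixed threshold are kept as usual, while predictions below it but above a class-adaptive threshold are rescued for training. Deriving the class-adaptive threshold as the per-class mean confidence on labeled data is an assumption made for illustration, not necessarily how DTS-SimL computes it.

import torch

def screen_pseudo_labels(unlabeled_probs, class_thresholds, fixed_threshold=0.95):
    # unlabeled_probs: (N, C) softmax outputs on unlabeled data.
    # class_thresholds: (C,) class-adaptive thresholds (e.g., estimated from labeled data).
    conf, pseudo = unlabeled_probs.max(dim=-1)
    high_mask = conf >= fixed_threshold                                  # confidently pseudo-labeled
    adaptive_mask = (~high_mask) & (conf >= class_thresholds[pseudo])    # rescued hard samples
    return pseudo, high_mask, adaptive_mask

# Toy usage: adaptive thresholds taken as per-class mean confidence on labeled data.
labeled_probs = torch.softmax(torch.randn(100, 10), dim=-1)
labels = torch.arange(10).repeat_interleave(10)
class_thr = torch.stack([labeled_probs[labels == c].max(dim=-1).values.mean()
                         for c in range(10)])
unlabeled_probs = torch.softmax(torch.randn(500, 10), dim=-1)
pseudo, strong, rescued = screen_pseudo_labels(unlabeled_probs, class_thr)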
{"title":"Boosting Semi-Supervised Learning with Dual-Threshold Screening and Similarity Learning","authors":"Zechen Liang, Yuan-Gen Wang, Wei Lu, Xiaochun Cao","doi":"10.1145/3672563","DOIUrl":"https://doi.org/10.1145/3672563","url":null,"abstract":"<p>How to effectively utilize unlabeled data for training is a key problem in Semi-Supervised Learning (SSL). Existing SSL methods often consider the unlabeled data whose predictions are beyond a fixed threshold (e.g., 0.95), and discard those less than 0.95. We argue that these discarded data have a large proportion, are of hard sample, and will benefit the model training if used properly. In this paper, we propose a novel method to take full advantage of the unlabeled data, termed DTS-SimL, which includes two core designs: Dual-Threshold Screening and Similarity Learning. Except for the fixed threshold, DTS-SimL extracts another class-adaptive threshold from the labeled data. Such a class-adaptive threshold can screen many unlabeled data whose predictions are lower than 0.95 but over the extracted one for model training. On the other hand, we design a new similar loss to perform similarity learning for all the highly-similar unlabeled data, which can further mine the valuable information from the unlabeled data. Finally, for more effective training of DTS-SimL, we construct an overall loss function by assigning four different losses to four different types of data. Extensive experiments are conducted on five benchmark datasets, including CIFAR-10, CIFAR-100, SVHN, Mini-ImageNet, and DomainNet-Real. Experimental results show that the proposed DTS-SimL achieves state-of-the-art classification accuracy. The code is publicly available at <i> https://github.com/GZHU-DVL/DTS-SimL.</i></p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"44 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rui Yang, Rushi Lan, Zhenrong Deng, Xiaonan Luo, Xiyan Sun
With the development of Metaverse technology, avatars in the Metaverse face serious security and privacy concerns. Analyzing facial features to distinguish between genuine and manipulated facial videos holds significant research importance for ensuring the authenticity of characters in the virtual world, for mitigating discrimination, and for preventing malicious use of facial data. To address this issue, the Facial Feature Points and Ch-Transformer (FFP-ChT) deepfake video detection model is designed based on two cues: the different distributions of facial feature points in real and fake videos, and the different inter-frame displacement distances of real and fake facial feature points. The input face video is first processed by the BlazeFace detection model, and the face detection results are fed into the FaceMesh model to extract 468 facial feature points. The Lucas-Kanade (LK) optical flow method is then used to track the facial points, a face calibration algorithm is introduced to re-calibrate the facial feature points, and the jitter displacement is calculated by tracking the facial feature points between frames. Finally, the Class-head (Ch) is designed in the transformer, and the facial feature points and their displacements are jointly classified through the Ch-Transformer model. In this way, the designed Ch-Transformer classifier is able to accurately and effectively identify deepfake videos. Experiments on open datasets clearly demonstrate the effectiveness and generalization capabilities of our approach.
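To illustrate the inter-frame tracking step, the following OpenCV sketch tracks a set of facial landmarks with Lucas-Kanade optical flow and measures their per-point displacement ("jitter"); the landmarks are assumed to be already extracted (e.g., 468 FaceMesh points), and the paper's calibration step is omitted.

import cv2
import numpy as np

def track_jitter(prev_gray, next_gray, points):
    # points: (N, 2) float32 facial landmarks in the previous frame.
    prev_pts = points.reshape(-1, 1, 2).astype(np.float32)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, prev_pts, None)
    status = status.reshape(-1).astype(bool)
    # Per-landmark displacement between consecutive frames ("jitter").
    disp = np.linalg.norm(next_pts.reshape(-1, 2) - points, axis=1)
    disp[~status] = 0.0                      # ignore points that failed to track
    return next_pts.reshape(-1, 2), disp

# Toy usage with synthetic grayscale frames and random landmark positions.
prev_frame = np.random.randint(0, 255, (256, 256), dtype=np.uint8)
next_frame = np.roll(prev_frame, 1, axis=1)              # simulate a 1-pixel horizontal shift
landmarks = np.random.rand(468, 2).astype(np.float32) * 255
tracked, jitter = track_jitter(prev_frame, next_frame, landmarks)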
{"title":"Deepfake Video Detection Using Facial Feature Points and Ch-Transformer","authors":"Rui Yang, Rushi Lan, Zhenrong Deng, Xiaonan Luo, Xiyan Sun","doi":"10.1145/3672566","DOIUrl":"https://doi.org/10.1145/3672566","url":null,"abstract":"<p>With the development of Metaverse technology, the avatar in Metaverse has faced serious security and privacy concerns. Analyzing facial features to distinguish between genuine and manipulated facial videos holds significant research importance for ensuring the authenticity of characters in the virtual world and for mitigating discrimination, as well as preventing malicious use of facial data. To address this issue, the Facial Feature Points and Ch-Transformer (FFP-ChT) deepfake video detection model is designed based on the clues of different facial feature points distribution in real and fake videos and different displacement distances of real and fake facial feature points between frames. The face video input is first detected by the BlazeFace model, and the face detection results are fed into the FaceMesh model to extract 468 facial feature points. Then the Lucas-Kanade (LK) optical flow method is used to track the points of the face, the face calibration algorithm is introduced to re-calibrate the facial feature points, and the jitter displacement is calculated by tracking the facial feature points between frames. Finally, the Class-head (Ch) is designed in the transformer, and the facial feature points and facial feature point displacement are jointly classified through the Ch-Transformer model. In this way, the designed Ch-Transformer classifier is able to accurately and effectively identify deepfake videos. Experiments on open datasets clearly demonstrate the effectiveness and generalization capabilities of our approach.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"344 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The state-of-the-art G-PCC (geometry-based point cloud compression) (Octree) is a fine-grained approach: it uses the octree to partition point clouds into voxels and predicts them based on neighbor occupancy in narrower spaces. However, G-PCC (Octree) is less effective at compressing dense point clouds than multi-grained approaches (such as G-PCC (Trisoup)), which exploit the continuous point distribution in nodes partitioned by a pruned octree over larger spaces. Therefore, we propose a lossy multi-grained compression scheme with an extended octree and dual-model prediction. The extended octree, in which each partitioned node contains intra-block and extra-block points, addresses poor prediction (such as overfitting) at the node edges of the octree partition. For the points of each multi-grained node, dual-model prediction fits surfaces and projects residuals onto them, reducing both the projection residuals for efficient 2D compression and the fitting complexity. In addition, a hybrid DWT-DCT transform for the 2D projection residuals mitigates the resolution degradation of the DWT and the blocking effect of the DCT under high compression. Experimental results demonstrate the superior performance of our method over the advanced G-PCC (Octree), achieving BD-rate gains of 55.9% and 45.3% for point-to-point (D1) and point-to-plane (D2) distortions, respectively. Our approach also outperforms G-PCC (Octree) and G-PCC (Trisoup) in subjective evaluation.
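A minimal sketch of a hybrid DWT-DCT transform on a 2D residual block is shown below, using PyWavelets and SciPy: a one-level DWT separates subbands, a DCT is applied to the low-frequency subband, and a uniform quantizer stands in for rate control. The wavelet choice, quantization step, and the decision to transform only the approximation subband are illustrative assumptions, not the paper's codec design.

import numpy as np
import pywt
from scipy.fft import dctn, idctn

def hybrid_dwt_dct(residual, q_step=8.0):
    # residual: (H, W) 2-D projection residual block (H and W assumed even).
    # One-level 2-D DWT separates a low-frequency approximation from detail subbands.
    ll, (lh, hl, hh) = pywt.dwt2(residual, "haar")
    # DCT on the approximation subband concentrates its remaining energy.
    ll_dct = dctn(ll, norm="ortho")
    # Crude uniform quantization stands in for the actual rate control.
    ll_q = np.round(ll_dct / q_step)
    # Inverse path: de-quantize, inverse DCT, inverse DWT.
    ll_rec = idctn(ll_q * q_step, norm="ortho")
    return pywt.idwt2((ll_rec, (lh, hl, hh)), "haar")

# Toy usage on a random residual block.
block = np.random.randn(64, 64)
rec = hybrid_dwt_dct(block)
err = np.abs(block - rec).mean()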
{"title":"Multi-grained Point Cloud Geometry Compression via Dual-model Prediction with Extended Octree","authors":"Tai Qin, Ge Li, Wei Gao, Shan Liu","doi":"10.1145/3671001","DOIUrl":"https://doi.org/10.1145/3671001","url":null,"abstract":"<p>The state-of-the-art G-PCC (geometry-based point cloud compression) (Octree) is the fine-grained approach, which uses the octree to partition point clouds into voxels and predicts them based on neighbor occupancy in narrower spaces. However, G-PCC (Octree) is less effective at compressing dense point clouds than multi-grained approaches (such as G-PCC (Trisoup)), which exploit the continuous point distribution in nodes partitioned by the pruned octree over larger spaces. Therefore, we propose a lossy multi-grained compression with extended octree and dual-model prediction. The extended octree, where each partitioned node contains intra-block and extra-block points, is applied to address poor prediction (such as overfitting) at the node edges of the octree partition. For the points of each multi-grained node, dual-model prediction fits surfaces and projects residuals onto the surfaces, reducing projection residuals for efficient 2D compression and fitting complexity. In addition, a hybrid DWT-DCT transform for 2D projection residuals mitigates the resolution degradation of DWT and the blocking effect of DCT during high compression. Experimental results demonstrate the superior performance of our method over advanced G-PCC (Octree), achieving BD-rate gains of 55.9% and 45.3% for point-to-point (<i>D1</i>) and point-to-plane (<i>D2</i>) distortions, respectively. Our approach also outperforms G-PCC (Octree) and G-PCC (Trisoup) in subjective evaluation.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"39 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen Chen, Lingfeng Qu, Hadi Amirpour, Xingjun Wang, Christian Timmerer, Zhihong Tian
With the growing applications of video, ensuring its security has become of utmost importance. Selective encryption (SE) has gained significant attention in the field of video content protection due to its compatibility with video codecs, favorable visual distortion, and low time complexity. However, few studies consider the security of SE under cryptographic attacks. To fill this gap, we analyze the security concerns of bitstreams encrypted by SE schemes and propose two known-plaintext attacks (KPAs). A corresponding defense against the KPAs is then presented. To validate the effectiveness of the KPAs, they are applied to attack two existing SE schemes that achieve superior visual degradation in HEVC videos. Firstly, the encrypted bitstreams are generated using the HEVC encoder with SE (HESE). Secondly, the video sequences are encoded using H.265/HEVC, and the selected syntax elements are recorded during encoding. The recorded syntax elements are then imported into the HEVC decoder using decryption (HDD). By utilizing the encryption parameters and the imported data in the HDD, it becomes possible to reconstruct a significant portion of the original syntax elements before encryption. Finally, the reconstructed syntax elements are compared with the encrypted syntax elements in the HDD, allowing the design of a pseudo-key stream (PKS) through the inverse of the encryption operations. The PKS is used to decrypt the existing SE schemes, and the experimental results provide evidence that the two existing SE schemes are vulnerable to the proposed KPAs. In the case of single-bitstream estimation (SBE), the average correct rate of key stream estimation exceeds 93%. Moreover, with multi-bitstream complementation (MBC), the average estimation accuracy can be further improved to 99%.
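The pseudo-key-stream idea can be illustrated with a highly simplified model in which encryption is a bitwise XOR of syntax-element bits with a key stream: knowing the reconstructed plaintext and the observed ciphertext reveals the key stream, which then decrypts other data protected with the same keys. The byte-level XOR below is an assumption made for illustration and glosses over the actual HEVC syntax-element encryption.

import numpy as np

def estimate_key_stream(reconstructed_bits, encrypted_bits):
    # For XOR-style stream encryption: ciphertext = plaintext XOR key,
    # so key = plaintext XOR ciphertext wherever the plaintext is known.
    return np.bitwise_xor(reconstructed_bits, encrypted_bits)

def decrypt_with_key(ciphertext_bits, key_stream):
    return np.bitwise_xor(ciphertext_bits, key_stream)

# Toy usage: the "syntax elements" are random bytes, and the key stream is reused.
rng = np.random.default_rng(0)
plaintext = rng.integers(0, 256, 64, dtype=np.uint8)      # known/reconstructed elements
key = rng.integers(0, 256, 64, dtype=np.uint8)            # secret key stream
ciphertext = np.bitwise_xor(plaintext, key)
pseudo_key = estimate_key_stream(plaintext, ciphertext)    # equals the true key stream
other_cipher = np.bitwise_xor(rng.integers(0, 256, 64, dtype=np.uint8), key)
recovered = decrypt_with_key(other_cipher, pseudo_key)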
{"title":"On the Security of Selectively Encrypted HEVC Video Bitstreams","authors":"Chen Chen, Lingfeng Qu, Hadi Amirpour, Xingjun Wang, Christian Timmerer, Zhihong Tian","doi":"10.1145/3672568","DOIUrl":"https://doi.org/10.1145/3672568","url":null,"abstract":"<p>With the growing applications of video, ensuring its security has become of utmost importance. Selective encryption (SE) has gained significant attention in the field of video content protection due to its compatibility with video codecs, favorable visual distortion, and low time complexity. However, few studies consider SE security under cryptographic attacks. To fill this gap, we analyze the security concerns of encrypted bitstreams by SE schemes and propose two known plaintext attacks (KPAs). Then the corresponding defense is presented against the KPAs. To validate the effectiveness of the KPA, it is applied to attack two existing SE schemes with superior visual degradation in HEVC videos. Firstly, the encrypted bitstreams are generated using the HEVC encoder with SE (HESE). Secondly, the video sequences are encoded using H.265/HEVC. During encoding, the selected syntax elements are recorded. Then the recorded syntax elements are imported into the HEVC decoder using decryption (HDD). By utilizing the encryption parameters and the imported data in the HDD, it becomes possible to reconstruct a significant portion of the original syntax elements before encryption. Finally, the reconstructed syntax elements are compared with the encrypted syntax elements in the HDD, allowing the design of a pseudo-key stream (PKS) through the inverse of the encryption operations. The PKS is used to decrypt the existing SE scheme, and the experimental results provide evidence that the two existing SE schemes are vulnerable to the proposed KPAs. In the case of single bitstream estimation (SBE), the average correct rate of key stream estimation exceeds 93%. Moreover, with multi-bitstream complementation (MBC), the average estimation accuracy can be further improved to 99%.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"82 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}