
Latest Publications in ACM Transactions on Multimedia Computing Communications and Applications

KF-VTON: Keypoints-Driven Flow Based Virtual Try-On Network
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-19 | DOI: 10.1145/3673903
Zizhao Wu, Siyu Liu, Peioyan Lu, Ping Yang, Yongkang Wong, Xiaoling Gu, Mohan S. Kankanhalli

Image-based virtual try-on aims to fit a target garment to a reference person. Most existing methods are limited to solving the Garment-To-Person (G2P) try-on task that transfers a garment from a clean product image to the reference person and do not consider the Person-To-Person (P2P) try-on task that transfers a garment from a clothed person image to the reference person, which limits the practical applicability. The P2P try-on task is more challenging due to spatial discrepancies caused by different poses, body shapes, and views between the reference person and the target person. To address this issue, we propose a novel Keypoints-Driven Flow Based Virtual Try-On Network (KF-VTON) for handling both the G2P and P2P try-on tasks. Our KF-VTON has two key innovations: 1) We propose a new keypoints-driven flow based deformation model to warp the garment. This model establishes spatial correspondences between the target garment and reference person by combining the robustness of Thin-plate Spline (TPS) based deformation and the flexibility of appearance flow based deformation. 2) We investigate a powerful Context-aware Spatially Adaptive Normalization (CSAN) generative module to synthesize the final try-on image. Particularly, CSAN integrates rich contextual information with semantic parsing guidance to properly infer unobserved garment appearances. Extensive experiments demonstrate that our KF-VTON is capable of producing photo-realistic and high-fidelity try-on results for the G2P as well as P2P try-on tasks and surpasses previous state-of-the-art methods both quantitatively and qualitatively. Our code is available at https://github.com/OIUIU/KF-VTON.
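To make the warping idea concrete, below is a minimal PyTorch sketch of appearance-flow-based garment warping, one ingredient of the keypoints-driven flow deformation described in the abstract. The flow field is random noise standing in for the network's prediction, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of appearance-flow warping with grid_sample; the flow tensor
# stands in for a predicted flow field, and shapes are assumed for illustration.
import torch
import torch.nn.functional as F

def warp_with_flow(garment: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a garment image (N, C, H, W) with a per-pixel offset field (N, 2, H, W).

    Offsets are expressed in normalized [-1, 1] coordinates, as grid_sample expects.
    """
    n, _, h, w = garment.shape
    # Identity sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Add the predicted offsets (flow) to the identity grid.
    grid = base_grid + flow.permute(0, 2, 3, 1)
    return F.grid_sample(garment, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

garment = torch.rand(1, 3, 256, 192)          # product image of the target garment
flow = 0.05 * torch.randn(1, 2, 256, 192)     # stand-in for the predicted appearance flow
warped = warp_with_flow(garment, flow)
print(warped.shape)  # torch.Size([1, 3, 256, 192])
```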

Citations: 0
Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-19 | DOI: 10.1145/3673231
Linhai Zhuo, Yuqian Fu, Jingjing Chen, Yixin Cao, Yu-Gang Jiang

The challenge of cross-domain few-shot learning (CD-FSL) stems from the substantial distribution disparities between target and source domain images, necessitating a model with robust generalization capabilities. In this work, we posit that large-scale pretrained models are pivotal in addressing the cross-domain few-shot learning task owing to their exceptional representational and generalization prowess. To our knowledge, no existing research comprehensively investigates the utility of large-scale pretrained models in the cross-domain few-shot learning context. Addressing this gap, our study presents an exhaustive empirical assessment of the CLIP model within the cross-domain few-shot learning task. We undertake a comparison spanning six dimensions: base model, transfer module, classifier, loss, data augmentation, and training schedule. Furthermore, we establish a straightforward baseline model, E-base, based on our empirical analysis, underscoring the importance of our investigation. Experimental results substantiate the efficacy of our model, yielding a mean gain of 1.2% in 5-way 5-shot evaluations on the BSCD dataset.
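As a hedged illustration of the kind of lightweight transfer module such a study compares, the sketch below runs a 5-way 5-shot prototype classifier over frozen image embeddings. The random embeddings stand in for CLIP visual features; the prototype rule is a common baseline choice, not necessarily the paper's E-base model.

```python
# 5-way 5-shot prototype classification over frozen embeddings (stand-ins for
# CLIP image features); classes are predicted by cosine similarity to prototypes.
import torch
import torch.nn.functional as F

def prototype_classify(support: torch.Tensor, support_labels: torch.Tensor,
                       query: torch.Tensor, n_way: int) -> torch.Tensor:
    """Predict query classes via cosine similarity to per-class mean prototypes."""
    support = F.normalize(support, dim=-1)
    query = F.normalize(query, dim=-1)
    protos = torch.stack([support[support_labels == c].mean(dim=0) for c in range(n_way)])
    protos = F.normalize(protos, dim=-1)
    logits = query @ protos.t()          # cosine similarities
    return logits.argmax(dim=-1)

n_way, k_shot, n_query, dim = 5, 5, 15, 512
support = torch.randn(n_way * k_shot, dim)               # stand-in frozen features
support_labels = torch.arange(n_way).repeat_interleave(k_shot)
query = torch.randn(n_way * n_query, dim)
preds = prototype_classify(support, support_labels, query, n_way)
print(preds.shape)  # torch.Size([75])
```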

Citations: 0
TA-Detector: A GNN-based Anomaly Detector via Trust Relationship
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-19 | DOI: 10.1145/3672401
Jie Wen, Nan Jiang, Lang Li, Jie Zhou, Yanpei Li, Hualin Zhan, Guang Kou, Weihao Gu, Jiahui Zhao

With the rise of the mobile Internet and AI, social media integrating short messages, images, and videos has developed rapidly. As a guarantee of the stable operation of social media, information security, especially graph anomaly detection (GAD), has become a hot topic that has attracted extensive attention from researchers. Most GAD methods are limited to enhancing homophily or to considering both homophilic and heterophilic connections. Nevertheless, due to the deceptive nature of homophilic connections among anomalies, the discriminative information of the anomalies can be eliminated. To alleviate this issue, we explore a novel GAD method, TA-Detector, which introduces the concept of trust into the classification of connections. In particular, the proposed approach adopts a designed trust classifier to distinguish trust from distrust connections under the supervision of labeled nodes. Then, we capture the latent factors related to GAD with graph neural networks (GNN), which integrate node interaction type information and node representations. Finally, to identify anomalies in the graph, we use a residual network mechanism to extract deep semantic embedding information related to GAD. Experimental results on two real benchmark datasets verify that our proposed approach boosts overall GAD performance in comparison to benchmark baselines.
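The sketch below illustrates the trust-classification idea on edges: score each connection as trust versus distrust from the embeddings of its two endpoints. The single linear encoder, tensor shapes, and module names are assumptions for illustration, not the authors' TA-Detector code.

```python
# Edge-level trust scoring from endpoint embeddings; the linear "encoder"
# stands in for a GNN, and all tensors are random placeholders.
import torch
import torch.nn as nn

class TrustClassifier(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)           # stand-in for a GNN encoder
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.encoder(x))
        src, dst = edge_index                               # (2, E) tensor of node indices
        pair = torch.cat([h[src], h[dst]], dim=-1)
        return torch.sigmoid(self.scorer(pair)).squeeze(-1) # trust probability per edge

x = torch.randn(100, 16)                                    # node features
edge_index = torch.randint(0, 100, (2, 500))                # random edges
trust_prob = TrustClassifier(in_dim=16)(x, edge_index)
# Edges with low trust probability could be down-weighted or pruned before
# anomaly scoring, following the intuition described in the abstract.
print(trust_prob.shape)  # torch.Size([500])
```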

Citations: 0
Multimodal Fusion for Talking Face Generation Utilizing Speech-related Facial Action Units
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-17 | DOI: 10.1145/3672565
Zhilei Liu, Xiaoxing Liu, Sen Chen, Jiaxing Liu, Longbiao Wang, Chongke Bi

Talking face generation aims to synthesize a lip-synchronized talking face video from an arbitrary face image and corresponding audio clips. Current talking face models can be divided into four parts: visual feature extraction, audio feature processing, multimodal feature fusion, and a rendering module. For visual feature extraction, existing methods face the challenge of a complex learning task with noisy features; this paper therefore introduces an attention-based disentanglement module that disentangles the face into an Audio-face and an Identity-face using speech-related facial action unit (AU) information. For multimodal feature fusion, existing methods ignore not only the interaction and relationship of cross-modal information but also the local driving information of the mouth muscles. This study proposes a novel generative framework that incorporates a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to enhance the learning of cross-modal features. The proposed method employs both audio- and speech-related facial action units (AUs) as driving information. Speech-related AU information can facilitate more accurate mouth movements. Given the high correlation between speech and speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. Finally, we present a diffusion model for the synthesis of talking face images. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy. Code is available at https://mftfg-au.github.io/Multimodal_Fusion/.
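As a rough sketch of the fusion flavor described above, the block below combines a dilated non-causal temporal convolution with self-attention over per-frame audio and visual features. The layer sizes and the simple additive fusion are assumptions, not the paper's architecture.

```python
# Dilated non-causal temporal convolution + self-attention fusion block;
# feature tensors are random stand-ins for per-frame audio/visual embeddings.
import torch
import torch.nn as nn

class TemporalFusionBlock(nn.Module):
    def __init__(self, dim: int = 128, dilation: int = 2, heads: int = 4):
        super().__init__()
        # "Non-causal" here means symmetric padding, so each frame sees both
        # past and future context.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat, visual_feat: (batch, frames, dim)
        fused = audio_feat + visual_feat                 # naive early fusion
        t = self.conv(fused.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.attn(t, t, t)
        return self.norm(t + attn_out)

audio_feat = torch.randn(2, 50, 128)    # stand-in per-frame audio/AU features
visual_feat = torch.randn(2, 50, 128)   # stand-in per-frame identity-face features
out = TemporalFusionBlock()(audio_feat, visual_feat)
print(out.shape)  # torch.Size([2, 50, 128])
```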

Citations: 0
Compressed Point Cloud Quality Index by Combining Global Appearance and Local Details
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-15 | DOI: 10.1145/3672567
Yiling Xu, Yujie Zhang, Qi Yang, Xiaozhong Xu, Shan Liu

In recent years, many standardized algorithms for point cloud compression (PCC) have been developed and have achieved remarkable compression ratios. To provide guidance for rate-distortion optimization and codec evaluation, point cloud quality assessment (PCQA) has become a critical problem for PCC. Therefore, in order to achieve a more consistent correlation with human visual perception of compressed point clouds, we propose a full-reference PCQA algorithm tailored for static point clouds in this paper, which can jointly measure geometry and attribute deformations. Specifically, we assume that the quality judgment of compressed point clouds is determined by both global appearance (e.g., density, contrast, complexity) and local details (e.g., gradient, holes). Motivated by the nature of compression distortions and the properties of the human visual system, we derive perceptually effective features for the above two categories, such as content complexity, luminance/geometry gradient, and hole probability. By systematically incorporating measurements of variations in these local and global characteristics, we derive an effective quality index for the input compressed point clouds. Extensive experiments and analyses conducted on popular PCQA databases show the superiority of the proposed method in evaluating compression distortions. Subsequent investigations validate the efficacy of the different components of the model design.
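The snippet below sketches one plausible local-detail feature in this spirit: a luminance-gradient statistic computed over k-nearest-neighbor patches of a point cloud. The point cloud is random and the distance-weighted pooling rule is an assumption for illustration, not the paper's exact feature.

```python
# Luminance-gradient statistic over k-NN neighborhoods of a point cloud;
# geometry and luminance are random stand-ins for a real (reference) cloud.
import numpy as np
from scipy.spatial import cKDTree

def luminance_gradient_feature(xyz: np.ndarray, lum: np.ndarray, k: int = 9) -> float:
    """Mean magnitude of luminance differences between each point and its neighbors,
    weighted by inverse geometric distance (assumed pooling rule)."""
    tree = cKDTree(xyz)
    dists, idx = tree.query(xyz, k=k + 1)          # first neighbor is the point itself
    dists, idx = dists[:, 1:], idx[:, 1:]
    lum_diff = np.abs(lum[idx] - lum[:, None])     # (N, k) luminance differences
    weights = 1.0 / (dists + 1e-6)
    grad = (lum_diff * weights).sum(axis=1) / weights.sum(axis=1)
    return float(grad.mean())

rng = np.random.default_rng(0)
xyz = rng.random((2048, 3))                        # stand-in geometry
lum = rng.random(2048)                             # stand-in per-point luminance
print(luminance_gradient_feature(xyz, lum))
```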

Citations: 0
Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-13 | DOI: 10.1145/3672397
Jiehua Zhang, Liang Li, Chenggang Yan, Zhan Wang, Changliang Xu, Jiyong Zhang, Chuqiao Chen

Predicting depth maps from monocular images has achieved impressive performance in recent years. However, most depth estimation methods are trained with paired image-depth map data or multi-view images (e.g., stereo pairs and monocular sequences), which suffer from expensive annotation costs and poor transferability. Although unsupervised domain adaptation methods have been introduced to mitigate the reliance on annotated data, few works focus on unsupervised cross-scenario indoor monocular depth estimation. In this paper, we propose to study the generalization of depth estimation models across different indoor scenarios in an adversarial domain adaptation paradigm. Concretely, a domain discriminator is designed to discriminate the representations from the source and target domains, while the feature extractor aims to confuse the domain discriminator by capturing domain-invariant features. Further, we reconstruct depth maps from latent representations with the supervision of labeled source data. As a result, the features learned by the feature extractor possess the merits of both domain invariance and low source risk, and the depth estimator can deal with the domain shift between the source and target domains. We conduct cross-scenario and cross-dataset experiments on the ScanNet and NYU-Depth-v2 datasets to verify the effectiveness of our method and achieve impressive performance.
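The adversarial ingredient can be sketched with a gradient reversal layer (GRL) between the feature extractor and the domain discriminator, as below, so the extractor is pushed toward domain-invariant features. Network sizes and the two-layer discriminator are placeholders rather than the paper's architecture.

```python
# Gradient reversal layer for adversarial domain adaptation; the discriminator
# learns to separate domains while the reversed gradient trains the extractor
# to confuse it. Feature dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

feature_extractor = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
domain_discriminator = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

source_feat = feature_extractor(torch.randn(8, 128))
target_feat = feature_extractor(torch.randn(8, 128))
feats = torch.cat([source_feat, target_feat])
domain_labels = torch.cat([torch.zeros(8, dtype=torch.long), torch.ones(8, dtype=torch.long)])

logits = domain_discriminator(grad_reverse(feats))
loss = nn.CrossEntropyLoss()(logits, domain_labels)
loss.backward()   # gradients flow back reversed into the feature extractor
```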

Citations: 0
Boosting Semi-Supervised Learning with Dual-Threshold Screening and Similarity Learning
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-12 | DOI: 10.1145/3672563
Zechen Liang, Yuan-Gen Wang, Wei Lu, Xiaochun Cao

How to effectively utilize unlabeled data for training is a key problem in Semi-Supervised Learning (SSL). Existing SSL methods often consider only the unlabeled data whose predictions are beyond a fixed threshold (e.g., 0.95) and discard those below 0.95. We argue that these discarded data account for a large proportion of the unlabeled set, consist of hard samples, and will benefit model training if used properly. In this paper, we propose a novel method to take full advantage of the unlabeled data, termed DTS-SimL, which includes two core designs: Dual-Threshold Screening and Similarity Learning. In addition to the fixed threshold, DTS-SimL extracts another class-adaptive threshold from the labeled data. Such a class-adaptive threshold can admit for model training many unlabeled data whose predictions are lower than 0.95 but above the extracted threshold. On the other hand, we design a new similarity loss to perform similarity learning on all highly similar unlabeled data, which can further mine valuable information from the unlabeled data. Finally, for more effective training of DTS-SimL, we construct an overall loss function by assigning four different losses to four different types of data. Extensive experiments are conducted on five benchmark datasets, including CIFAR-10, CIFAR-100, SVHN, Mini-ImageNet, and DomainNet-Real. Experimental results show that the proposed DTS-SimL achieves state-of-the-art classification accuracy. The code is publicly available at https://github.com/GZHU-DVL/DTS-SimL.
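A minimal sketch of dual-threshold screening is given below: keep unlabeled samples whose confidence clears the fixed threshold, plus those that fall below it but clear a class-adaptive threshold estimated from labeled data. The probabilities are random stand-ins, and the percentile rule used to set the adaptive threshold is an assumption for illustration, not the paper's exact rule.

```python
# Dual-threshold screening: a fixed confidence threshold plus per-class
# adaptive thresholds derived from labeled-data confidences (assumed rule).
import torch

def class_adaptive_thresholds(labeled_probs: torch.Tensor,
                              labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    """One threshold per class: the 20th percentile of the confidence the model
    assigns to labeled samples of that class (illustrative assumption)."""
    thr = torch.zeros(n_classes)
    for c in range(n_classes):
        conf_c = labeled_probs[labels == c, c]
        thr[c] = torch.quantile(conf_c, 0.2) if conf_c.numel() else 0.95
    return thr

def screen_unlabeled(unlabeled_probs: torch.Tensor, thr: torch.Tensor,
                     fixed: float = 0.95):
    conf, pseudo = unlabeled_probs.max(dim=-1)
    keep_fixed = conf >= fixed                       # standard fixed-threshold mask
    keep_adaptive = (conf < fixed) & (conf >= thr[pseudo])
    return pseudo, keep_fixed | keep_adaptive

n_classes = 10
labeled_probs = torch.softmax(torch.randn(200, n_classes), dim=-1)
labels = torch.randint(0, n_classes, (200,))
unlabeled_probs = torch.softmax(torch.randn(1000, n_classes), dim=-1)

thr = class_adaptive_thresholds(labeled_probs, labels, n_classes)
pseudo_labels, mask = screen_unlabeled(unlabeled_probs, thr)
print(mask.float().mean())   # fraction of unlabeled data admitted for training
```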

Citations: 0
Deepfake Video Detection Using Facial Feature Points and Ch-Transformer
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-12 | DOI: 10.1145/3672566
Rui Yang, Rushi Lan, Zhenrong Deng, Xiaonan Luo, Xiyan Sun

With the development of Metaverse technology, avatars in the Metaverse face serious security and privacy concerns. Analyzing facial features to distinguish between genuine and manipulated facial videos holds significant research importance for ensuring the authenticity of characters in the virtual world, for mitigating discrimination, and for preventing malicious use of facial data. To address this issue, the Facial Feature Points and Ch-Transformer (FFP-ChT) deepfake video detection model is designed around two clues: facial feature points are distributed differently in real and fake videos, and their displacement distances between frames also differ. The input face video is first processed by the BlazeFace model for face detection, and the detection results are fed into the FaceMesh model to extract 468 facial feature points. The Lucas-Kanade (LK) optical flow method is then used to track these points, a face calibration algorithm is introduced to re-calibrate the facial feature points, and the jitter displacement is calculated by tracking the facial feature points between frames. Finally, a Class-head (Ch) is designed in the transformer, and the facial feature points and their displacements are jointly classified through the Ch-Transformer model. In this way, the designed Ch-Transformer classifier is able to accurately and effectively identify deepfake videos. Experiments on open datasets clearly demonstrate the effectiveness and generalization capabilities of our approach.
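The landmark-displacement cue can be sketched as below: track a set of facial points across two frames with Lucas-Kanade optical flow (OpenCV's calcOpticalFlowPyrLK) and measure the per-point displacement. The frames and points here are synthetic stand-ins; in the pipeline described above the points would come from FaceMesh.

```python
# LK optical-flow tracking of stand-in facial landmarks across two synthetic
# frames, followed by per-point displacement statistics.
import cv2
import numpy as np

rng = np.random.default_rng(0)
prev_frame = (rng.random((256, 256)) * 255).astype(np.uint8)
# Simulate the next frame with a small horizontal shift.
next_frame = np.roll(prev_frame, 2, axis=1)

# Stand-in landmarks (FaceMesh would give 468 of them), shape (N, 1, 2), float32.
pts_prev = rng.uniform(32, 224, size=(468, 1, 2)).astype(np.float32)

pts_next, status, _ = cv2.calcOpticalFlowPyrLK(
    prev_frame, next_frame, pts_prev, None,
    winSize=(21, 21), maxLevel=3)

ok = status.ravel() == 1
displacement = np.linalg.norm(pts_next[ok] - pts_prev[ok], axis=-1).ravel()
# Per-frame jitter statistics like these would feed the Ch-Transformer classifier.
print(displacement.mean(), displacement.std())
```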

Citations: 0
Multi-grained Point Cloud Geometry Compression via Dual-model Prediction with Extended Octree
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-12 | DOI: 10.1145/3671001
Tai Qin, Ge Li, Wei Gao, Shan Liu

The state-of-the-art G-PCC (geometry-based point cloud compression) (Octree) is a fine-grained approach: it uses the octree to partition point clouds into voxels and predicts them based on neighbor occupancy over narrower spaces. However, G-PCC (Octree) is less effective at compressing dense point clouds than multi-grained approaches (such as G-PCC (Trisoup)), which exploit the continuous point distribution in nodes partitioned by a pruned octree over larger spaces. Therefore, we propose a lossy multi-grained compression scheme with an extended octree and dual-model prediction. The extended octree, in which each partitioned node contains intra-block and extra-block points, is applied to address poor prediction (such as overfitting) at the node edges of the octree partition. For the points of each multi-grained node, dual-model prediction fits surfaces and projects residuals onto those surfaces, reducing the projection residuals for efficient 2D compression and lowering fitting complexity. In addition, a hybrid DWT-DCT transform for the 2D projection residuals mitigates the resolution degradation of DWT and the blocking effect of DCT under high compression. Experimental results demonstrate the superior performance of our method over the advanced G-PCC (Octree), achieving BD-rate gains of 55.9% and 45.3% for point-to-point (D1) and point-to-plane (D2) distortions, respectively. Our approach also outperforms G-PCC (Octree) and G-PCC (Trisoup) in subjective evaluation.
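One octree subdivision step, the partitioning that both G-PCC (Octree) and the extended octree build on, can be sketched as below. The overlap margin that lets a child also keep nearby extra-block points is an illustrative stand-in for the extended-node idea, not the authors' exact rule.

```python
# One octree subdivision step over a random point cloud; a nonzero margin makes
# the eight children overlap, mimicking extended nodes with extra-block points.
import numpy as np

def octree_children(points: np.ndarray, center: np.ndarray, half: float,
                    margin: float = 0.0):
    """Split points into 8 child nodes around `center` (parent half-size `half`)."""
    children = []
    for octant in range(8):
        sign = np.array([(octant >> k) & 1 for k in range(3)]) * 2 - 1  # (-1/+1, ...)
        child_center = center + sign * half / 2
        lo = child_center - half / 2 - margin
        hi = child_center + half / 2 + margin
        mask = np.all((points >= lo) & (points < hi), axis=1)
        children.append((child_center, points[mask]))
    return children

rng = np.random.default_rng(0)
pts = rng.random((5000, 3))
children = octree_children(pts, center=np.array([0.5, 0.5, 0.5]), half=0.5, margin=0.02)
print([len(p) for _, p in children])   # occupancy of the 8 (overlapping) children
```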

Citations: 0
On the Security of Selectively Encrypted HEVC Video Bitstreams
IF 5.1 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-12 | DOI: 10.1145/3672568
Chen Chen, Lingfeng Qu, Hadi Amirpour, Xingjun Wang, Christian Timmerer, Zhihong Tian

With the growing applications of video, ensuring its security has become of utmost importance. Selective encryption (SE) has gained significant attention in the field of video content protection due to its compatibility with video codecs, favorable visual distortion, and low time complexity. However, few studies consider the security of SE under cryptographic attacks. To fill this gap, we analyze the security concerns of bitstreams encrypted by SE schemes, propose two known-plaintext attacks (KPAs), and present a corresponding defense against them. To validate the effectiveness of the KPAs, they are applied to attack two existing SE schemes that achieve superior visual degradation in HEVC videos. First, the encrypted bitstreams are generated using the HEVC encoder with SE (HESE). Second, the video sequences are encoded using H.265/HEVC, and the selected syntax elements are recorded during encoding. The recorded syntax elements are then imported into the HEVC decoder using decryption (HDD). By utilizing the encryption parameters and the imported data in the HDD, it becomes possible to reconstruct a significant portion of the original syntax elements before encryption. Finally, the reconstructed syntax elements are compared with the encrypted syntax elements in the HDD, allowing the design of a pseudo-key stream (PKS) through the inverse of the encryption operations. The PKS is used to decrypt the existing SE schemes, and the experimental results provide evidence that the two existing SE schemes are vulnerable to the proposed KPAs. In the case of single-bitstream estimation (SBE), the average correct rate of key stream estimation exceeds 93%. Moreover, with multi-bitstream complementation (MBC), the average estimation accuracy can be further improved to 99%.
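The known-plaintext principle can be sketched with a toy XOR cipher, as below: knowing a plaintext/ciphertext pair reveals a pseudo-key stream that decrypts other data encrypted with the same stream. The byte strings are toy stand-ins for serialized syntax elements, and the XOR model is an assumption about the encryption primitive, not the exact HEVC SE schemes attacked in the paper.

```python
# Toy known-plaintext attack on a stream-cipher-style (XOR) selective encryption:
# recover a pseudo-key stream (PKS) from one known pair, reuse it elsewhere.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key_stream = bytes([0x5A, 0xC3, 0x17, 0x8E] * 4)

known_plain = b"known syntax elm"          # e.g., syntax elements recovered via re-encoding
known_cipher = xor_bytes(known_plain, key_stream)

# Attacker side: derive the pseudo-key stream from the known pair...
pks = xor_bytes(known_plain, known_cipher)

# ...and use it to decrypt another ciphertext produced with the same stream.
secret_plain = b"secret syntax el"
secret_cipher = xor_bytes(secret_plain, key_stream)
recovered = xor_bytes(secret_cipher, pks)

assert recovered == secret_plain
print(recovered)
```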

Citations: 0