Audio-Video Analysis Method of Public Speaking Videos to Detect Deepfake Threat

Robert Wolański, Karol Jędrasiak
{"title":"Audio-Video Analysis Method of Public Speaking Videos to Detect Deepfake Threat","authors":"Robert Wolański, Karol Jędrasiak","doi":"10.12845/sft.62.2.2023.10","DOIUrl":null,"url":null,"abstract":"Aim: The purpose of the article is to present the hypothesis that the use of discrepancies in audiovisual materials can significantly increase the effectiveness of detecting various types of deepfake and related threats. In order to verify this hypothesis, the authors proposed a new method that reveals inconsistencies in both multiple modalities simultaneously and within individual modalities separately, enabling them to effectively distinguish between authentic and altered public speaking videos. Project and methods: The proposed approach is to integrate audio and visual signals in a so-called fine-grained manner, and then carry out binary classification processes based on calculated adjustments to the classification results of each modality. The method has been tested using various network architectures, in particular Capsule networks – for deep anomaly detection and Swin Transformer – for image classification. Pre-processing included frame extraction and face detection using the MTCNN algorithm, as well as conversion of audio to mel spectrograms to better reflect human auditory perception. The proposed technique was tested on multimodal deepfake datasets, namely FakeAVCeleb and TMC, along with a custom dataset containing 4,700 recordings. The method has shown high performance in identifying deepfake threats in various test scenarios. Results: The method proposed by the authors achieved better AUC and accuracy compared to other reference methods, confirming its effectiveness in the analysis of multimodal artefacts. The test results confirm that it is effective in detecting modified videos in a variety of test scenarios which can be considered an advance over existing deepfake detection techniques. The results highlight the adaptability of the method in various architectures of feature extraction networks. Conclusions: The presented method of audiovisual deepfake detection uses fine inconsistencies of multimodal features to distinguish whether the material is authentic or synthetic. It is distinguished by its ability to point out inconsistencies in different types of deepfakes and, within each individual modality, can effectively distinguish authentic content from manipulated counterparts. The adaptability has been confirmed by the successful application of the method in various feature extraction network architectures. Moreover, its effectiveness has been proven in rigorous tests on two different audiovisual deepfake datasets. Keywords: analysis of audio-video stream, detection of deepfake threats, analysis of public speeches","PeriodicalId":113945,"journal":{"name":"Safety & Fire Technology","volume":"107 5","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Safety & Fire Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12845/sft.62.2.2023.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Aim: The purpose of the article is to present the hypothesis that exploiting discrepancies in audiovisual materials can significantly increase the effectiveness of detecting various types of deepfakes and related threats. To verify this hypothesis, the authors propose a new method that reveals inconsistencies both across multiple modalities simultaneously and within each modality separately, enabling it to distinguish effectively between authentic and altered public speaking videos.

Project and methods: The proposed approach integrates audio and visual signals in a fine-grained manner and then performs binary classification based on calculated adjustments to the classification results of each modality. The method was tested with various network architectures, in particular Capsule networks for deep anomaly detection and the Swin Transformer for image classification. Pre-processing included frame extraction and face detection using the MTCNN algorithm, as well as conversion of audio to mel spectrograms to better reflect human auditory perception. The technique was evaluated on the multimodal deepfake datasets FakeAVCeleb and TMC, along with a custom dataset containing 4,700 recordings, and showed high performance in identifying deepfake threats across a variety of test scenarios.

Results: The proposed method achieved better AUC and accuracy than the reference methods, confirming its effectiveness in the analysis of multimodal artefacts. The test results confirm that it detects modified videos effectively in a variety of test scenarios, which can be considered an advance over existing deepfake detection techniques. The results also highlight the method's adaptability across feature extraction network architectures.

Conclusions: The presented method of audiovisual deepfake detection uses fine inconsistencies of multimodal features to distinguish authentic material from synthetic material. It is distinguished by its ability to point out inconsistencies in different types of deepfakes and, within each individual modality, to separate authentic content from manipulated counterparts. Its adaptability is confirmed by successful application with various feature extraction network architectures, and its effectiveness has been demonstrated in tests on two different audiovisual deepfake datasets.

Keywords: analysis of audio-video stream, detection of deepfake threats, analysis of public speeches
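The pre-processing step described in the abstract (frame extraction followed by MTCNN face detection) can be illustrated with a minimal sketch. The paper names only the MTCNN algorithm, not an implementation, so the library choices here (OpenCV for decoding, the `facenet-pytorch` MTCNN, a 224-pixel crop, sampling every tenth frame) are assumptions for illustration:

```python
# Sketch: sample frames from a video and crop the speaker's face with MTCNN.
# Library and parameter choices are illustrative assumptions, not the paper's.
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=20)  # crop size/margin are guesses

def extract_face_crops(video_path: str, every_n_frames: int = 10):
    """Yield aligned face crops (as tensors) from every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # OpenCV decodes frames as BGR; MTCNN expects RGB input.
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            face = mtcnn(Image.fromarray(frame_rgb))  # None if no face found
            if face is not None:
                yield face
        idx += 1
    cap.release()
```

Sampling a subset of frames keeps the visual branch tractable for long public speaking recordings; frames without a detected face are simply skipped.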
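The audio branch converts the speech track to a mel spectrogram, which maps frequency onto the perceptually motivated mel scale, the "human auditory perception" rationale the abstract gives. A minimal sketch with `librosa` follows; the sample rate, FFT size, and number of mel bands are illustrative assumptions, as the paper does not specify them:

```python
# Sketch: convert an audio track to a log-scaled mel spectrogram.
# All parameter values below are assumptions chosen for illustration.
import librosa
import numpy as np

def audio_to_log_mel(audio_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (n_mels, time) log-mel spectrogram in decibels."""
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
    )
    # dB scaling compresses dynamic range, which image-style classifiers
    # (e.g. ones treating the spectrogram as a 2-D input) typically expect.
    return librosa.power_to_db(mel, ref=np.max)
```

Treating the log-mel output as a 2-D image is what allows the same kinds of feature extraction networks used on video frames to be applied to the audio modality.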
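The abstract describes binary classification "based on calculated adjustments to the classification results of each modality" but does not spell out the adjustment rule. The sketch below shows one plausible reading, under loudly stated assumptions: each modality keeps its own real/fake head, and a small network over the concatenated features produces weights that adjust the two per-modality scores into a single decision. The class and weighting scheme are hypothetical, not the authors' published architecture:

```python
# Hypothetical sketch of score-level fusion with per-modality adjustment.
# The actual adjustment rule in the paper is not specified; this weighting
# scheme is purely illustrative.
import torch
import torch.nn as nn

class AdjustedFusionClassifier(nn.Module):
    def __init__(self, audio_dim: int, video_dim: int):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, 1)  # audio-only real/fake logit
        self.video_head = nn.Linear(video_dim, 1)  # video-only real/fake logit
        # Fused features yield one weight per modality (the "adjustment").
        self.adjust = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor):
        a_logit = self.audio_head(audio_feat)
        v_logit = self.video_head(video_feat)
        w = self.adjust(torch.cat([audio_feat, video_feat], dim=-1))
        # Final logit: adjustment-weighted sum of the per-modality logits.
        fused = w[:, :1] * a_logit + w[:, 1:] * v_logit
        return fused, a_logit, v_logit
```

Keeping the per-modality logits as outputs matches the abstract's claim that the method flags inconsistencies within each individual modality as well as across them. The input features would come from whatever extraction backbones are plugged in, e.g. a Swin Transformer over face crops and a capsule-style encoder over mel spectrograms, consistent with the architectures the abstract names.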