Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing

IF 4.6 · CAS Region 3 (Computer Science) · JCR Q1 (Computer Science, Information Systems) · IEEE Wireless Communications Letters · Publication date: 2024-10-31 · DOI: 10.1109/LWC.2024.3488859
Haonan Tong;Haopeng Li;Hongyang Du;Zhaohui Yang;Changchuan Yin;Dusit Niyato
Abstract

This letter studies an efficient multimodal data communication scheme for video conferencing. In our considered system, a speaker gives a talk to the audience, with talking head video and audio being transmitted. Since the speaker does not frequently change posture and high-fidelity transmission of audio (speech and music) is required, redundant visual video data exists and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces transmitted data by generating talking head video from audio. In particular, full-duration audio and short-duration video data are synchronously transmitted through a wireless channel, with neural networks (NNs) extracting and encoding audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the lip movement video of the speaker. Simulation results show that the proposed Wav2Vid system can reduce the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.

IEEE Wireless Communications Letters, vol. 14, no. 1, pp. 93-97, 2024.
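To give a feel for where savings of this kind come from, the sketch below works through the bandwidth arithmetic of transmitting full-duration audio alongside only a fraction of the video, with the remainder generated at the receiver. All bitrates and the transmitted-video fraction are illustrative assumptions, not figures from the paper; the resulting reduction merely lands in the same ballpark as the reported 83%.

```python
# Hypothetical illustration (assumed numbers, not from the paper): how
# generating talking-head video from audio can cut transmitted data.
AUDIO_KBPS = 64            # assumed audio bitrate (kbit/s)
VIDEO_KBPS = 500           # assumed talking-head video bitrate (kbit/s)
DURATION_S = 60            # one minute of conferencing
VIDEO_SENT_FRACTION = 0.1  # assume only ~10% of the video is transmitted;
                           # the rest is generated from audio at the receiver

def transmitted_kbits(video_fraction: float) -> float:
    """Total kilobits sent when only a fraction of the video is transmitted."""
    audio = AUDIO_KBPS * DURATION_S                   # full-duration audio
    video = VIDEO_KBPS * DURATION_S * video_fraction  # short-duration video
    return audio + video

baseline = transmitted_kbits(1.0)  # conventional: full audio + full video
wav2vid = transmitted_kbits(VIDEO_SENT_FRACTION)
saving = 1 - wav2vid / baseline
print(f"data reduction: {saving:.0%}")  # prints "data reduction: 80%"
```

Under these assumed numbers the reduction is about 80%; the exact figure depends entirely on the codec bitrates and on how much reference video the scheme needs to send.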
Citations: 0

Source Journal

IEEE Wireless Communications Letters (Engineering: Electrical and Electronic Engineering)

CiteScore: 12.30
Self-citation rate: 6.30%
Annual publications: 481

Journal description: IEEE Wireless Communications Letters publishes short papers in a rapid publication cycle on advances in the state-of-the-art of wireless communications. Both theoretical contributions (including new techniques, concepts, and analyses) and practical contributions (including system experiments and prototypes, and new applications) are encouraged. This journal focuses on the physical layer and the link layer of wireless communication systems.