Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing

IF 4.6 · CAS Region 3 (Computer Science) · JCR Q1 (Computer Science, Information Systems) · IEEE Wireless Communications Letters · Publication date: 2024-10-31 · DOI: 10.1109/LWC.2024.3488859
Haonan Tong;Haopeng Li;Hongyang Du;Zhaohui Yang;Changchuan Yin;Dusit Niyato
Abstract

This letter studies an efficient multimodal data communication scheme for video conferencing. In our considered system, a speaker gives a talk to the audience, with talking head video and audio being transmitted. Since the speaker does not frequently change posture and high-fidelity transmission of audio (speech and music) is required, redundant visual video data exists and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces transmitted data by generating talking head video from audio. In particular, full-duration audio and short-duration video data are synchronously transmitted through a wireless channel, with neural networks (NNs) extracting and encoding audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the lip movement video of the speaker. Simulation results show that the proposed Wav2Vid system can reduce the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.

IEEE Wireless Communications Letters, vol. 14, no. 1, pp. 93-97, 2024.
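To give a feel for where savings of this kind come from, the sketch below works through the bandwidth arithmetic of transmitting full-duration audio alongside only a fraction of the video, with the remainder generated at the receiver. All bitrates and the transmitted-video fraction are illustrative assumptions, not figures from the paper; the resulting reduction merely lands in the same ballpark as the reported 83%.

```python
# Hypothetical illustration (assumed numbers, not from the paper): how
# generating talking-head video from audio can cut transmitted data.
AUDIO_KBPS = 64            # assumed audio bitrate (kbit/s)
VIDEO_KBPS = 500           # assumed talking-head video bitrate (kbit/s)
DURATION_S = 60            # one minute of conferencing
VIDEO_SENT_FRACTION = 0.1  # assume only ~10% of the video is transmitted;
                           # the rest is generated from audio at the receiver

def transmitted_kbits(video_fraction: float) -> float:
    """Total kilobits sent when only a fraction of the video is transmitted."""
    audio = AUDIO_KBPS * DURATION_S                   # full-duration audio
    video = VIDEO_KBPS * DURATION_S * video_fraction  # short-duration video
    return audio + video

baseline = transmitted_kbits(1.0)  # conventional: full audio + full video
wav2vid = transmitted_kbits(VIDEO_SENT_FRACTION)
saving = 1 - wav2vid / baseline
print(f"data reduction: {saving:.0%}")  # prints "data reduction: 80%"
```

Under these assumed numbers the reduction is about 80%; the exact figure depends entirely on the codec bitrates and on how much reference video the scheme needs to send.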
Citations: 0

Source Journal

IEEE Wireless Communications Letters (Engineering: Electrical and Electronic Engineering)

CiteScore: 12.30
Self-citation rate: 6.30%
Annual publications: 481

Journal description: IEEE Wireless Communications Letters publishes short papers in a rapid publication cycle on advances in the state-of-the-art of wireless communications. Both theoretical contributions (including new techniques, concepts, and analyses) and practical contributions (including system experiments and prototypes, and new applications) are encouraged. This journal focuses on the physical layer and the link layer of wireless communication systems.