VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning

IF 5.2 | Zone 3 (Computer Science) | JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS) | ACM Transactions on Multimedia Computing Communications and Applications | Pub Date: 2024-06-05 | DOI: 10.1145/3671002
Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng

Abstract

Can we predict a person’s appearance solely based on their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generation modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false negative and deviate positive instances in real-world unlabeled data, we not only use voice-face pairs in the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. To address the importance of selecting an appropriate starting point for optimization, we aim to automatically find an optimal starting point by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of our VoiceStyle method in both cross-modal representation learning and voice-based face generation.
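
The first stage described above pairs an instance-level objective (a voice and a face from the same video form a positive pair) with a prototype-level objective (an instance is pulled toward a cluster center in the other modality). The abstract does not give the CMPC loss itself, so the following is only a minimal sketch of a generic InfoNCE-style formulation of those two ideas, not the authors' implementation. The function names, the temperature value, and the assumption that cluster assignments are already available (for example from k-means) are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


def instance_loss(voices, faces, temperature=0.1):
    """Voice-to-face InfoNCE: voices[i] and faces[i] come from the same video.

    Every other face in the batch is treated as a negative (assumption:
    this is the plain instance-level baseline, before any recalibration).
    """
    v = [normalize(x) for x in voices]
    f = [normalize(x) for x in faces]
    total = 0.0
    for i, vi in enumerate(v):
        logits = [dot(vi, fj) / temperature for fj in f]
        total += cross_entropy_row(logits, i)
    return total / len(v)


def prototype_loss(voices, face_prototypes, assignments, temperature=0.1):
    """Pull each voice embedding toward its assigned face cluster center.

    assignments[i] is the index of the face prototype for voices[i]
    (hypothetical: produced beforehand by unsupervised clustering).
    """
    v = [normalize(x) for x in voices]
    p = [normalize(c) for c in face_prototypes]
    total = 0.0
    for vi, k in zip(v, assignments):
        logits = [dot(vi, pk) / temperature for pk in p]
        total += cross_entropy_row(logits, k)
    return total / len(v)
```

With toy two-dimensional embeddings such as `voices = [[1.0, 0.0], [0.0, 1.0]]` and `faces = [[0.9, 0.1], [0.2, 0.8]]`, `instance_loss(voices, faces)` is lower than the loss on the same faces in reversed order, which is the behavior a contrastive alignment objective should show. A full training loop would compute both terms symmetrically (face-to-voice as well) and combine them with a weighting that this sketch does not attempt to guess.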

Source journal
CiteScore: 8.50
Self-citation rate: 5.90%
Articles published: 285
Review time: 7.5 months
Journal description: The ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. TOMM consists primarily of research papers. As an archival journal, it is intended that its papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.