VoiceStyle：通过跨模态原型对比学习进行基于语音的人脸生成

IF 5.6 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Multimedia Computing Communications and Applications Pub Date : 2024-06-05 DOI:10.1145/3671002

Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng

{"title":"VoiceStyle：通过跨模态原型对比学习进行基于语音的人脸生成","authors":"Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng","doi":"10.1145/3671002","DOIUrl":null,"url":null,"abstract":"<p>Can we predict a person’s appearance solely based on their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generation modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false negative and deviate positive instances in real-world unlabeled data, we not only use voice-face pairs in the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. To address the importance of selecting an appropriate starting point for optimization, we aim to automatically find an optimal starting point by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of our VoiceStyle method in both cross-modal representation learning and voice-based face generation.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"67 1","pages":""},"PeriodicalIF":5.6000,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning\",\"authors\":\"Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng\",\"doi\":\"10.1145/3671002\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Can we predict a person’s appearance solely based on their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generation modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false negative and deviate positive instances in real-world unlabeled data, we not only use voice-face pairs in the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. To address the importance of selecting an appropriate starting point for optimization, we aim to automatically find an optimal starting point by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of our VoiceStyle method in both cross-modal representation learning and voice-based face generation.</p>\",\"PeriodicalId\":50937,\"journal\":{\"name\":\"ACM Transactions on Multimedia Computing Communications and Applications\",\"volume\":\"67 1\",\"pages\":\"\"},\"PeriodicalIF\":5.6000,\"publicationDate\":\"2024-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Multimedia Computing Communications and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3671002\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3671002","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

我们能仅根据声音预测一个人的外貌吗？本文通过从未曾听过的声音片段生成人脸来探讨这一问题。我们提出的方法 VoiceStyle 将跨模态表征学习与生成建模相结合，使我们能够将语音语义线索纳入生成的人脸中。在第一阶段，我们引入了跨模态原型对比学习（CMPC），以建立声音与面部之间的关联。由于认识到真实世界无标记数据中存在假阴性和偏差阳性实例，我们不仅使用同一视频中的语音-人脸对，还通过无监督聚类构建了额外的语义阳性对，从而加强了学习过程。此外，我们还根据实例与另一种模式的聚类中心的相似性对实例进行重新校准。在第二阶段，我们利用 StyleGAN 强大的生成能力来生成人脸。我们以学习到的语音-人脸对齐为指导，优化 StyleGAN 潜在空间中的潜在代码。为了解决选择合适的优化起点这一重要问题，我们的目标是利用从语音输入中获得的人脸原型自动找到最佳起点。整个管道可以自监督方式实现，无需人工标注注释。通过大量实验，我们证明了 VoiceStyle 方法在跨模态表征学习和基于语音的人脸生成方面的有效性和性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning

Can we predict a person’s appearance solely based on their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generation modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false negative and deviate positive instances in real-world unlabeled data, we not only use voice-face pairs in the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. To address the importance of selecting an appropriate starting point for optimization, we aim to automatically find an optimal starting point by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of our VoiceStyle method in both cross-modal representation learning and voice-based face generation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Multimedia Computing Communications and Applications 工程技术-计算机：理论方法

CiteScore

8.50

自引率

5.90%

发文量

285

审稿时长

7.5 months

期刊介绍： The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It is soliciting paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print form and digital form. The Journal is published quarterly; with roughly 7 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure a timely publication. The transactions consists primarily of research papers. This is an archival journal and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.