High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields

IEEE Transactions on Visualization and Computer Graphics (IF 6.5) | Pub Date: 2024-10-31 | DOI: 10.1109/TVCG.2024.3488960

Muyu Wang;Sanyuan Zhao;Xingping Dong;Jianbing Shen

Volume 31, Issue 9, pages 6022-6035. IEEE Xplore: https://ieeexplore.ieee.org/document/10740602/

Citations: 0

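Abstract

In this paper, we propose HH-NeRF, a novel rendering framework based on neural radiance fields (NeRF) that generates high-resolution, audio-driven talking portrait videos with high fidelity and fast rendering. The framework consists of a detail-aware NeRF module and an efficient conditional super-resolution module. First, the detail-aware NeRF efficiently generates a high-fidelity, low-resolution talking head by using encoded volume density estimation and audio-eye-aware color calculation. This module captures natural eye blinks and high-frequency details while keeping rendering time similar to that of previous fast methods. Second, we present an efficient conditional super-resolution module for dynamic scenes that generates the high-resolution portrait directly from the low-resolution head. By incorporating prior information such as the depth map and audio features, this module can use a lightweight network to efficiently generate realistic, sharp high-resolution videos. Extensive experiments demonstrate that, compared with state-of-the-art methods, our method generates sharper and higher-fidelity talking portraits on high-resolution (900 × 900) videos. Our code is available at https://github.com/muyuWang/HHNeRF.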
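The abstract refers to volume density estimation and color calculation but does not give the formulation. As background only, the sketch below shows the standard NeRF compositing step that turns per-sample densities and colors along a camera ray into one pixel color. It is not HH-NeRF's encoded density estimation or audio-eye-aware color calculation; the function name `composite_ray` and its inputs are hypothetical, and in an audio-driven model the densities and colors would presumably be predicted by the network from position, audio, and eye features.

```python
import math


def composite_ray(sigmas, deltas, colors):
    """Composite samples along one ray with the standard NeRF quadrature.

    sigmas: volume density at each sample (assumed inputs, not HH-NeRF's API)
    deltas: distance between consecutive samples
    colors: (r, g, b) tuple at each sample
    Returns the pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, delta, rgb in zip(sigmas, deltas, colors):
        # Opacity contributed by this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for ch in range(3):
            pixel[ch] += weight * rgb[ch]
        # Light remaining after passing through this segment.
        transmittance *= 1.0 - alpha
    return pixel, weights


# A ray that crosses empty space (density 0) and then hits a dense red surface.
pixel, weights = composite_ray(
    sigmas=[0.0, 1e6],
    deltas=[0.1, 0.1],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)],
)
```

The empty sample receives zero weight, so the pixel takes the color of the dense surface behind it. The same weights, applied to sample depths instead of colors, give a depth map of the kind the abstract lists as prior information for the super-resolution module.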




Source journal

IEEE Transactions on Visualization and Computer Graphics

Self-citation rate

0.00%

Number of articles published