{"title":"Multi-Reference Generative Face Video Compression with Contrastive Learning","authors":"Goluck Konuko, Giuseppe Valenzise","doi":"arxiv-2409.01029","DOIUrl":null,"url":null,"abstract":"Generative face video coding (GFVC) has been demonstrated as a potential\napproach to low-latency, low bitrate video conferencing. GFVC frameworks\nachieve an extreme gain in coding efficiency with over 70% bitrate savings when\ncompared to conventional codecs at bitrates below 10kbps. In recent MPEG/JVET\nstandardization efforts, all the information required to reconstruct video\nsequences using GFVC frameworks are adopted as part of the supplemental\nenhancement information (SEI) in existing compression pipelines. In light of\nthis development, we aim to address a challenge that has been weakly addressed\nin prior GFVC frameworks, i.e., reconstruction drift as the distance between\nthe reference and target frames increases. This challenge creates the need to\nupdate the reference buffer more frequently by transmitting more Intra-refresh\nframes, which are the most expensive element of the GFVC bitstream. To overcome\nthis problem, we propose instead multiple reference animation as a robust\napproach to minimizing reconstruction drift, especially when used in a\nbi-directional prediction mode. Further, we propose a contrastive learning\nformulation for multi-reference animation. We observe that using a contrastive\nlearning framework enhances the representation capabilities of the animation\ngenerator. The resulting framework, MRDAC (Multi-Reference Deep Animation\nCodec) can therefore be used to compress longer sequences with fewer reference\nframes or achieve a significant gain in reconstruction accuracy at comparable\nbitrates to previous frameworks. Quantitative and qualitative results show\nsignificant coding and reconstruction quality gains compared to previous GFVC\nmethods, and more accurate animation quality in presence of large pose and\nfacial expression changes.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Generative face video coding (GFVC) has been demonstrated as a promising
approach to low-latency, low-bitrate video conferencing. GFVC frameworks
achieve substantial gains in coding efficiency, with over 70% bitrate savings
compared to conventional codecs at bitrates below 10 kbps. In recent MPEG/JVET
standardization efforts, all the information required to reconstruct video
sequences with GFVC frameworks is adopted as part of the supplemental
enhancement information (SEI) in existing compression pipelines.
In light of this development, we aim to address a challenge that has been only
weakly addressed in prior GFVC frameworks: reconstruction drift as the distance
between the reference and target frames increases. This drift forces the
encoder to update the reference buffer more frequently by transmitting more
intra-refresh frames, which are the most expensive element of the GFVC
bitstream.
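To make the drift problem concrete, the toy sketch below is our illustration, not the paper's algorithm: it models a single-reference encoder whose reconstruction error grows linearly with the reference-to-target distance, so that a new intra-coded reference must be transmitted whenever the error exceeds a quality budget. The linear drift model and its constants are hypothetical.

```python
# Hypothetical illustration (not the paper's algorithm): under a linear drift
# model, a single-reference GFVC encoder must periodically re-send a costly
# intra-coded reference frame to keep reconstruction quality within budget.

def count_intra_refreshes(num_frames, drift_per_frame=0.8, max_drift=20.0):
    """Count the intra-refresh frames a toy single-reference encoder needs."""
    refreshes, last_ref = 0, 0
    for t in range(num_frames):
        drift = drift_per_frame * (t - last_ref)  # grows with reference distance
        if drift > max_drift:                     # quality budget exceeded
            last_ref = t                          # transmit a fresh reference
            refreshes += 1                        # the costliest bitstream element
    return refreshes

print(count_intra_refreshes(250))  # -> 9: one costly refresh every 26 frames
```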
To overcome this problem, we instead propose multi-reference animation as a
robust approach to minimizing reconstruction drift, especially when used in a
bi-directional prediction mode. Further, we propose a contrastive learning
formulation for multi-reference animation, and we observe that this contrastive
framework enhances the representation capabilities of the animation generator.
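The abstract does not spell out the contrastive objective; the sketch below is a minimal InfoNCE-style loss in PyTorch, assuming (hypothetically) that embeddings of reconstructed frames are matched to embeddings of the corresponding ground-truth frames, with the other frames in the batch serving as negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(recon_feats, target_feats, temperature=0.07):
    """InfoNCE-style loss over (batch, dim) embeddings of paired frames."""
    z1 = F.normalize(recon_feats, dim=1)   # unit-norm reconstruction embeddings
    z2 = F.normalize(target_feats, dim=1)  # unit-norm ground-truth embeddings
    logits = z1 @ z2.t() / temperature     # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```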
The resulting framework, MRDAC (Multi-Reference Deep Animation Codec), can
therefore compress longer sequences with fewer reference frames, or achieve a
significant gain in reconstruction accuracy at bitrates comparable to those of
previous frameworks. Quantitative and qualitative results show significant
gains in coding efficiency and reconstruction quality over previous GFVC
methods, and more accurate animation in the presence of large pose and facial
expression changes.
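As a rough sketch of how multiple references might be combined at decode time, the code below animates the target frame from each reference (e.g., one past and one future frame in bi-directional mode) and fuses the predictions with pose-similarity weights. The generator interface, keypoint representation, and softmax fusion are assumptions for illustration; the paper's actual fusion mechanism may differ.

```python
import torch

def animate_multi_reference(references, ref_kps, target_kp, generator):
    """Fuse animations of one target frame from several reference frames.

    references: list of (C, H, W) decoded reference frames
    ref_kps:    list of keypoint tensors, one per reference
    target_kp:  keypoints decoded for the target frame
    generator:  callable(ref, ref_kp, target_kp) -> predicted frame (assumed API)
    """
    preds, scores = [], []
    for ref, kp in zip(references, ref_kps):
        preds.append(generator(ref, kp, target_kp))  # warp reference toward target
        scores.append(-torch.norm(kp - target_kp))   # closer pose -> larger weight
    weights = torch.softmax(torch.stack(scores), dim=0)
    return sum(w * p for w, p in zip(weights, preds))  # weighted fusion

# Smoke test with a stand-in generator that just returns the reference frame.
dummy_gen = lambda ref, kp, tkp: ref
out = animate_multi_reference([torch.rand(3, 64, 64)] * 2,
                              [torch.rand(10, 2)] * 2,
                              torch.rand(10, 2), dummy_gen)
```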