Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
arXiv:2407.10387 · arXiv - CS - Sound · 2024-07-15

Video-to-audio (V2A) generation leverages visual-only video features to
render plausible sounds that match the scene. Importantly, the onsets of the generated sounds should match the visual actions they are aligned with; otherwise, unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to focus solely on improving synchronization. In this work, we propose a V2A generative model, named
MaskVAT, that interconnects a full-band high-quality general audio codec with a
sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with suitable pre-trained audio-visual features and a sequence-to-sequence parallel structure, we yield highly synchronized results while remaining competitive with the state of the art in non-codec generative audio models. Sample videos and generated audio are available at https://maskvat.github.io .
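To make the general idea concrete, the sketch below illustrates masked generative modeling over discrete audio codec tokens with time-aligned video conditioning. It is a minimal, hypothetical example of the technique the abstract describes, not the MaskVAT implementation: all module names, dimensions, the codec vocabulary size, and the assumption that video features are already resampled to the codec token rate are our own illustrative choices.

```python
# Minimal sketch (PyTorch) of masked generative V2A modeling over codec tokens.
# Not the authors' code; shapes, names, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class MaskedV2ATransformer(nn.Module):
    def __init__(self, codec_vocab=1024, d_model=512, video_dim=768, n_layers=6):
        super().__init__()
        self.mask_id = codec_vocab                      # extra index reserved for [MASK]
        self.token_emb = nn.Embedding(codec_vocab + 1, d_model)
        self.video_proj = nn.Linear(video_dim, d_model) # project video features to model width
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codec_vocab)

    def forward(self, audio_tokens, video_feats, mask_ratio=0.5):
        # audio_tokens: (B, T) discrete codec indices
        # video_feats:  (B, T, video_dim), assumed resampled to the codec token rate
        B, T = audio_tokens.shape
        mask = torch.rand(B, T, device=audio_tokens.device) < mask_ratio
        masked = audio_tokens.masked_fill(mask, self.mask_id)
        # Sum token embeddings with time-aligned video conditioning (parallel seq-to-seq)
        x = self.token_emb(masked) + self.video_proj(video_feats)
        logits = self.head(self.encoder(x))             # (B, T, codec_vocab)
        # Cross-entropy only on masked positions, as in masked generative training
        loss = nn.functional.cross_entropy(logits[mask], audio_tokens[mask])
        return loss, logits
```

At inference, such a model would typically start from a fully masked token sequence and fill it in over a few parallel refinement steps before the codec decoder renders the waveform, which is what allows generating an entire, temporally aligned sequence rather than decoding token by token.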