Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
{"title":"VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos","authors":"Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang","doi":"arxiv-2409.07450","DOIUrl":null,"url":null,"abstract":"We present a framework for learning to generate background music from video\ninputs. Unlike existing works that rely on symbolic musical annotations, which\nare limited in quantity and diversity, our method leverages large-scale web\nvideos accompanied by background music. This enables our model to learn to\ngenerate realistic and diverse music. To accomplish this goal, we develop a\ngenerative video-music Transformer with a novel semantic video-music alignment\nscheme. Our model uses a joint autoregressive and contrastive learning\nobjective, which encourages the generation of music aligned with high-level\nvideo content. We also introduce a novel video-beat alignment scheme to match\nthe generated music beats with the low-level motions in the video. Lastly, to\ncapture fine-grained visual cues in a video needed for realistic background\nmusic generation, we introduce a new temporal video encoder architecture,\nallowing us to efficiently process videos consisting of many densely sampled\nframes. We train our framework on our newly curated DISCO-MV dataset,\nconsisting of 2.2M video-music samples, which is orders of magnitude larger\nthan any prior datasets used for video music generation. Our method outperforms\nexisting approaches on the DISCO-MV and MusicCaps datasets according to various\nmusic generation evaluation metrics, including human evaluation. Results are\navailable at https://genjib.github.io/project_page/VMAs/index.html","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07450","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
We present a framework for learning to generate background music from video
inputs. Unlike existing works that rely on symbolic musical annotations, which
are limited in quantity and diversity, our method leverages large-scale web
videos accompanied by background music. This enables our model to learn to
generate realistic and diverse music. To accomplish this goal, we develop a
generative video-music Transformer with a novel semantic video-music alignment
scheme. Our model uses a joint autoregressive and contrastive learning
objective, which encourages the generation of music aligned with high-level
video content. We also introduce a novel video-beat alignment scheme to match
the generated music beats with the low-level motions in the video. Lastly, to
capture fine-grained visual cues in a video needed for realistic background
music generation, we introduce a new temporal video encoder architecture,
allowing us to efficiently process videos consisting of many densely sampled
frames. We train our framework on our newly curated DISCO-MV dataset,
consisting of 2.2M video-music samples, which is orders of magnitude larger
than any prior datasets used for video music generation. Our method outperforms
existing approaches on the DISCO-MV and MusicCaps datasets according to various
music generation evaluation metrics, including human evaluation. Results are
available at https://genjib.github.io/project_page/VMAs/index.html