{"title":"VidMuse:具有长期短期建模功能的简单视频音乐生成框架","authors":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo","doi":"arxiv-2406.04321","DOIUrl":null,"url":null,"abstract":"In this work, we systematically study music generation conditioned solely on\nthe video. First, we present a large-scale dataset comprising 190K video-music\npairs, including various genres such as movie trailers, advertisements, and\ndocumentaries. Furthermore, we propose VidMuse, a simple framework for\ngenerating music aligned with video inputs. VidMuse stands out by producing\nhigh-fidelity music that is both acoustically and semantically aligned with the\nvideo. By incorporating local and global visual cues, VidMuse enables the\ncreation of musically coherent audio tracks that consistently match the video\ncontent through Long-Short-Term modeling. Through extensive experiments,\nVidMuse outperforms existing models in terms of audio quality, diversity, and\naudio-visual alignment. The code and datasets will be available at\nhttps://github.com/ZeyueT/VidMuse/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling\",\"authors\":\"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo\",\"doi\":\"arxiv-2406.04321\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we systematically study music generation conditioned solely on\\nthe video. First, we present a large-scale dataset comprising 190K video-music\\npairs, including various genres such as movie trailers, advertisements, and\\ndocumentaries. 
Furthermore, we propose VidMuse, a simple framework for\\ngenerating music aligned with video inputs. VidMuse stands out by producing\\nhigh-fidelity music that is both acoustically and semantically aligned with the\\nvideo. By incorporating local and global visual cues, VidMuse enables the\\ncreation of musically coherent audio tracks that consistently match the video\\ncontent through Long-Short-Term modeling. Through extensive experiments,\\nVidMuse outperforms existing models in terms of audio quality, diversity, and\\naudio-visual alignment. The code and datasets will be available at\\nhttps://github.com/ZeyueT/VidMuse/.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.04321\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.04321","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
In this work, we systematically study music generation conditioned solely on
the video. First, we present a large-scale dataset comprising 190K video-music
pairs, including various genres such as movie trailers, advertisements, and
documentaries. Furthermore, we propose VidMuse, a simple framework for
generating music aligned with video inputs. VidMuse stands out by producing
high-fidelity music that is both acoustically and semantically aligned with the
video. By incorporating local and global visual cues, VidMuse enables the
creation of musically coherent audio tracks that consistently match the video
content through Long-Short-Term modeling. Extensive experiments show that
VidMuse outperforms existing models in audio quality, diversity, and
audio-visual alignment. The code and datasets will be available at
https://github.com/ZeyueT/VidMuse/.
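The Long-Short-Term idea above — conditioning generation on both a local window of visual cues and a global summary of the whole video — can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the function name, the mean-pooled fusion, and the window size are all assumptions made for illustration.

```python
import numpy as np

def long_short_term_features(frames, t, window=4):
    """Fuse local and global visual context for generation step t.

    frames: (T, D) array of per-frame visual features (hypothetical).
    Returns a (2*D,) conditioning vector: a short-term summary of the
    frames near step t concatenated with a long-term summary of the
    whole video, mirroring the local/global cue fusion described above.
    """
    lo = max(0, t - window)
    short_term = frames[lo:t + 1].mean(axis=0)  # local window around t
    long_term = frames.mean(axis=0)             # global video summary
    return np.concatenate([short_term, long_term])

# Toy usage: 16 frames with 8-dimensional features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((16, 8))
cond = long_short_term_features(frames, t=5)
print(cond.shape)  # (16,)
```

In a real system the two summaries would come from learned encoders (e.g. attention over frames) and condition an audio decoder at each step; mean pooling stands in here only to make the local/global split concrete.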