{"title":"NES 视频音乐数据库:与游戏视频配对的符号电子游戏音乐数据集","authors":"Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira","doi":"arxiv-2404.04420","DOIUrl":null,"url":null,"abstract":"Neural models are one of the most popular approaches for music generation,\nyet there aren't standard large datasets tailored for learning music directly\nfrom game data. To address this research gap, we introduce a novel dataset\nnamed NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each\npaired with its original soundtrack in symbolic format (MIDI). NES-VMDB is\nbuilt upon the Nintendo Entertainment System Music Database (NES-MDB),\nencompassing 5,278 music pieces from 397 NES games. Our approach involves\ncollecting long-play videos for 389 games of the original dataset, slicing them\ninto 15-second-long clips, and extracting the audio from each clip.\nSubsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to\nautomatically identify the corresponding piece in the NES-MDB dataset.\nAdditionally, we introduce a baseline method based on the Controllable Music\nTransformer to generate NES music conditioned on gameplay clips. We evaluated\nthis approach with objective metrics, and the results showed that the\nconditional CMT improves musical structural quality when compared to its\nunconditional counterpart. Moreover, we used a neural classifier to predict the\ngame genre of the generated pieces. Results showed that the CMT generator can\nlearn correlations between gameplay videos and game genres, but further\nresearch has to be conducted to achieve human-level performance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos\",\"authors\":\"Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira\",\"doi\":\"arxiv-2404.04420\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Neural models are one of the most popular approaches for music generation,\\nyet there aren't standard large datasets tailored for learning music directly\\nfrom game data. To address this research gap, we introduce a novel dataset\\nnamed NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each\\npaired with its original soundtrack in symbolic format (MIDI). NES-VMDB is\\nbuilt upon the Nintendo Entertainment System Music Database (NES-MDB),\\nencompassing 5,278 music pieces from 397 NES games. Our approach involves\\ncollecting long-play videos for 389 games of the original dataset, slicing them\\ninto 15-second-long clips, and extracting the audio from each clip.\\nSubsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to\\nautomatically identify the corresponding piece in the NES-MDB dataset.\\nAdditionally, we introduce a baseline method based on the Controllable Music\\nTransformer to generate NES music conditioned on gameplay clips. We evaluated\\nthis approach with objective metrics, and the results showed that the\\nconditional CMT improves musical structural quality when compared to its\\nunconditional counterpart. Moreover, we used a neural classifier to predict the\\ngame genre of the generated pieces. 
Results showed that the CMT generator can\\nlearn correlations between gameplay videos and game genres, but further\\nresearch has to be conducted to achieve human-level performance.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2404.04420\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2404.04420","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Neural models are one of the most popular approaches for music generation, yet there are no standard large datasets tailored for learning music directly from game data. To address this research gap, we introduce a novel dataset named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), which encompasses 5,278 music pieces from 397 NES games. Our approach involves collecting long-play videos for 389 games in the original dataset, slicing them into 15-second clips, and extracting the audio from each clip.
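To make this preprocessing step concrete, here is a minimal sketch that cuts a long-play video into 15-second clips and extracts each clip's audio with ffmpeg. The file paths, output format, and sample rate are assumptions for illustration; the abstract does not specify the authors' tooling.

```python
# Sketch of the preprocessing described above: slice a long-play video
# into 15-second clips and save each clip's audio track as WAV.
# Paths, codec choices, and sample rate are hypothetical; ffmpeg and
# ffprobe are assumed to be on PATH.
import subprocess
from pathlib import Path

CLIP_SECONDS = 15

def probe_duration(video: Path) -> float:
    """Return the video duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(video)],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())

def slice_longplay(video: Path, out_dir: Path) -> None:
    """Cut `video` into 15 s clips and extract each clip's audio."""
    out_dir.mkdir(parents=True, exist_ok=True)
    duration = probe_duration(video)
    for i, start in enumerate(range(0, int(duration), CLIP_SECONDS)):
        clip = out_dir / f"{video.stem}_clip{i:05d}.mp4"
        audio = clip.with_suffix(".wav")
        # Re-encode a 15 s window starting at `start` seconds.
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-t",
                        str(CLIP_SECONDS), "-i", str(video), str(clip)],
                       check=True)
        # Drop the video stream (-vn); keep mono 22.05 kHz audio.
        subprocess.run(["ffmpeg", "-y", "-i", str(clip), "-vn",
                        "-ac", "1", "-ar", "22050", str(audio)],
                       check=True)

if __name__ == "__main__":
    slice_longplay(Path("longplays/some_nes_game.mp4"), Path("clips"))
```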
Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in the NES-MDB dataset.
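The abstract names only the technique class ("similar to Shazam"), so the sketch below illustrates that class rather than the authors' implementation: local spectrogram peaks form a constellation map, nearby peak pairs are hashed, and a query clip is matched by voting for the reference track whose hash hits agree on a consistent time offset. All window sizes and thresholds here are placeholder values.

```python
# Minimal constellation-map fingerprinting sketch (the class of
# algorithm Shazam popularized). Parameters are illustrative only.
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def peak_constellation(audio: np.ndarray, sr: int, size: int = 15):
    """Return (freq_bin, frame) coordinates of local spectrogram peaks."""
    _, _, sxx = spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
    log_sxx = np.log1p(sxx)
    # A bin is a peak if it equals the maximum of its neighborhood
    # and is louder than the median (drops near-silent regions).
    peaks = (log_sxx == maximum_filter(log_sxx, size=size))
    peaks &= log_sxx > np.median(log_sxx)
    return np.argwhere(peaks)  # rows of (frequency bin, time frame)

def hashes(peaks: np.ndarray, fan_out: int = 5, max_dt: int = 200):
    """Yield ((f1, f2, dt), anchor_frame) for nearby peak pairs."""
    peaks = peaks[np.argsort(peaks[:, 1])]  # sort by time frame
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = int(t2 - t1)
            if 0 < dt <= max_dt:
                yield (int(f1), int(f2), dt), int(t1)

def match(query_hashes, index):
    """index maps hash -> list of (track_id, anchor_frame).
    Vote on (track, time-offset) pairs; best-supported track wins."""
    votes = {}
    for h, t_query in query_hashes:
        for track_id, t_ref in index.get(h, ()):
            key = (track_id, t_ref - t_query)
            votes[key] = votes.get(key, 0) + 1
    return max(votes, key=votes.get)[0] if votes else None
```

Building the index amounts to running `hashes(peak_constellation(...))` over every reference track and storing `(track_id, anchor_frame)` under each hash; the offset voting then makes matching robust to where in the song a 15-second clip falls.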
Additionally, we introduce a baseline method based on the Controllable Music Transformer (CMT) to generate NES music conditioned on gameplay clips.
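As a schematic of how such conditioning can be wired up (this is a generic prefix-conditioning pattern, not the actual CMT architecture; the class name, feature dimensions, and layer counts are hypothetical):

```python
# Schematic prefix-conditioning sketch, NOT the actual CMT: per-clip
# video features become a few prefix embeddings that a causal
# Transformer attends to while predicting the next MIDI-event token.
import torch
import torch.nn as nn

class ClipConditionedMusicModel(nn.Module):  # hypothetical name
    def __init__(self, vocab_size: int, d_model: int = 512,
                 video_feat_dim: int = 1024, n_prefix: int = 8,
                 max_len: int = 2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        # Project one video feature vector into n_prefix embeddings.
        self.video_proj = nn.Linear(video_feat_dim, n_prefix * d_model)
        self.n_prefix = n_prefix
        layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, tokens):
        # video_feats: (B, video_feat_dim); tokens: (B, T) event ids.
        b = tokens.size(0)
        prefix = self.video_proj(video_feats).view(b, self.n_prefix, -1)
        x = torch.cat([prefix, self.token_emb(tokens)], dim=1)
        x = x + self.pos_emb[:, : x.size(1)]
        # Causal mask: True blocks attention to future positions.
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), 1)
        h = self.backbone(x, mask=mask)
        # Next-token logits for the music positions only.
        return self.head(h[:, self.n_prefix :])
```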
We evaluated this approach with objective metrics, and the results showed that the conditional CMT improves musical structural quality compared to its unconditional counterpart.
Moreover, we used a neural classifier to predict the game genre of the generated pieces. Results showed that the CMT generator can learn correlations between gameplay videos and game genres, but further research is needed to reach human-level performance.
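The abstract does not describe the classifier's architecture, so the following is only a placeholder showing the shape of the task: a small sequence model that maps the MIDI-event tokens of a generated piece to genre logits.

```python
# Placeholder genre classifier over symbolic music: an LSTM over
# MIDI-event tokens with a linear head. The architecture, vocabulary
# size, and genre count are assumed, not taken from the paper.
import torch
import torch.nn as nn

class GenreClassifier(nn.Module):  # hypothetical name
    def __init__(self, vocab_size: int, n_genres: int, d_model: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_genres)

    def forward(self, tokens):  # tokens: (B, T) MIDI-event ids
        _, (h, _) = self.rnn(self.emb(tokens))
        return self.head(h[-1])  # (B, n_genres) genre logits

# Usage: GenreClassifier(512, 8)(torch.randint(0, 512, (4, 64)))
```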