PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
{"title":"PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing","authors":"Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley","doi":"arxiv-2409.10831","DOIUrl":null,"url":null,"abstract":"The recent explosion of generative AI-Music systems has raised numerous\nconcerns over data copyright, licensing music from musicians, and the conflict\nbetween open-source AI and large prestige companies. Such issues highlight the\nneed for publicly available, copyright-free musical data, in which there is a\nlarge shortage, particularly for symbolic music data. To alleviate this issue,\nwe present PDMX: a large-scale open-source dataset of over 250K public domain\nMusicXML scores collected from the score-sharing forum MuseScore, making it the\nlargest available copyright-free symbolic music dataset to our knowledge. PDMX\nadditionally includes a wealth of both tag and user interaction metadata,\nallowing us to efficiently analyze the dataset and filter for high quality\nuser-generated scores. Given the additional metadata afforded by our data\ncollection process, we conduct multitrack music generation experiments\nevaluating how different representative subsets of PDMX lead to different\nbehaviors in downstream models, and how user-rating statistics can be used as\nan effective measure of data quality. Examples can be found at\nhttps://pnlong.github.io/PDMX.demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10831","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, in which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
PDMX:用于符号音乐处理的大规模公共领域音乐 XML 数据集
最近,人工智能音乐生成系统的爆炸式发展引发了许多关于数据版权、音乐家音乐授权以及开源人工智能与大型知名公司之间冲突的问题。这些问题凸显了对可公开获取、无版权限制的音乐数据的需求,而这方面的数据尤其是符号音乐数据非常缺乏。为了缓解这一问题,我们推出了 PDMX:一个大型开源数据集,其中包含从乐谱共享论坛 MuseScore 收集的超过 25 万个公共领域的 MusicXML 乐谱,是我们所知的最大的可用无版权符号音乐数据集。PDMX 还包括大量标签和用户交互元数据,使我们能够高效地分析数据集,并筛选出高质量的用户生成乐谱。鉴于我们的数据收集过程提供了额外的元数据,我们进行了多轨音乐生成实验,评估 PDMX 的不同代表性子集如何导致下游模型的不同行为,以及用户评分统计如何用作数据质量的有效衡量标准。示例可在https://pnlong.github.io/PDMX.demo/。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems Conformal Prediction for Manifold-based Source Localization with Gaussian Processes Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1