Title: Automatic Music Composition with Transformers
Authors: Yi-Hsuan Yang
DOI: 10.1145/3463946.3469111 (https://doi.org/10.1145/3463946.3469111)
Journal: International Journal of Mobile Computing and Multimedia Communications
Publication date: 2021-08-21
Citations: 1
Abstract
In this talk, I will first give a brief overview of recent deep learning-based approaches to automatic music generation in the symbolic domain. I will then discuss our own research, which employs self-attention-based architectures, a.k.a. Transformers, for symbolic music generation. A naive approach with Transformers would treat music as a sequence of text-like tokens. However, our research demonstrates that Transformers can generate higher-quality music when music is not treated simply as text. In particular, our Pop Music Transformer model, published at ACM Multimedia 2020, employs a novel beat-based representation of music that informs self-attention models of the bar-beat metrical structure present in music. This approach greatly improves the rhythmic structure of the generated music. A more recent model that we published at AAAI 2021, the Compound Word Transformer, exploits the fact that a musical note is associated with multiple attributes, such as pitch, duration, and velocity. Instead of predicting the tokens corresponding to these attributes one by one at inference time, the Compound Word Transformer predicts them jointly, greatly reducing the sequence length needed to model a full-length song and making it easier to model the dependencies among these attributes.
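As a rough illustration of the two ideas in the abstract, the Python sketch below encodes a few notes first as a flat, beat-aware token sequence (one token per attribute, with explicit bar and position markers) and then as grouped steps in which all attributes of a note are emitted together. This is a simplified sketch, not the published Pop Music Transformer or Compound Word Transformer vocabularies: the `Note` type, the 16-step grid per bar, the token names, and both function names are assumptions made for the example.

```python
from typing import NamedTuple


class Note(NamedTuple):
    bar: int       # zero-based bar index
    position: int  # zero-based step within the bar (16-step grid, assumed)
    pitch: int     # MIDI pitch number
    duration: int  # length in grid steps
    velocity: int  # MIDI velocity


def to_beat_based_tokens(notes):
    """Flatten notes into one token per attribute, with Bar/Position markers."""
    tokens = []
    current_bar = -1
    for note in sorted(notes):
        # Emit one Bar token per bar line crossed, including empty bars.
        while current_bar < note.bar:
            tokens.append("Bar")
            current_bar += 1
        tokens.append(f"Position_{note.position}")
        tokens.append(f"Pitch_{note.pitch}")
        tokens.append(f"Duration_{note.duration}")
        tokens.append(f"Velocity_{note.velocity}")
    return tokens


def to_compound_steps(notes):
    """Group every attribute of a note into a single step (one tuple per note)."""
    steps = []
    current_bar = -1
    for note in sorted(notes):
        while current_bar < note.bar:
            # Bar steps carry no note attributes, so the other fields are None.
            steps.append(("Bar", None, None, None, None))
            current_bar += 1
        steps.append(
            ("Note", note.position, note.pitch, note.duration, note.velocity)
        )
    return steps


# Hypothetical two-bar fragment: three notes in bar 0, one note in bar 1.
notes = [
    Note(bar=0, position=0, pitch=60, duration=4, velocity=80),
    Note(bar=0, position=4, pitch=64, duration=4, velocity=72),
    Note(bar=0, position=8, pitch=67, duration=8, velocity=76),
    Note(bar=1, position=0, pitch=72, duration=16, velocity=90),
]

flat = to_beat_based_tokens(notes)
steps = to_compound_steps(notes)

print(len(flat), len(steps))  # 18 6
```

In this toy fragment the grouped form needs 6 sequence steps where the flat form needs 18. A model over the grouped form would have to predict several attributes per step (for example with one output head per attribute), which the sketch does not implement.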