Automatic Music Composition with Transformers
Yi-Hsuan Yang
DOI: https://doi.org/10.1145/3463946.3469111
Journal: International Journal of Mobile Computing and Multimedia Communications
Published: 2021-08-21
Citations: 1
Abstract
In this talk, I will first give a brief overview of recent deep learning-based approaches to automatic music generation in the symbolic domain. I will then discuss our own research, which employs self-attention-based architectures, a.k.a. Transformers, for symbolic music generation. A naive approach with Transformers would treat music as a sequence of text-like tokens. However, our research demonstrates that Transformers can generate higher-quality music when music is not treated simply as text. In particular, our Pop Music Transformer model, published at ACM Multimedia 2020, employs a novel beat-based representation of music that informs self-attention models of the bar-beat metrical structure present in music. This approach greatly improves the rhythmic structure of the generated music. A more recent model, the Compound Word Transformer, which we published at AAAI 2021, exploits the fact that a musical note is associated with multiple attributes such as pitch, duration, and velocity. Instead of predicting the tokens corresponding to these attributes one by one at inference time, the Compound Word Transformer predicts them jointly, greatly reducing the sequence length needed to model a full-length song and making it easier to model the dependencies among these attributes.
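The effect of grouping note attributes on sequence length can be illustrated with a small sketch. It is not the tokenization used in either paper: the `Note` fields, the token names, the 16-step grid, and both helper functions are assumptions made for illustration. The flat encoding emits one token per attribute, with a bar marker at each new bar, while the grouped encoding emits one tuple per note.

```python
from typing import List, NamedTuple, Tuple


class Note(NamedTuple):
    """A hypothetical note record; the field choice is an assumption."""
    bar: int       # 0-based bar index
    position: int  # 0-based step within the bar, on an assumed 16-step grid
    pitch: int     # MIDI pitch number
    duration: int  # length in grid steps
    velocity: int  # MIDI velocity


def _ordered(notes: List[Note]) -> List[Note]:
    # Sort by time first so bar markers are emitted in order.
    return sorted(notes, key=lambda n: (n.bar, n.position, n.pitch))


def to_flat_tokens(notes: List[Note]) -> List[str]:
    """Text-like encoding: one token per attribute, plus a bar marker.

    Simplification: empty bars are not represented.
    """
    tokens: List[str] = []
    current_bar = None
    for n in _ordered(notes):
        if n.bar != current_bar:
            tokens.append("Bar")
            current_bar = n.bar
        tokens += [
            f"Position_{n.position}",
            f"Pitch_{n.pitch}",
            f"Duration_{n.duration}",
            f"Velocity_{n.velocity}",
        ]
    return tokens


def to_compound_tokens(notes: List[Note]) -> List[Tuple]:
    """Grouped encoding: all attributes of a note share one sequence step."""
    compound: List[Tuple] = []
    current_bar = None
    for n in _ordered(notes):
        if n.bar != current_bar:
            compound.append(("Bar",))
            current_bar = n.bar
        compound.append(("Note", n.position, n.pitch, n.duration, n.velocity))
    return compound


# Four made-up notes spread over two bars.
notes = [
    Note(bar=0, position=0, pitch=60, duration=4, velocity=80),
    Note(bar=0, position=8, pitch=64, duration=4, velocity=76),
    Note(bar=1, position=0, pitch=67, duration=8, velocity=84),
    Note(bar=1, position=8, pitch=72, duration=8, velocity=90),
]

flat = to_flat_tokens(notes)
compound = to_compound_tokens(notes)
print(len(flat), len(compound))  # prints "18 6"
```

For these four notes the flat sequence has 18 steps and the grouped sequence has 6. A model over the grouped sequence would need one output per attribute at each step, which this sketch does not attempt to show.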