Music genre classification using convolution temporal pooling network

Authors: Vijayameenakshi T. M, Swapna T. R
DOI: 10.1007/s11042-024-20163-5
Journal: Multimedia Tools and Applications (JCR Q2, Computer Science, Information Systems; Impact Factor 3.0)
Published: 2024-09-02 (Journal Article)
Music genre classification is one of the most interesting topics in digital music. Classifying genres is inherently subjective: different listeners may perceive genres in different ways, and some songs are difficult to classify accurately because they span several genres. Genres are broad, ill-defined categories, which makes them problematic; genre-based measures are therefore inherently coarse and imprecise, and not every piece of music fits cleanly into a single genre. Many deep-neural-network approaches perform sound recognition and classification on image representations of audio, which leave the time–frequency representation of the signal unchanged. The traditional method applies waveform augmentation to the audio signal, thereby speeding up network training. This paper considers music genre classification with the convolution temporal pooling framework and explores the impact of applying the SpecAugment method to augment the spectrogram itself. The augmented spectrogram is then fed into a convolutional temporal pooling network, in which the convolutional and temporal pooling layers identify genre patterns and classify songs by genre. The model also detects duplicates that occur in the given samples. We apply this model to the GTZAN dataset, a widely used benchmark for music genre classification. The method improves the identification of Rock and Pop songs and also eliminates duplicated songs. The trained model reports a training accuracy of 0.75 on 30-s audio files.
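The two ingredients described above — SpecAugment-style spectrogram masking and temporal pooling of convolutional feature maps — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the mask widths, array sizes, and mean pooling choice are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, freq_mask_width=8, time_mask_width=16):
    """SpecAugment-style masking: zero out one random frequency band
    and one random time band of a (freq_bins, time_frames) spectrogram."""
    out = spec.copy()
    n_freq, n_time = out.shape
    f0 = rng.integers(0, n_freq - freq_mask_width)
    out[f0:f0 + freq_mask_width, :] = 0.0          # frequency mask
    t0 = rng.integers(0, n_time - time_mask_width)
    out[:, t0:t0 + time_mask_width] = 0.0          # time mask
    return out

def temporal_mean_pool(feature_maps):
    """Collapse the time axis of conv feature maps (channels, time)
    into a fixed-length clip embedding, as a temporal pooling layer does."""
    return feature_maps.mean(axis=-1)

# Toy mel spectrogram: 64 mel bins x 128 time frames.
mel = rng.random((64, 128)).astype(np.float32)
aug = spec_augment(mel)

# Pool a fake (channels=32, time=128) conv output to a 32-d embedding
# that a downstream genre classifier could consume.
emb = temporal_mean_pool(rng.random((32, 128)))
print(emb.shape)  # (32,)
```

Mean pooling over time makes the embedding length independent of clip duration, which is what lets a fixed-size classifier head handle variable-length audio.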
About the journal:
Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists and engineers who are involved in multimedia system research, design and applications. All papers are peer reviewed.
Specific areas of interest include:
- Multimedia Tools:
- Multimedia Applications:
- Prototype multimedia systems and platforms