{"title":"利用 LLM 和音频信号处理增强音乐情绪识别:多模态方法","authors":"Prof. R.Y. Sable, Aqsa Sayyed, Baliraje Kalyane, Kosheen Sadhu, Prathamesh Ghatole","doi":"10.22214/ijraset.2024.63590","DOIUrl":null,"url":null,"abstract":"Abstract: Music Mood Recognition aims to allow computers to understand the emotions behind music the way humans do, in order to facilitate better perception of media by computers to aid in enhanced services like music recommendations, therapeutic interventions, and Human Computer Interaction. In this paper, we propose a novel approach to improving Music Mood Recognition using a multi-modal model that uses lyrical and audio features of a song. Lyrical features are analysed using stateof-the-art open-source Large Language Models like Microsoft Phi-3 to classify lyrics from one of the four possible emotion categories as per the James Russel Circumplex Model. Audio features are used to train a Deep Learning (ConvNet) model to predict emotion classes. A multimodal combiner model with Audio and Lyrics is then trained and deployed to enable accurate predictions. The dataset used in this research is “MoodyLyrics”, a collection of 2000+ songs classified with one of 4 possible emotion classes as per the James Russel Circumplex Model. Due to compute limitations, we are using a balanced set of 1000 songs to train and test our models. The workin this paper outperforms most other multimodal researches by allowing higher accuracies with universal language support","PeriodicalId":13718,"journal":{"name":"International Journal for Research in Applied Science and Engineering Technology","volume":"28 7","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing Music Mood Recognition with LLMs and Audio Signal Processing: A Multimodal Approach\",\"authors\":\"Prof. R.Y. Sable, Aqsa Sayyed, Baliraje Kalyane, Kosheen Sadhu, Prathamesh Ghatole\",\"doi\":\"10.22214/ijraset.2024.63590\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract: Music Mood Recognition aims to allow computers to understand the emotions behind music the way humans do, in order to facilitate better perception of media by computers to aid in enhanced services like music recommendations, therapeutic interventions, and Human Computer Interaction. In this paper, we propose a novel approach to improving Music Mood Recognition using a multi-modal model that uses lyrical and audio features of a song. Lyrical features are analysed using stateof-the-art open-source Large Language Models like Microsoft Phi-3 to classify lyrics from one of the four possible emotion categories as per the James Russel Circumplex Model. Audio features are used to train a Deep Learning (ConvNet) model to predict emotion classes. A multimodal combiner model with Audio and Lyrics is then trained and deployed to enable accurate predictions. The dataset used in this research is “MoodyLyrics”, a collection of 2000+ songs classified with one of 4 possible emotion classes as per the James Russel Circumplex Model. Due to compute limitations, we are using a balanced set of 1000 songs to train and test our models. The workin this paper outperforms most other multimodal researches by allowing higher accuracies with universal language support\",\"PeriodicalId\":13718,\"journal\":{\"name\":\"International Journal for Research in Applied Science and Engineering Technology\",\"volume\":\"28 7\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal for Research in Applied Science and Engineering Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22214/ijraset.2024.63590\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal for Research in Applied Science and Engineering Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22214/ijraset.2024.63590","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Enhancing Music Mood Recognition with LLMs and Audio Signal Processing: A Multimodal Approach
Abstract: Music Mood Recognition aims to allow computers to understand the emotions behind music the way humans do, in order to facilitate better perception of media by computers to aid in enhanced services like music recommendations, therapeutic interventions, and Human Computer Interaction. In this paper, we propose a novel approach to improving Music Mood Recognition using a multi-modal model that uses lyrical and audio features of a song. Lyrical features are analysed using stateof-the-art open-source Large Language Models like Microsoft Phi-3 to classify lyrics from one of the four possible emotion categories as per the James Russel Circumplex Model. Audio features are used to train a Deep Learning (ConvNet) model to predict emotion classes. A multimodal combiner model with Audio and Lyrics is then trained and deployed to enable accurate predictions. The dataset used in this research is “MoodyLyrics”, a collection of 2000+ songs classified with one of 4 possible emotion classes as per the James Russel Circumplex Model. Due to compute limitations, we are using a balanced set of 1000 songs to train and test our models. The workin this paper outperforms most other multimodal researches by allowing higher accuracies with universal language support