Enhancing Music Mood Recognition with LLMs and Audio Signal Processing: A Multimodal Approach

International Journal for Research in Applied Science and Engineering Technology. Pub Date: 2024-07-31. DOI: 10.22214/ijraset.2024.63590

Prof. R.Y. Sable, Aqsa Sayyed, Baliraje Kalyane, Kosheen Sadhu, Prathamesh Ghatole



Abstract

Music Mood Recognition aims to let computers understand the emotions behind music the way humans do, improving machine perception of media in support of services such as music recommendation, therapeutic interventions, and human-computer interaction. In this paper, we propose a novel approach to improving Music Mood Recognition using a multimodal model that combines the lyrical and audio features of a song. Lyrical features are analysed using state-of-the-art open-source Large Language Models such as Microsoft Phi-3 to classify lyrics into one of four emotion categories defined by James Russell's Circumplex Model. Audio features are used to train a deep learning (ConvNet) model to predict emotion classes. A multimodal combiner model over audio and lyrics is then trained and deployed to produce the final predictions. The dataset used in this research is “MoodyLyrics”, a collection of 2000+ songs, each labelled with one of the 4 emotion classes of the Circumplex Model. Due to compute limitations, we use a balanced set of 1000 songs to train and test our models. The work in this paper outperforms most other multimodal approaches, achieving higher accuracy with universal language support.
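As a rough illustration of the pipeline described in the abstract, the sketch below maps a valence/arousal point to a Circumplex quadrant and fuses per-class probabilities from an audio model and a lyrics model. It is not the authors' implementation: the paper trains a combiner model, whereas this sketch substitutes a simple weighted average (late fusion). The class names, the function names, and the `audio_weight` parameter are all assumptions made for illustration.

```python
from typing import Dict, Sequence

# Assumed quadrant labels; the abstract only states that there are four classes.
QUADRANTS = ("happy", "angry", "sad", "relaxed")


def quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to one of the four Circumplex quadrants."""
    if valence >= 0:
        return "happy" if arousal >= 0 else "relaxed"
    return "angry" if arousal >= 0 else "sad"


def late_fusion(
    audio_probs: Sequence[float],
    lyric_probs: Sequence[float],
    audio_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> Dict[str, float]:
    """Combine two per-class probability vectors by weighted averaging.

    This stands in for the trained combiner model mentioned in the abstract.
    """
    if len(audio_probs) != len(QUADRANTS) or len(lyric_probs) != len(QUADRANTS):
        raise ValueError("expected one probability per quadrant")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = [
        audio_weight * a + (1.0 - audio_weight) * l
        for a, l in zip(audio_probs, lyric_probs)
    ]
    total = sum(fused)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalise so the fused scores form a probability distribution.
    return {label: p / total for label, p in zip(QUADRANTS, fused)}


def predict(audio_probs: Sequence[float], lyric_probs: Sequence[float]) -> str:
    """Return the label with the highest fused probability."""
    fused = late_fusion(audio_probs, lyric_probs)
    return max(fused, key=fused.get)
```

In practice, `audio_probs` would come from the ConvNet's softmax output and `lyric_probs` from the LLM's classification of the lyrics; a learned combiner can weight the two modalities per class rather than using a single fixed weight.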


