{"title":"MusicTalk: A Microservice Approach for Musical Instrument Recognition","authors":"Yi-Bing Lin;Chang-Chieh Cheng;Shih-Chuan Chiu","doi":"10.1109/OJCS.2024.3476416","DOIUrl":null,"url":null,"abstract":"Musical instrument recognition is the process of using machine learning or audio signal processing to identify and classify different musical instruments from an audio recording. This capability enables more precise analysis of musical pieces, aiding in tasks like transcription, music recommendation, and automated composition. The challenges include (1) recognition models not being accurate enough, (2) the need to retrain the entire model when a new instrument is added, and (3) differences in audio formats that prevent direct usage. To address these challenges, this article introduces MusicTalk, a microservice based musical instrument (MI) detection system, with several key contributions. Firstly, MusicTalk introduces a novel patchout mechanism named Brightness Characteristic Based Patchout for the ViT algorithm, which enhances MI detection accuracy compared to existing solutions. Secondly, MusicTalk integrates individual MI detectors as microservices, facilitating efficient interaction with other microservices. Thirdly, MusicTalk incorporates an audio shaper that unifies diverse music open datasets such as Audioset, Openmic-2018, MedleyDB, URMP, and INSTDB. By employing Grad-CAM analysis on Mel-Spectrograms, we elucidate the characteristics of the MI detection model. This analysis allows us to optimize ensemble combinations of ViT with patchout and CNNs within MusicTalk, resulting in high accuracy rates. For instance, the system achieves precision and recall rates of 96.17% and 95.77% respectively for violin detection, which are the highest among previous approaches. An additional advantage of MusicTalk lies in its microservice-driven visualization capabilities. By integrating MI detectors as microservices, MusicTalk enables seamless visualization of songs using animated avatars. In a case study featuring “Peter and the Wolf,” we demonstrate that improved MI detection accuracy enhances the visual storytelling impact of music. The overall F1-score improvement of MusicTalk over previous approaches for this song is up to 12%.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"5 ","pages":"612-623"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10709650","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10709650/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Musical instrument recognition is the process of using machine learning or audio signal processing to identify and classify different musical instruments from an audio recording. This capability enables more precise analysis of musical pieces, aiding in tasks like transcription, music recommendation, and automated composition. The challenges include (1) recognition models that are not accurate enough, (2) the need to retrain the entire model when a new instrument is added, and (3) differences in audio formats that prevent datasets from being used directly. To address these challenges, this article introduces MusicTalk, a microservice-based musical instrument (MI) detection system, with several key contributions. First, MusicTalk introduces a novel patchout mechanism for the ViT algorithm, named Brightness Characteristic Based Patchout, which improves MI detection accuracy over existing solutions. Second, MusicTalk integrates individual MI detectors as microservices, facilitating efficient interaction with other microservices. Third, MusicTalk incorporates an audio shaper that unifies diverse open music datasets such as Audioset, Openmic-2018, MedleyDB, URMP, and INSTDB. By applying Grad-CAM analysis to Mel-spectrograms, we elucidate the characteristics of the MI detection model. This analysis allows us to optimize ensemble combinations of ViT with patchout and CNNs within MusicTalk, resulting in high accuracy. For instance, the system achieves precision and recall of 96.17% and 95.77%, respectively, for violin detection, the highest among previous approaches. An additional advantage of MusicTalk lies in its microservice-driven visualization capabilities. By integrating MI detectors as microservices, MusicTalk enables seamless visualization of songs using animated avatars. In a case study featuring “Peter and the Wolf,” we demonstrate that improved MI detection accuracy enhances the visual storytelling impact of music. For this song, MusicTalk's overall F1-score improvement over previous approaches is up to 12%.
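
To make the patchout idea concrete, the sketch below illustrates one plausible reading of a brightness-based patchout step: non-overlapping Mel-spectrogram patches are ranked by their mean magnitude ("brightness") and only the brightest fraction is retained before the patches would be embedded and fed to a ViT encoder. This is a minimal illustration under assumed parameters; the function name, the 16x16 patch size, and the keep_ratio threshold are illustrative choices, not the exact mechanism described in the paper.

```python
import numpy as np

def brightness_patchout(mel_spec, patch_size=(16, 16), keep_ratio=0.7):
    """Rank non-overlapping Mel-spectrogram patches by mean magnitude
    ("brightness") and keep only the brightest fraction.

    mel_spec   : 2-D array (mel_bins, time_frames), e.g. a log-Mel spectrogram
    patch_size : (freq, time) size of each ViT patch (assumed value)
    keep_ratio : fraction of patches retained for the ViT encoder (assumed value)
    """
    ph, pw = patch_size
    # Crop so the spectrogram divides evenly into patches.
    h = (mel_spec.shape[0] // ph) * ph
    w = (mel_spec.shape[1] // pw) * pw
    spec = mel_spec[:h, :w]

    # Split the spectrogram into a grid of (ph x pw) patches.
    patches = spec.reshape(h // ph, ph, w // pw, pw).transpose(0, 2, 1, 3)
    patches = patches.reshape(-1, ph, pw)          # (num_patches, ph, pw)

    # "Brightness" of a patch = its mean energy.
    brightness = patches.mean(axis=(1, 2))

    # Keep the brightest patches, preserving their original order.
    k = max(1, int(len(patches) * keep_ratio))
    keep_idx = np.sort(np.argsort(brightness)[-k:])
    return patches[keep_idx], keep_idx

if __name__ == "__main__":
    # Dummy log-Mel spectrogram: 128 Mel bins x 256 frames.
    mel = np.random.rand(128, 256).astype(np.float32)
    kept, idx = brightness_patchout(mel, keep_ratio=0.5)
    print(kept.shape)   # about half of the 128 patches survive
```

In this reading, dropping low-energy patches reduces the ViT's input sequence length while biasing attention toward the spectrogram regions that actually carry instrument timbre, which is consistent with the paper's use of Grad-CAM on Mel-spectrograms to explain what the detector attends to.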