{"title":"基于注意力机制的两阶段视听融合钢琴转写模型","authors":"Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng","doi":"10.1109/TASLP.2024.3426303","DOIUrl":null,"url":null,"abstract":"Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet there is a small number of studies based on audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3618-3630"},"PeriodicalIF":4.1000,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10614622","citationCount":"0","resultStr":"{\"title\":\"A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism\",\"authors\":\"Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng\",\"doi\":\"10.1109/TASLP.2024.3426303\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet there is a small number of studies based on audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. 
Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"3618-3630\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2024-07-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10614622\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10614622/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10614622/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
Piano transcription is a significant problem in music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription using either audio or visual information, and only a small number of studies have explored audio-visual fusion. To leverage the complementary strengths of the two modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism that uses both the audio and the visual information of a piano performance. In the first stage, we propose an audio model and a visual model. The audio model uses frequency-domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model combines CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between the two modalities and the temporal relationships within the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, a significant improvement over the single-modal transcription models.
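To give a concrete picture of the second-stage fusion described in the abstract, the following is a minimal PyTorch sketch of cross-attention between the per-frame audio and visual embeddings produced by the stage-1 models, followed by a temporal encoder and a frame-wise 88-key prediction head. The module names, the 256-dimensional embeddings, and the output head are illustrative assumptions, not the authors' implementation, which is detailed in the full paper.

# Hedged sketch of stage-2 cross-attention fusion; all names and sizes are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse per-frame audio and visual embeddings with cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_keys: int = 88):
        super().__init__()
        # Audio frames attend to visual frames, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A temporal encoder models relationships along the fused sequence.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(2 * d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        # Frame-wise piano-roll prediction: one logit per key per frame.
        self.head = nn.Linear(2 * d_model, n_keys)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat, visual_feat: (batch, frames, d_model) from the stage-1 models.
        a, _ = self.audio_to_visual(query=audio_feat, key=visual_feat, value=visual_feat)
        v, _ = self.visual_to_audio(query=visual_feat, key=audio_feat, value=audio_feat)
        fused = torch.cat([a, v], dim=-1)        # (batch, frames, 2 * d_model)
        fused = self.temporal(fused)             # temporal relationships of the sequence
        return torch.sigmoid(self.head(fused))   # per-frame key activations


if __name__ == "__main__":
    fusion = CrossModalFusion()
    audio = torch.randn(2, 100, 256)   # dummy stage-1 audio embeddings
    video = torch.randn(2, 100, 256)   # dummy stage-1 visual embeddings
    print(fusion(audio, video).shape)  # torch.Size([2, 100, 88])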
Journal Introduction:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.