Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii
{"title":"基于CRNN-HSMM混合模型的音频到乐谱唱歌转录","authors":"Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii","doi":"10.1017/ATSIP.2021.4","DOIUrl":null,"url":null,"abstract":"This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2021-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1017/ATSIP.2021.4","citationCount":"10","resultStr":"{\"title\":\"Audio-to-score singing transcription based on a CRNN-HSMM hybrid model\",\"authors\":\"Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii\",\"doi\":\"10.1017/ATSIP.2021.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. 
The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.\",\"PeriodicalId\":44812,\"journal\":{\"name\":\"APSIPA Transactions on Signal and Information Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2021-04-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1017/ATSIP.2021.4\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"APSIPA Transactions on Signal and Information Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1017/ATSIP.2021.4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"APSIPA Transactions on Signal and Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/ATSIP.2021.4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
Citations: 10

Abstract

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and that the integration of the musical language model with the acoustic model has a positive effect on the AST performance.
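The decoding step named in the abstract can be made concrete with a toy example. The Python sketch below (a minimal illustration under assumed inputs, not the authors' implementation) runs Viterbi decoding over a hidden semi-Markov model in which each hidden state is a note pitch: log_obs stands in for the CRNN's per-frame acoustic scores, while log_trans and log_dur play the roles of the semi-Markov language model's pitch-transition and note-duration terms. All array names, the flat prior over the first note, and the frame-level duration cap are assumptions of this sketch.

```python
import numpy as np

def hsmm_viterbi(log_obs, log_trans, log_dur, max_dur):
    """Exact Viterbi decoding for a toy hidden semi-Markov model.

    log_obs   : (T, N) per-frame log-scores of N pitches (stand-in for CRNN output)
    log_trans : (N, N) log-probabilities of pitch-to-pitch transitions
    log_dur   : (N, max_dur) log-probability that a note of pitch n lasts d+1 frames
    Returns a list of segments (pitch, start_frame, end_frame), frames inclusive.
    """
    T, N = log_obs.shape
    cum = np.cumsum(log_obs, axis=0)        # cum[t] = summed scores of frames 0..t
    delta = np.full((T, N), -np.inf)        # best score of a note ending at frame t
    back = {}                               # (t, n) -> (end, pitch) of previous note
    for t in range(T):
        for d in range(1, min(max_dur, t + 1) + 1):
            s = t - d                       # previous note ends at s; this one covers s+1..t
            seg = cum[t] - (cum[s] if s >= 0 else 0.0)
            if s < 0:                       # first note of the piece (flat prior assumed)
                cand = seg + log_dur[:, d - 1]
                for n in np.flatnonzero(cand > delta[t]):
                    delta[t, n] = cand[n]
                    back[(t, n)] = None
            else:                           # pick the best predecessor pitch m for each pitch n
                prev = delta[s][:, None] + log_trans
                m = prev.argmax(axis=0)
                cand = prev[m, np.arange(N)] + seg + log_dur[:, d - 1]
                for n in np.flatnonzero(cand > delta[t]):
                    delta[t, n] = cand[n]
                    back[(t, n)] = (s, int(m[n]))
    # Trace the best segmentation back from the final frame.
    segs, t, n = [], T - 1, int(delta[T - 1].argmax())
    while True:
        prev = back[(t, n)]
        start = 0 if prev is None else prev[0] + 1
        segs.append((n, start, t))
        if prev is None:
            return segs[::-1]
        t, n = prev
```

For instance, with N = 128 MIDI pitches, 10 ms frames, and max_dur = 100 (notes capped at one second), a returned segment (60, 0, 49) reads as MIDI note 60 sung over frames 0-49; the run time is O(T · max_dur · N²). In the actual paper the language model is additionally conditioned on musical keys and the acoustic scores come from the trained CRNN, but the decoding is likewise Viterbi over a semi-Markov structure.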
Source journal: APSIPA Transactions on Signal and Information Processing (Engineering, Electrical & Electronic)
CiteScore: 8.60
Self-citation rate: 6.20%
Articles published: 30
Review time: 40 weeks