{"title":"Effects of F0 Estimation Algorithms on Ultrasound-Based Silent Speech Interfaces","authors":"Peng Dai, M. Al-Radhi, T. Csapó","doi":"10.1109/sped53181.2021.9587434","DOIUrl":null,"url":null,"abstract":"This paper shows recent Silent Speech Interface (SSI) progress that translates tongue motions into audible speech. In our previous work and also in the current study, the prediction of fundamental frequency (F0) from Ultra-Sound Tongue Images (UTI) was achieved using articulatory-to-acoustic mapping methods based on deep learning. Here we investigated several traditional discontinuous speech-based F0 estimation algorithms for the target of UTI-based SSI system. Besides, the vocoder parameters (F0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using deep neural networks, with UTI as input. We found that those discontinuous F0 algorithms are predicted with a lower error during the articulatory-to-acoustic mapping experiments. They result in slightly more natural synthesized speech than the baseline continuous F0 algorithm. Moreover, experimental results confirmed that discontinuous algorithms (e.g. Yin) are closest to original speech in objective metrics and subjective listening test.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/sped53181.2021.9587434","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper presents recent progress on Silent Speech Interfaces (SSI), which translate tongue motion into audible speech. As in our previous work, the fundamental frequency (F0) is predicted from Ultrasound Tongue Images (UTI) using deep-learning-based articulatory-to-acoustic mapping. Here, we investigate several traditional discontinuous F0 estimation algorithms as prediction targets for a UTI-based SSI system. In addition, the vocoder parameters (F0, Maximum Voiced Frequency, and Mel-Generalized Cepstrum) are predicted by deep neural networks with UTI as input. In the articulatory-to-acoustic mapping experiments, we found that the discontinuous F0 targets were predicted with lower error than the baseline continuous F0 algorithm and yielded slightly more natural synthesized speech. Moreover, the experimental results confirmed that discontinuous algorithms (e.g., Yin) were closest to the original speech in both objective metrics and a subjective listening test.
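To make the pipeline described in the abstract concrete, below is a minimal sketch (not the authors' code) of the two ingredients it names: deriving a discontinuous F0 contour with the Yin algorithm (here via librosa.yin) and a feed-forward network that maps UTI frames to per-frame vocoder parameters. File names, image resolution, frame settings, network sizes, and the MGC order are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a UTI-based articulatory-to-acoustic mapping setup.
# Assumptions: 64x128 UTI frames, ~10 ms analysis frames, 25 MGC coefficients.

import librosa
import torch
import torch.nn as nn

# --- Discontinuous F0 targets from the parallel speech recording ------------
# Yin returns a per-frame F0 estimate; frames judged unvoiced can be marked
# (e.g., set to 0) so the vocoder treats them as unvoiced excitation.
wav, sr = librosa.load("speaker_utterance.wav", sr=22050)  # hypothetical file
f0 = librosa.yin(wav, fmin=60, fmax=400, sr=sr,
                 frame_length=2048, hop_length=220)        # ~10 ms hop

# --- Articulatory-to-acoustic mapping network --------------------------------
class UTI2Vocoder(nn.Module):
    """Maps a flattened UTI frame to per-frame vocoder parameters:
    F0 (1) + Maximum Voiced Frequency (1) + MGC (25, assumed order)."""
    def __init__(self, in_dim=64 * 128, out_dim=1 + 1 + 25, hidden=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = UTI2Vocoder()
uti_batch = torch.randn(32, 64 * 128)   # placeholder UTI pixel vectors
vocoder_params = model(uti_batch)       # [F0, MVF, MGC...] per frame
```

In such a setup the network would be trained with a regression loss (e.g., MSE) against vocoder parameters extracted from the parallel audio, and at synthesis time the predicted F0, MVF, and MGC streams would drive the vocoder to produce speech from silent UTI input.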