Autoregressive Articulatory WaveNet Flow for Speaker-Independent Acoustic-to-Articulatory Inversion
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587350
Narjes Bozorg, Michael T. Johnson, M. Soleymanpour
In this paper, we introduce a new speaker-independent method for acoustic-to-articulatory inversion. The proposed architecture, Speaker Independent-Articulatory WaveNet (SI-AWN), models the relationship between acoustic and articulatory features by conditioning the articulatory trajectories on acoustic features; the trained structure is then applied to unseen target speakers. We evaluate SI-AWN on the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE), pooling the acoustic-articulatory information of 35 reference speakers and testing on target speakers that include male, female, native, and non-native speakers. The results suggest that SI-AWN improves acoustic-to-articulatory inversion performance over the baseline Maximum Likelihood Linear Regression-Parallel Reference Speaker Weighting (MLLR-PRSW) method by 21 percent. To the best of our knowledge, this is the first application of a WaveNet-like synthesis approach to speaker-independent acoustic-to-articulatory inversion, and the results are comparable to or better than the best currently published systems.
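A minimal sketch of the core idea described in the abstract, not the authors' implementation: a stack of dilated causal 1-D convolutions predicts articulatory frames from past articulatory context while being locally conditioned on time-aligned acoustic features. The feature dimensions, channel counts, and dilation schedule below are assumptions chosen only for illustration.

```python
# Illustrative sketch of WaveNet-style conditioning of articulatory trajectories
# on acoustic features (dimensions and hyperparameters are assumed, not from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Conv1d):
    """1-D convolution padded on the left so output at time t only sees inputs <= t."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):
        return super().forward(F.pad(x, (self.left_pad, 0)))


class ConditionedBlock(nn.Module):
    """Gated residual block: tanh/sigmoid gates over past articulatory context,
    each shifted by a 1x1-projected acoustic conditioning signal."""
    def __init__(self, channels, cond_dim, dilation):
        super().__init__()
        self.filter_conv = CausalConv1d(channels, channels, 2, dilation)
        self.gate_conv = CausalConv1d(channels, channels, 2, dilation)
        self.cond_filter = nn.Conv1d(cond_dim, channels, 1)
        self.cond_gate = nn.Conv1d(cond_dim, channels, 1)
        self.residual = nn.Conv1d(channels, channels, 1)

    def forward(self, x, cond):
        f = torch.tanh(self.filter_conv(x) + self.cond_filter(cond))
        g = torch.sigmoid(self.gate_conv(x) + self.cond_gate(cond))
        return x + self.residual(f * g)


class ArticulatoryWaveNetSketch(nn.Module):
    """Predicts articulatory frames from past articulatory frames,
    conditioned frame-by-frame on acoustic features (e.g., MFCCs)."""
    def __init__(self, art_dim=12, acoustic_dim=39, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.input_proj = nn.Conv1d(art_dim, channels, 1)
        self.blocks = nn.ModuleList(
            ConditionedBlock(channels, acoustic_dim, d) for d in dilations
        )
        self.output_proj = nn.Conv1d(channels, art_dim, 1)

    def forward(self, past_articulatory, acoustic):
        # past_articulatory: (batch, art_dim, T); acoustic: (batch, acoustic_dim, T)
        h = self.input_proj(past_articulatory)
        for block in self.blocks:
            h = block(h, acoustic)
        return self.output_proj(h)  # per-frame articulatory predictions


if __name__ == "__main__":
    model = ArticulatoryWaveNetSketch()
    ema = torch.randn(2, 12, 200)    # 12 assumed EMA channels, 200 frames
    mfcc = torch.randn(2, 39, 200)   # 39-dim acoustic features, time-aligned
    print(model(ema, mfcc).shape)    # torch.Size([2, 12, 200])
```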
{"title":"Autoregressive Articulatory WaveNet Flow for Speaker-Independent Acoustic-to-Articulatory Inversion","authors":"Narjes Bozorg, Michael T. Johnson, M. Soleymanpour","doi":"10.1109/sped53181.2021.9587350","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587350","url":null,"abstract":"In this paper we introduce a new speaker independent method for Acoustic-to-Articulatory Inversion. The proposed architecture, Speaker Independent-Articulatory WaveNet (SI-AWN), models the relationship between acoustic and articulatory features by conditioning the articulatory trajectories on acoustic features and then utilizes the structure for unseen target speakers. We evaluate the proposed SI-AWN on the Electro Magnetic Articulography corpus of Mandarin Accented English (EMA-MAE), using the pool of acoustic-articulatory information from 35 reference speakers and testing on target speakers that include male, female, native and non-native speakers. The results suggest that SI-AWN improves the performance of the acoustic-to-articulatory inversion process compared to the baseline Maximum Likelihood Regression-Parallel Reference Speaker Weighting (MLLR-PRSW) method by 21 percent. To the best of our knowledge, this is the first application of a WaveNet-like synthesis approach to the problem of Speaker Independent Acoustic-to-Articulatory Inversion, and results are comparable to or better than the best currently published systems.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126744968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Infant Vocal Tract Development Analysis and Diagnosis by Cry Signals with CNN Age Classification
Pub Date: 2021-04-23 | DOI: 10.1109/sped53181.2021.9587391
Chunyan Ji, Yi Pan
From crying to babbling and then to speech, the infant vocal tract undergoes anatomical restructuring. In this paper, we propose a fast, non-invasive method that uses infant cry signals with convolutional neural network (CNN) based age classification to diagnose abnormal vocal tract development as early as 4 months of age. We study F0, F1, F2, and spectrograms of the audio signals and relate them to the postnatal development of infant vocalization. We perform two age-classification experiments: a vocal tract development experiment and a vocal tract development diagnosis experiment. The vocal tract development experiment, trained on the Baby2020 database, uncovers the pattern and tendency of vocal tract changes, and the result matches the anatomical development of the vocal tract. The vocal tract development diagnosis experiment predicts abnormal infant vocal tract development by classifying cry signals into a younger age category. The diagnosis model is trained on healthy infant cries from the Baby2020 database; cries from other infants in the Baby2020 and Baby Chillanto databases are used as test sets. The diagnosis experiment yields 79.20% accuracy on healthy infants, while 84.80% of asphyxiated infant cries and 91.20% of deaf infant cries are classified as younger than 4 months even though they come from infants up to 9 months old. These results indicate that such developmentally delayed cries are associated with abnormal vocal tract development.
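A minimal sketch, not the authors' model, of the two-part pipeline the abstract describes: a CNN classifies cry spectrograms into age bins, and a cry from an infant older than 4 months that falls into the youngest bin is flagged as potentially delayed. The spectrogram size, age bins, and layer sizes are assumptions.

```python
# Illustrative CNN age classifier over cry spectrograms plus the diagnosis rule
# described in the abstract (all dimensions and bins are assumed).
import torch
import torch.nn as nn

AGE_BINS = ["0-4 months", "4-8 months", "8-12 months"]  # assumed binning


class CryAgeCNN(nn.Module):
    def __init__(self, n_classes=len(AGE_BINS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, n_mels, n_frames), e.g. log-mel of a cry segment
        return self.classifier(self.features(spectrogram).flatten(1))


def flag_delayed_development(model, spectrogram, true_age_months):
    """Diagnosis rule: an infant older than 4 months whose cry is classified into
    the youngest bin is flagged for possible delayed vocal tract development."""
    model.eval()
    with torch.no_grad():
        predicted_bin = model(spectrogram).argmax(dim=1).item()
    return true_age_months > 4 and predicted_bin == 0


if __name__ == "__main__":
    model = CryAgeCNN()
    cry = torch.randn(1, 1, 64, 128)  # dummy 64-mel x 128-frame spectrogram
    print(AGE_BINS[model(cry).argmax(dim=1).item()],
          flag_delayed_development(model, cry, true_age_months=6))
```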
{"title":"Infant Vocal Tract Development Analysis and Diagnosis by Cry Signals with CNN Age Classification","authors":"Chunyan Ji, Yi Pan","doi":"10.1109/sped53181.2021.9587391","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587391","url":null,"abstract":"From crying to babbling and then to speech, infants’ vocal tract goes through anatomic restructuring. In this paper, we propose a non-invasive fast method of using infant cry signals with convolutional neural network (CNN) based age classification to diagnose the abnormality of vocal tract development as early as 4-month age. We study F0, F1, F2, spectrograms of the audio signals and relate them to the postnatal development of infant vocalization. We perform two age classification experiments: vocal tract development experiment and vocal tract development diagnosis experiment. The vocal tract development experiment trained on Baby2020 database discovers the pattern and tendency of the vocal tract changes, and the result matches the anatomical development of the vocal tract. The vocal tract development diagnosis experiment predicts the abnormality of infant vocal tract by classifying the cry signals into younger age category. The diagnosis model is trained on healthy infant cries from Baby2020 database. Cries from other infants in Baby2020 and Baby Chillanto database are used as testing sets. The diagnosis experiment yields 79.20% accuracy on healthy infants, 84.80% asphyxiated infant cries and 91.20% deaf cries are diagnosed as cries younger than 4-month although they are from infants up to 9-month-old. The results indicate the delayed developed cries are associated with abnormal vocal tract development.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123819171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effects of F0 Estimation Algorithms on Ultrasound-Based Silent Speech Interfaces
Pub Date: 2020-10-21 | DOI: 10.1109/sped53181.2021.9587434
Peng Dai, M. Al-Radhi, T. Csapó
This paper presents recent Silent Speech Interface (SSI) progress on translating tongue motion into audible speech. In our previous work, as in the current study, the fundamental frequency (F0) was predicted from Ultrasound Tongue Images (UTI) using deep-learning-based articulatory-to-acoustic mapping methods. Here, we investigate several traditional discontinuous F0 estimation algorithms as targets for a UTI-based SSI system. In addition, the vocoder parameters (F0, Maximum Voiced Frequency, and Mel-Generalized Cepstrum) are predicted using deep neural networks with UTI as input. We found that the discontinuous F0 contours are predicted with lower error in the articulatory-to-acoustic mapping experiments and result in slightly more natural synthesized speech than the baseline continuous F0 algorithm. Moreover, experimental results confirm that discontinuous algorithms (e.g., Yin) are closest to the original speech in both objective metrics and a subjective listening test.
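A minimal sketch of the kind of F0 targets being contrasted: a discontinuous contour that carries no F0 in unvoiced frames versus a continuous one interpolated through unvoiced regions. librosa's pYIN is used here only as a stand-in for the discontinuous estimators mentioned above (e.g., Yin); the audio file name and pitch range are assumptions.

```python
# Illustrative comparison of discontinuous vs. continuous F0 vocoder targets
# (file name and pitch range are assumptions; pYIN stands in for Yin-style estimation).
import numpy as np
import librosa


def f0_targets(wav_path="sample.wav", fmin=60.0, fmax=400.0):
    y, sr = librosa.load(wav_path, sr=None)

    # Discontinuous F0: NaN wherever the frame is judged unvoiced.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)

    # Continuous F0: linearly interpolate through unvoiced gaps so that every
    # frame carries a value, as a continuous baseline target would.
    frames = np.arange(len(f0))
    voiced = ~np.isnan(f0)
    f0_continuous = np.interp(frames, frames[voiced], f0[voiced])

    return f0, f0_continuous, voiced_flag


if __name__ == "__main__":
    f0_disc, f0_cont, voiced = f0_targets()
    print(f"{voiced.mean():.0%} voiced frames;",
          "discontinuous target has", int(np.isnan(f0_disc).sum()), "empty frames")
```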
{"title":"Effects of F0 Estimation Algorithms on Ultrasound-Based Silent Speech Interfaces","authors":"Peng Dai, M. Al-Radhi, T. Csapó","doi":"10.1109/sped53181.2021.9587434","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587434","url":null,"abstract":"This paper shows recent Silent Speech Interface (SSI) progress that translates tongue motions into audible speech. In our previous work and also in the current study, the prediction of fundamental frequency (F0) from Ultra-Sound Tongue Images (UTI) was achieved using articulatory-to-acoustic mapping methods based on deep learning. Here we investigated several traditional discontinuous speech-based F0 estimation algorithms for the target of UTI-based SSI system. Besides, the vocoder parameters (F0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using deep neural networks, with UTI as input. We found that those discontinuous F0 algorithms are predicted with a lower error during the articulatory-to-acoustic mapping experiments. They result in slightly more natural synthesized speech than the baseline continuous F0 algorithm. Moreover, experimental results confirmed that discontinuous algorithms (e.g. Yin) are closest to original speech in objective metrics and subjective listening test.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129316861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}