Artificial vocal learning guided by speech recognition: What it may tell us about how children learn to speak

Journal of Phonetics (IF 1.9; Zone 1, Literature; Language & Linguistics) | Pub Date: 2024-06-20 | DOI: 10.1016/j.wocn.2024.101338
Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Santitham Prom-on, Lorna F. Halliday, Yi Xu

Abstract

It has long been a mystery how children learn to speak without formal instruction. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has had difficulty overcoming the speaker normalisation problem, namely, discrepancies between children’s vocalisations and those of adults due to age-related anatomical differences. Here we show that vocal learning can be successfully simulated via recognition-guided vocal exploration without explicit speaker normalisation. We trained an articulatory synthesiser with three-dimensional vocal tract models of an adult and two child configurations of different ages to learn monosyllabic English words consisting of CVC syllables, based on coarticulatory dynamics and two kinds of auditory feedback: (i) acoustic features to simulate universal phonetic perception (or direct imitation), and (ii) a deep-learning-based speech recogniser to simulate native-language phonological perception. Native listeners were invited to evaluate the learned synthetic speech, with natural speech as a baseline reference. Results show that the English words trained with the speech recogniser were more intelligible than those trained with acoustic features, and sometimes close to natural speech in intelligibility. The successful simulation of vocal learning in this study suggests that a combination of coarticulatory dynamics and native-language phonological perception may also be critical for real-life vocal production learning.
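The idea of recognition-guided exploration described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `synthesise`, `recognise`, `learn_word`, the two-dimensional "acoustic" space and the prototype values are all hypothetical stand-ins chosen for illustration, and simple random hill climbing is assumed in place of whatever optimisation the study used. The sketch only shows the general loop: perturb articulatory parameters, synthesise, and keep the change if a recogniser assigns the target word a higher probability.

```python
import math
import random

# Hypothetical word prototypes in a toy 2-D "acoustic" space (made-up values).
PROTOTYPES = {"bat": (0.2, 0.8), "bet": (0.5, 0.5), "bit": (0.7, 0.3)}


def synthesise(params):
    """Toy stand-in for an articulatory synthesiser.

    Maps two unbounded articulatory parameters to bounded acoustic features.
    """
    return tuple(math.tanh(p) for p in params)


def recognise(features):
    """Toy stand-in for a speech recogniser.

    Returns a probability for each word: a softmax over negative scaled
    squared distances between the features and each word prototype.
    """
    scores = {
        word: math.exp(-10.0 * sum((f - p) ** 2 for f, p in zip(features, proto)))
        for word, proto in PROTOTYPES.items()
    }
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}


def learn_word(target, steps=2000, step_size=0.1, seed=0):
    """Random exploration that keeps changes the recogniser prefers."""
    rng = random.Random(seed)
    params = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    best = recognise(synthesise(params))[target]
    for _ in range(steps):
        # Explore: perturb the current articulatory parameters.
        candidate = [p + rng.gauss(0.0, step_size) for p in params]
        prob = recognise(synthesise(candidate))[target]
        # Keep the candidate only if the target word is recognised better.
        if prob > best:
            params, best = candidate, prob
    return params, best


if __name__ == "__main__":
    learned_params, probability = learn_word("bat")
    print(learned_params, probability)
```

Note that the feedback signal here is the recogniser's probability for the target word rather than a distance to an adult reference recording, which is the contrast the abstract draws between recognition-guided learning and direct imitation.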

Source journal
CiteScore: 3.50
Self-citation rate: 26.30%
Articles published: 49
Journal description: The Journal of Phonetics publishes papers of an experimental or theoretical nature that deal with phonetic aspects of language and linguistic communication processes. Papers dealing with technological and/or pathological topics, or papers of an interdisciplinary nature are also suitable, provided that linguistic-phonetic principles underlie the work reported. Regular articles, review articles, and letters to the editor are published. Themed issues are also published, devoted entirely to a specific subject of interest within the field of phonetics.