Artificial vocal learning guided by speech recognition: What it may tell us about how children learn to speak

IF 2.4 1区文学 0 LANGUAGE & LINGUISTICS Journal of Phonetics Pub Date : 2024-07-01 Epub Date: 2024-06-20 DOI:10.1016/j.wocn.2024.101338

Anqi Xu , Daniel R. van Niekerk , Branislav Gerazov , Paul Konstantin Krug , Peter Birkholz , Santitham Prom-on , Lorna F. Halliday , Yi Xu

{"title":"Artificial vocal learning guided by speech recognition: What it may tell us about how children learn to speak","authors":"Anqi Xu , Daniel R. van Niekerk , Branislav Gerazov , Paul Konstantin Krug , Peter Birkholz , Santitham Prom-on , Lorna F. Halliday , Yi Xu","doi":"10.1016/j.wocn.2024.101338","DOIUrl":null,"url":null,"abstract":"<div><p>It has long been a mystery how children learn to speak without formal instructions. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between children’s vocalisations and that of adults due to age-related anatomical differences. Here we show that vocal learning can be successfully simulated via recognition-guided vocal exploration without explicit speaker normalisation. We trained an articulatory synthesiser with three-dimensional vocal tract models of an adult and two child configurations of different ages to learn monosyllabic English words consisting of CVC syllables, based on coarticulatory dynamics and two kinds of auditory feedback: (i) acoustic features to simulate universal phonetic perception (or direct imitation), and (ii) a deep-learning-based speech recogniser to simulate native-language phonological perception. Native listeners were invited to evaluate the learned synthetic speech with natural speech as baseline reference. Results show that the English words trained with the speech recogniser were more intelligible than those trained with acoustic features, sometimes close to natural speech. The successful simulation of vocal learning in this study suggests that a combination of coarticulatory dynamics and native-language phonological perception may be critical also for real-life vocal production learning.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"105 ","pages":"Article 101338"},"PeriodicalIF":2.4000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0095447024000445/pdfft?md5=941cb45273d2db483f6143ef8085a741&pid=1-s2.0-S0095447024000445-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Phonetics","FirstCategoryId":"98","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0095447024000445","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/6/20 0:00:00","PubModel":"Epub","JCR":"0","JCRName":"LANGUAGE & LINGUISTICS","Score":null,"Total":0}

引用次数: 0

Abstract

It has long been a mystery how children learn to speak without formal instructions. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between children’s vocalisations and that of adults due to age-related anatomical differences. Here we show that vocal learning can be successfully simulated via recognition-guided vocal exploration without explicit speaker normalisation. We trained an articulatory synthesiser with three-dimensional vocal tract models of an adult and two child configurations of different ages to learn monosyllabic English words consisting of CVC syllables, based on coarticulatory dynamics and two kinds of auditory feedback: (i) acoustic features to simulate universal phonetic perception (or direct imitation), and (ii) a deep-learning-based speech recogniser to simulate native-language phonological perception. Native listeners were invited to evaluate the learned synthetic speech with natural speech as baseline reference. Results show that the English words trained with the speech recogniser were more intelligible than those trained with acoustic features, sometimes close to natural speech. The successful simulation of vocal learning in this study suggests that a combination of coarticulatory dynamics and native-language phonological perception may be critical also for real-life vocal production learning.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

语音识别指导下的人工发声学习：它能告诉我们儿童如何学习说话

长期以来，儿童如何在没有正式指令的情况下学习说话一直是个谜。以往的研究利用计算建模，通过直接模仿或照顾者的反馈来模拟发声学习，从而帮助解开这个谜团，但在克服说话者正常化问题上遇到了困难，即由于年龄相关的解剖学差异，儿童的发声与成人的发声存在差异。在这里，我们展示了通过识别引导的发声探索可以成功模拟发声学习，而无需明确的说话者归一化。我们使用成人和两个不同年龄儿童的三维声道模型对发音合成器进行了训练，以学习由 CVC 音节组成的单音节英语单词，训练基于共发音动态和两种听觉反馈：(i) 声学特征以模拟通用语音感知（或直接模仿），(ii) 基于深度学习的语音识别器以模拟母语语音感知。我们邀请母语听者以自然语音为基线参考，对学习到的合成语音进行评估。结果显示，使用语音识别器训练的英语单词比使用声学特征训练的单词更易懂，有时甚至接近自然语音。这项研究成功地模拟了发声学习，表明共发音动态和母语语音感知的结合可能对现实生活中的发声学习也至关重要。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Phonetics Multiple-

CiteScore

3.50

自引率

26.30%

发文量

期刊介绍： The Journal of Phonetics publishes papers of an experimental or theoretical nature that deal with phonetic aspects of language and linguistic communication processes. Papers dealing with technological and/or pathological topics, or papers of an interdisciplinary nature are also suitable, provided that linguistic-phonetic principles underlie the work reported. Regular articles, review articles, and letters to the editor are published. Themed issues are also published, devoted entirely to a specific subject of interest within the field of phonetics.