Decode, move and speak! Self-supervised learning of speech units, gestures, and sounds relationships using vocal imitation

Computational Linguistics · IF 9.3 · CAS Q2 (Computer Science) · Published: 2024-07-30 · DOI: 10.1162/coli_a_00532
Marc-Antoine Georges, Marvin Lavechin, Jean-Luc Schwartz, Thomas Hueber

Abstract

Speech learning encompasses mastering a complex motor system to produce speech sounds from articulatory gestures while simultaneously uncovering discrete units that provide entry to the linguistic system. Remarkably, children acquire these associations between speech sounds, articulatory gestures, and linguistic units in a weakly supervised manner, without the need for explicit labeling of auditory inputs or access to target articulatory gestures. This study uses self-supervised deep learning to investigate the respective roles of sounds, gestures, and linguistic units in speech acquisition and control. In a first experiment, we analysed the quantized representations learned by vector-quantized variational autoencoders (VQ-VAE) from ground truth acoustic and articulatory data using ABX tests. We show an interesting complementarity between acoustic and articulatory modalities that may help in the discovery of phonemes. In a second experiment, we introduce a computational agent that repeats auditory speech inputs by controlling a virtual vocal apparatus. This agent integrates an articulatory synthesizer capable of reproducing diverse speech stimuli from interpretable parameters, along with two internal models implementing the articulatory-to-acoustic (forward) and acoustic-to-articulatory (inverse) mapping, respectively. Additionally, two inductive biases are used to regularize the ill-posed acoustic-to-articulatory inverse mapping. In line with the first experiment, we explore the complementarity between the auditory input and the articulatory parameters inferred by the agent. We also evaluate the impact of discretizing auditory inputs using VQ-VAE. While the majority of the agent’s productions are intelligible (according to perceptual evaluations), our analysis highlights inconsistencies in the underlying articulatory trajectories. 
In particular, we show that the agent’s productions only partially reproduce the complementarity between the auditory and articulatory modalities observed in humans.
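To make the first experiment's machinery concrete, here is a minimal sketch (not the authors' code) of the two ingredients it combines: the codebook lookup at the heart of VQ-VAE quantization, and a toy ABX discriminability test run on the resulting discrete codes. The feature dimensionality, codebook size, alignment-by-truncation, and Euclidean distance are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch: VQ-VAE-style codebook quantization plus a toy ABX
# test on the quantized representations. All sizes and the distance
# metric are assumptions for the demo, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    # (T, D) frames vs. (K, D) codebook -> (T, K) pairwise distances
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def abx_score(a, b, x, codebook):
    """Toy ABX trial: X counts as correct if its quantized sequence is
    closer (mean frame-wise distance) to A's than to B's."""
    ca, cb, cx = (codebook[quantize(f, codebook)] for f in (a, b, x))
    T = min(len(ca), len(cb), len(cx))  # crude alignment by truncation
    da = np.linalg.norm(cx[:T] - ca[:T], axis=1).mean()
    db = np.linalg.norm(cx[:T] - cb[:T], axis=1).mean()
    return da < db

codebook = rng.normal(size=(64, 13))    # K=64 codes, 13-dim features
a = rng.normal(loc=1.0, size=(20, 13))  # a token of category 1 (A)
b = rng.normal(loc=-1.0, size=(20, 13)) # a token of category 2 (B)
x = rng.normal(loc=1.0, size=(18, 13))  # another category-1 token (X)
print(abx_score(a, b, x, codebook))
```

In practice ABX scores are aggregated over many (A, B, X) triplets per phoneme contrast; a representation that preserves the contrast yields accuracy well above chance (0.5).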
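The abstract notes that the acoustic-to-articulatory inverse mapping is ill-posed: many articulatory configurations can produce the same sound, so inversion needs regularization. The sketch below illustrates the principle only, with a deliberately many-to-one toy forward map and a norm penalty standing in for the paper's inductive biases; the forward map, penalty weight, and optimizer are all assumptions.

```python
# Illustrative sketch of regularized inversion of a many-to-one
# articulatory-to-acoustic map. Not the paper's model: the forward map,
# regularizer, and gradient-descent settings are demo assumptions.
import numpy as np

def forward(theta):
    """Toy forward map: 2 articulatory params -> 1 acoustic value.
    Many-to-one: every (t0, t1) with the same sum sounds identical."""
    return theta[0] + theta[1]

def invert(target, lam=0.01, lr=0.1, steps=500):
    """Gradient descent on (forward(theta) - target)^2 + lam*||theta||^2.
    Without lam the problem has infinitely many solutions; the penalty
    selects the minimum-norm one."""
    theta = np.zeros(2)
    for _ in range(steps):
        err = forward(theta) - target
        grad = 2 * err * np.ones(2) + 2 * lam * theta  # both terms' gradients
        theta -= lr * grad
    return theta

theta = invert(1.0)
# The penalty breaks the tie: both parameters settle near 0.5,
# splitting the target evenly.
print(np.round(theta, 2))
```

The agent in the paper uses two learned internal models (forward and inverse) rather than per-utterance optimization, but the role of the inductive biases is analogous: they pick plausible articulatory trajectories out of the infinite set that matches the audio.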
Journal description: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.