Learnability of English diphthongs: One dynamic target vs. two static targets

Speech Communication (IF 3.0; CAS Zone 3, Computer Science; JCR Q2, Acoustics) Pub Date: 2025-05-01 Epub Date: 2025-03-05 DOI: 10.1016/j.specom.2025.103225

Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Santitham Prom-on, Peter Birkholz, Yi Xu

Speech Communication, Volume 170, Article 103225. https://www.sciencedirect.com/science/article/pii/S0167639325000408

Citations: 0

Abstract

As vowels with intrinsic movements, diphthongs are among the most elusive sounds of speech. Previous research has characterized diphthongs as a combination of two vowels, a vowel followed by a formant transition, or a constant rate of formant change. These accounts are based on acoustic patterns, perceptual cues, and either acoustic or articulatory synthesis, but no consensus has been reached. In this study, we investigate the nature of diphthongs by exploring how they can be acquired through vocal learning. The acquisition is simulated by a three-dimensional (3D) vocal tract model with built-in target approximation dynamics, which can learn articulatory targets of phonetic categories under the guidance of a speech recognizer. The simulation attempted to learn to articulate diphthong-embedded monosyllabic English words with either a single dynamic target or two static targets, and the learned synthetic words were presented to native listeners for identification. The results showed that diphthongs learned with dynamic targets were consistently more intelligible across variable durations than those learned with two static targets, with the sole exception of /aɪ/. From the perspective of learnability, therefore, English diphthongs are likely unitary vowels with dynamic targets.
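The contrast between one dynamic target and two static targets can be illustrated with a toy simulation. The sketch below is not the paper's 3D vocal tract model or its target approximation implementation. It assumes a single normalized articulatory dimension, a first-order approach to a linear target, and forward-Euler integration. The function name `approach_targets` and every parameter value are hypothetical and chosen only for illustration.

```python
def approach_targets(targets, x0=0.0, rate=40.0, dt=0.001):
    """Simulate a first-order approach to a sequence of linear targets.

    Each target is a tuple (duration_s, offset, slope). Within its interval
    the target value is offset + slope * t_local, so slope == 0 gives a
    static target and slope != 0 gives a dynamic one. The state carries over
    across target boundaries, which keeps the trajectory continuous.
    This is a simplification assumed for illustration only.
    """
    x = x0
    trajectory = [x]
    for duration, offset, slope in targets:
        steps = int(round(duration / dt))
        for i in range(steps):
            t_local = i * dt
            target = offset + slope * t_local
            # Forward-Euler step of dx/dt = rate * (target - x)
            x += rate * (target - x) * dt
            trajectory.append(x)
    return trajectory


# Hypothetical diphthong lasting 0.2 s on one normalized dimension.
# Two static targets: hold near 0.0 for 0.1 s, then near 1.0 for 0.1 s.
static_traj = approach_targets([(0.1, 0.0, 0.0), (0.1, 1.0, 0.0)], x0=0.2)

# One dynamic target: a ramp from 0.0 rising at 5.0 units per second.
dynamic_traj = approach_targets([(0.2, 0.0, 5.0)], x0=0.2)

print(f"static end:  {static_traj[-1]:.3f}")
print(f"dynamic end: {dynamic_traj[-1]:.3f}")
```

In this toy setting the two-target version settles toward each static value in turn, while the single dynamic target produces a continuous glide that trails the moving target by a roughly constant lag. Shortening the durations changes the two trajectories differently, which gives an intuition for why the study compares the two representations across variable durations.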




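Source journal

Speech Communication (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Articles published: (not listed)

Review time: 19.2 weeks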

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.