Estimating the severity of dysarthria, a speech disorder caused by neurological conditions, is clinically important: it supports diagnosis, early detection, and personalized treatment. Significant progress has been made in leveraging self-supervised learning (SSL) models as feature extractors for various classification tasks, demonstrating their effectiveness. Building on this, this paper examines whether all features extracted from SSL models are necessary for optimal dysarthria severity classification from speech. We focused on layer-wise feature analysis of one base model, Wav2Vec2-base, and four large models, Wav2Vec2-large, HuBERT-large, Data2Vec-large, and WavLM-large, using a Convolutional Neural Network (CNN) classifier with mel-frequency cepstral coefficient (MFCC) features as a baseline. Experiments showed that the later transformer layers of the SSL models were more effective for dysarthria severity classification than the earlier layers, as the later layers better capture articulation and complex temporal patterns refined from the middle layers. More specifically, the analysis revealed that embeddings from transformer encoder layer 23 of HuBERT-large yielded the best performance among all five models, possibly due to HuBERT's hierarchical learning from unsupervised clustering. To further assess whether all feature dimensions are important, we examined the impact of varying the feature dimensionality. Our findings indicated that reducing the dimensionality from 1024 to 32 led to further improvements in accuracy, indicating that not all features are necessary for effective severity classification. Additionally, feature fusion was conducted by combining the optimal reduced dimensions from the best-performing layer with varying numbers of MFCC feature dimensions, resulting in further performance improvements. The highest accuracy of 70.44% was achieved by combining 32 selected dimensions from the HuBERT-large model with 21 MFCC feature dimensions.
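The layer selection and dimensionality reduction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for real HuBERT-large hidden states (which would come from, e.g., a model run with hidden-state outputs enabled), and an SVD-based PCA stands in for whatever reduction method the authors used.

```python
import numpy as np

# Stand-in for HuBERT-large hidden states: 24 transformer layers plus the
# CNN feature-encoder output at index 0, hidden size 1024, T frames.
rng = np.random.default_rng(0)
num_frames, hidden = 50, 1024
hidden_states = rng.standard_normal((25, num_frames, hidden))

# Select transformer encoder layer 23 and mean-pool over time
# to obtain a single utterance-level embedding.
layer23 = hidden_states[23]            # (T, 1024)
utt_embedding = layer23.mean(axis=0)   # (1024,)

def pca_reduce(X, k):
    """Reduce X (n_samples, n_features) to k dimensions via SVD (a simple PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

# A batch of utterance embeddings reduced from 1024 to 32 dimensions,
# mirroring the dimensionality found to improve accuracy.
batch = rng.standard_normal((100, hidden))
reduced = pca_reduce(batch, 32)
print(reduced.shape)  # (100, 32)
```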
The fusion of HuBERT-large (32) and MFCC (21) features outperformed the HuBERT-large baseline by 6.36% and the MFCC baseline by 15.28% in absolute terms. Furthermore, combining the fused features with handcrafted features from the articulatory, prosodic, phonatory, and respiratory domains increased the classification accuracy to 73.53%, yielding a more robust representation for dysarthria severity classification. Probing analyses of articulatory and prosodic features supported the choice of the best-performing HuBERT layer, while the low correlation with the handcrafted features highlighted their complementary contribution. Finally, comparative t-SNE visualizations further validated the effectiveness of the proposed feature fusion, demonstrating clearer class separability.
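The fusion step above can be illustrated with a simple early-fusion (concatenation) sketch. The arrays below are placeholders for the actual extracted features; whether the paper fuses by concatenation or another scheme is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_utts = 100

# Placeholder per-utterance features: 32 reduced HuBERT-large dimensions
# and 21 MFCC dimensions, as in the best-performing configuration.
ssl_feats = rng.standard_normal((n_utts, 32))
mfcc_feats = rng.standard_normal((n_utts, 21))

# Early fusion by concatenation: a 53-dimensional vector per utterance,
# which would then be fed to the CNN classifier.
fused = np.concatenate([ssl_feats, mfcc_feats], axis=1)
print(fused.shape)  # (100, 53)
```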
