
Latest publications in Speech Communication

Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-11-14, DOI: 10.1016/j.specom.2024.103148
Siba Prasad Mishra, Pankaj Warule, Suman Deb
The primary goal of automated speech emotion recognition (SER) is to accurately and efficiently identify the specific emotion conveyed in a speech signal using machines such as computers and mobile devices. SER has remained popular among researchers for three decades largely because of its broad practical applicability: it has proven useful in medical intervention, safety and surveillance, online search, road safety, customer relationship management, human–machine interaction, and numerous other domains. Many researchers have sought to improve emotion classification by combining different attributes, applying different feature selection techniques, or designing hybrid models that use more than one classifier. In our study, we used a novel technique, the fixed frequency range empirical wavelet transform (FFREWT) filter bank decomposition method, to extract features and then used those features to identify the emotion conveyed in the speech signal. The FFREWT filter bank method segments each speech signal frame (SSF) into several sub-signals or modes. From each FFREWT-decomposed mode we extracted features such as mel frequency cepstral coefficients (MFCC), approximate entropy (ApEn), permutation entropy (PrEn), and increment entropy (IrEn). We then fed different combinations of the proposed FFREWT-based feature sets to a deep neural network (DNN) classifier to classify speech emotion. The proposed method achieves emotion classification accuracies of 89.35%, 84.69%, and 100% with the combined FFREWT-based features (MFCC + ApEn + PrEn + IrEn) on the EMO-DB, EMOVO, and TESS datasets, respectively. Comparison with other methods shows that the proposed FFREWT-based feature combinations with a DNN classifier outperform state-of-the-art SER methods.
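The entropy measures listed in this abstract (ApEn, PrEn, IrEn) are standard nonlinear signal descriptors rather than contributions of the paper. Purely as an illustration, and assuming nothing about the authors' implementation or the FFREWT decomposition itself (which is not shown), a minimal Python sketch of permutation entropy for one decomposed sub-signal could look like this:

import numpy as np
from math import factorial

def permutation_entropy(x, order=3, delay=1):
    # Normalized permutation entropy of a 1-D signal (Bandt & Pompe, 2002).
    x = np.asarray(x, dtype=float)
    n_windows = len(x) - (order - 1) * delay
    # Each window of `order` samples is reduced to its ordinal pattern (rank order).
    patterns = [tuple(np.argsort(x[i:i + order * delay:delay])) for i in range(n_windows)]
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    # Shannon entropy of the pattern distribution, normalized to [0, 1] by log(order!).
    return float(-np.sum(p * np.log(p)) / np.log(factorial(order)))

# Toy usage: entropy of one decomposed sub-signal (a random frame stands in here).
frame = np.random.default_rng(0).standard_normal(400)
print(permutation_entropy(frame, order=3, delay=1))

In the pipeline described above, such a value would be computed per decomposed mode and combined with the MFCCs before classification; the frame length and parameters here are placeholders.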
{"title":"Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition","authors":"Siba Prasad Mishra,&nbsp;Pankaj Warule,&nbsp;Suman Deb","doi":"10.1016/j.specom.2024.103148","DOIUrl":"10.1016/j.specom.2024.103148","url":null,"abstract":"<div><div>The primary goal of automated speech emotion recognition (SER) is to accurately and effectively identify each specific emotion conveyed in a speech signal utilizing machines such as computers and mobile devices. The widespread recognition of the popularity of SER among academics for three decades is mainly attributed to its broad application in practical scenarios. The utilization of SER has proven to be beneficial in various fields, such as medical intervention, bolstering safety strategies, conducting vigil functions, enhancing online search engines, enhancing road safety, managing customer relationships, strengthening the connection between machines and humans, and numerous other domains. Many researchers have used diverse methodologies, such as the integration of different attributes, the use of different feature selection techniques, and designed a hybrid or complex model using more than one classifier, to augment the effectiveness of emotion classification. In our study, we used a novel technique called the fixed frequency range empirical wavelet transform (FFREWT) filter bank decomposition method to extract the features, and then used those features to accurately identify each and every emotion in the speech signal. The FFREWT filter bank method segments the speech signal frame (SSF) into many sub-signals or modes. We used each FFREWT-based decomposed mode to get features like the mel frequency cepstral coefficient (MFCC), approximate entropy (ApEn), permutation entropy (PrEn), and increment entropy (IrEn). We then used the different combinations of the proposed FFREWT-based feature sets and the deep neural network (DNN) classifier to classify the speech emotion. Our proposed method helps to achieve an emotion classification accuracy of 89.35%, 84.69%, and 100% using the combinations of the proposed FFREWT-based feature (MFCC + ApEn + PrEn + IrEn) for the EMO-DB, EMOVO, and TESS datasets, respectively. Our experimental results were compared with the other methods, and we found that the proposed FFREWT-based feature combinations with a DNN classifier performed better than the state-of-the-art methods in SER.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"166 ","pages":"Article 103148"},"PeriodicalIF":2.4,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142661270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-11-10, DOI: 10.1016/j.specom.2024.103149
Yida Huang, Qian Shen, Jianfen Ma
Existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone, which fail to adequately represent speech features during feature extraction and therefore yield poor detection and generalization performance. To address this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. It integrates a convolutional module into the Transformer framework to strengthen local feature modeling, enabling the model to extract local and global information from speech signals simultaneously. In addition, to mitigate the loss or degradation of semantic information that occurs during feature fusion in traditional feature pyramid networks, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN), which fuses multi-scale features and improves generalization to unknown attacks. Experiments on the ASVspoof 2019 LA dataset show that the proposed method achieves an equal error rate (EER) of 1.61% and a minimum tandem detection cost function (min t-DCF) of 0.045, improving detection performance while enhancing generalization against unknown spoofing attacks. In particular, it yields a substantial improvement in detecting the most challenging A17 attack.
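For reference, the equal error rate reported here is the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below shows one common way to compute it from detection scores using NumPy and scikit-learn; the labels and synthetic scores are illustrative, not the authors' evaluation code, and the min t-DCF, which additionally involves ASV errors and cost parameters, is omitted:

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # Equal error rate from binary labels (1 = bona fide, 0 = spoof) and detector scores.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                          # miss (false-rejection) rate
    idx = np.argmin(np.abs(fnr - fpr))       # point where the two error curves cross
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage with synthetic scores; real use would take scores from the spoofing detector.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(100), np.zeros(100)])
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)])
print(f"EER = {compute_eer(labels, scores):.3f}")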
{"title":"AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection","authors":"Yida Huang,&nbsp;Qian Shen,&nbsp;Jianfen Ma","doi":"10.1016/j.specom.2024.103149","DOIUrl":"10.1016/j.specom.2024.103149","url":null,"abstract":"<div><div>The existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone, which fail to adequately represent speech features during feature extraction, resulting in poor detection and generalization performance of the models. To solve this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. This method integrates a convolutional module into the Transformer framework to enhance its capacity for local feature modeling, enabling to extract both local and global information from speech signals simultaneously. Besides, to mitigate the issue of semantic information loss or degradation in traditional feature pyramid networks during feature fusion, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN) to fuse multi-scale features and improve generalization of detecting unknown attacks. Our experiments conducted on the ASVspoof 2019 LA dataset demonstrate that our proposed method achieved the equal error rate (EER) of 1.61 % and the minimum tandem detection cost function (min t-DCF) of 0.045, effectively improving the detection performance of the model while enhancing its generalization capability against unknown spoofing attacks. In particular, it demonstrates substantial performance improvement in detecting the most challenging A17 attack.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"166 ","pages":"Article 103149"},"PeriodicalIF":2.4,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142661268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A robust temporal map of speech monitoring from planning to articulation
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-11-01, DOI: 10.1016/j.specom.2024.103146
Lydia Dorokhova, Benjamin Morillon, Cristina Baus, Pascal Belin, Anne-Sophie Dubarry, F.-Xavier Alario, Elin Runnqvist
Speakers continuously monitor their own speech to optimize fluent production, but the precise timing and underlying variables influencing speech monitoring remain insufficiently understood. Through two EEG experiments, this study aimed to provide a comprehensive temporal map of monitoring processes ranging from speech planning to articulation.
In both experiments, participants were primed to switch the consonant onsets of target word pairs read aloud, eliciting speech errors of either lexical or articulatory-phonetic (AP) origin. Experiment I used pairs of the same stimulus words, creating lexical or non-lexical errors when the initial consonants were switched, with the degree of shared AP features not fully balanced but accounted for in the analysis. Experiment II followed a similar methodology but used different word pairs for the lexical and non-lexical conditions, fully orthogonalizing the number of shared AP features.
Because error probability is higher in trials primed to result in lexical rather than non-lexical errors, and in AP-close rather than AP-distant errors, these conditions require more monitoring. Similarly, error trials require more monitoring than correct trials. We therefore used high versus low error probability on correct trials, and errors versus correct trials, as indices of monitoring.
Across both experiments, we observed that on correct trials, lexical error probability effects were present during initial stages of speech planning, while AP error probability effects emerged during speech motor preparation. In contrast, error trials differed from correct utterances in both early and late speech motor preparation and during articulation. These findings suggest that (a) response conflict on ultimately correct trials does not persist during articulation; (b) the time course of response conflict is restricted to the window during which a given linguistic level is task-relevant (early on for response appropriateness-related variables and later for articulation-relevant variables); and (c) monitoring during the response is primarily triggered by pre-response monitoring failure. These results support the view that monitoring in language production is temporally distributed and relies on multiple mechanisms.
{"title":"A robust temporal map of speech monitoring from planning to articulation","authors":"Lydia Dorokhova ,&nbsp;Benjamin Morillon ,&nbsp;Cristina Baus ,&nbsp;Pascal Belin ,&nbsp;Anne-Sophie Dubarry ,&nbsp;F.-Xavier Alario ,&nbsp;Elin Runnqvist","doi":"10.1016/j.specom.2024.103146","DOIUrl":"10.1016/j.specom.2024.103146","url":null,"abstract":"<div><div>Speakers continuously monitor their own speech to optimize fluent production, but the precise timing and underlying variables influencing speech monitoring remain insufficiently understood. Through two EEG experiments, this study aimed to provide a comprehensive temporal map of monitoring processes ranging from speech planning to articulation.</div><div>In both experiments, participants were primed to switch the consonant onsets of target word pairs read aloud, eliciting speech errors of either lexical or articulatory-phonetic (AP) origin. Experiment I used pairs of the same stimuli words, creating lexical or non-lexical errors when switching initial consonants, with the degree of shared AP features not fully balanced but considered in the analysis. Experiment II followed a similar methodology but used different words in pairs for the lexical and non-lexical conditions, fully orthogonalizing the number of shared AP features.</div><div>As error probability is higher in trials primed to result in lexical versus non-lexical errors and AP-close compared to AP-distant errors, more monitoring is required for these conditions. Similarly, error trials require more monitoring compared to correct trials. We used high versus low error probability on correct trials and errors versus correct trials as indices of monitoring.</div><div>Across both experiments, we observed that on correct trials, lexical error probability effects were present during initial stages of speech planning, while AP error probability effects emerged during speech motor preparation. In contrast, error trials showed differences from correct utterances in both early and late speech motor preparation and during articulation. These findings suggest that (a) response conflict on ultimately correct trials does not persist during articulation; (b) the timecourse of response conflict is restricted to the time window during which a given linguistic level is task-relevant (early on for response appropriateness-related variables and later for articulation-relevant variables); and (c) monitoring during the response is primarily triggered by pre-response monitoring failure. These results support that monitoring in language production is temporally distributed and rely on multiple mechanisms.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103146"},"PeriodicalIF":2.4,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142586499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The combined effects of bilingualism and musicianship on listeners’ perception of non-native lexical tones
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-11-01, DOI: 10.1016/j.specom.2024.103147
Liang Zhang, Jiaqiang Zhu, Jing Shao, Caicai Zhang
Non-native lexical tone perception can be affected by listeners’ musical or linguistic experience, but it remains unclear whether these factors have combined effects and how such effects are modulated by different types of non-native tones. This study adopted an orthogonal design with four participant groups, Mandarin-L1 monolinguals and Mandarin-L1/Cantonese-L2 bilinguals, each with or without musical training, to investigate the effects of bilingualism and musicianship on the perception of non-native lexical tones. The four closely matched groups, each comprising 20 participants, completed a modified ABX discrimination task on lexical tones of Teochew, a language unknown to all participants whose tone inventory includes level, contour, and checked tones. The tone perceptual sensitivity index d' and response times were calculated and compared using linear mixed-effects models. Results for tone sensitivity and response time revealed that all groups were more sensitive to contour tones than to level tones, indicating an effect of the native phonology of Mandarin tones on non-native tone perception. Moreover, compared with monolinguals, bilinguals obtained higher d' values when discriminating non-native tones, and musically trained bilinguals responded faster than their non-musician peers. This indicates that bilinguals enjoy a perceptual advantage in non-native tone perception, with musicianship further enhancing this advantage. These findings extend prior studies by showing that an L2 with a more intricate tone inventory than the L1 can facilitate listeners’ non-native tone perception. Pedagogical implications are discussed.
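The sensitivity index d' used in this study is the standard signal-detection measure: the difference between the z-transformed hit rate and false-alarm rate. A small illustrative computation follows; the correction for extreme rates and the toy counts are assumptions, not the authors' analysis code:

from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    # d' = z(hit rate) - z(false-alarm rate), with a simple correction so rates of 0 or 1 stay finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Toy counts for one listener: responses on "different" vs. "same" ABX trials.
print(round(d_prime(hits=44, misses=16, false_alarms=12, correct_rejections=48), 2))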
{"title":"The combined effects of bilingualism and musicianship on listeners’ perception of non-native lexical tones","authors":"Liang Zhang ,&nbsp;Jiaqiang Zhu ,&nbsp;Jing Shao ,&nbsp;Caicai Zhang","doi":"10.1016/j.specom.2024.103147","DOIUrl":"10.1016/j.specom.2024.103147","url":null,"abstract":"<div><div>Non-native lexical tone perception can be affected by listeners’ musical or linguistic experience, but it remains unclear of whether there will be combined effects and how these impacts will be modulated by different types of non-native tones. This study adopted an orthogonal design with four participant groups, namely, Mandarin-L1 monolinguals and Mandarin-L1 and Cantonese-L2 bilinguals with or without musical training, to investigate effects of bilingualism and musicianship on perception of non-native lexical tones. The closely matched four groups, each encompassing an equal number of 20 participants, attended a modified ABX discrimination task of lexical tones of Teochew, which was unknown to all participants and consists of multiple tone types of level tones, contour tones, and checked tones. The tone perceptual sensitivity index of <em>d’</em> values and response times were calculated and compared using linear mixed-effects models. Results on tone sensitivity and response time revealed that all groups were more sensitive to contour tones than level tones, indicating the effect of native phonology of Mandarin tones on non-native tone perception. Besides, as compared to monolinguals, bilinguals obtained a higher <em>d’</em> value when discriminating non-native tones, and musically trained bilinguals responded faster than their non-musician peers. It indicates that bilinguals enjoy a perceptual advantage in non-native tone perception, with musicianship further enhancing this advantage. This extends prior studies by showing that an L2 with a more intricate tone inventory than L1 could facilitate listeners’ non-native tone perception. The pedagogical implications were discussed.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103147"},"PeriodicalIF":2.4,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142656235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluating the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar listeners
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-10-23, DOI: 10.1016/j.specom.2024.103145
Benjamin O’Brien, Christine Meunier, Alain Ghio
A study was conducted to evaluate the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar naive listeners. Speech recordings made by twelve male native-French speakers were organised into three groups of four (two in-set, one out-of-set). Two groups of listeners participated: one group was familiar with one in-set speaker group, while both groups were unfamiliar with the remaining in-set and out-of-set speaker groups. Pitch and speech tempo were continuously modified such that the first 75% of words spoken were altered, with the percentage of modification beginning at 100% and decaying linearly to 0%. Pitch modifications began at ±600 cents, while speech tempo modifications began with word durations scaled 1:2 or 3:2. Participants completed a series of “go/no-go” trials in which they were presented with a modified speech recording paired with a face and were asked to respond as quickly as possible if they judged the stimulus to be continuous. The major finding was that listeners overcame higher percentages of modification when presented with familiar-speaker stimuli. Familiar listeners outperformed unfamiliar listeners when evaluating continuously modified speech-tempo stimuli; however, this effect was speaker-specific for pitch-modified stimuli. Contrasting effects of modification direction were also observed. The findings suggest that pitch is more useful to listeners when verifying familiar and unfamiliar voices.
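To make the stimulus manipulation concrete, the sketch below encodes the linear decay of the modification over the first 75% of words and applies a strength-scaled pitch shift (up to ±600 cents) and duration change to a single word's waveform using librosa's generic utilities. The word segmentation, resynthesis method, and the exact direction of the 1:2 / 3:2 duration scaling are not specified here, so treat this only as a schematic of the design, not the authors' processing chain:

import numpy as np
import librosa

def modification_schedule(n_words, ramp_fraction=0.75):
    # Per-word modification strength: 100% at the first word, decaying linearly to 0%
    # by the end of the first `ramp_fraction` of words; remaining words stay unmodified.
    ramp_len = int(np.ceil(n_words * ramp_fraction))
    strengths = np.zeros(n_words)
    strengths[:ramp_len] = np.linspace(1.0, 0.0, ramp_len)
    return strengths

def modify_word(word_audio, sr, strength, max_cents=600.0, max_duration_scale=1.5):
    # Pitch shift scaled by `strength` (600 cents = 6 semitones at full strength; use a
    # negative max_cents for downward shifts), then a strength-scaled duration change.
    shifted = librosa.effects.pitch_shift(word_audio, sr=sr, n_steps=(max_cents / 100.0) * strength)
    duration_scale = 1.0 + (max_duration_scale - 1.0) * strength
    # librosa's rate is a speed factor, i.e. the inverse of the duration scaling.
    return librosa.effects.time_stretch(shifted, rate=1.0 / duration_scale)

print(modification_schedule(8))  # e.g. an 8-word utterance: modification ramps over the first 6 words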
{"title":"Evaluating the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar listeners","authors":"Benjamin O’Brien ,&nbsp;Christine Meunier ,&nbsp;Alain Ghio","doi":"10.1016/j.specom.2024.103145","DOIUrl":"10.1016/j.specom.2024.103145","url":null,"abstract":"<div><div>A study was conducted to evaluate the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar naive listeners. Speech recordings made by twelve male, native-French speakers were organised into three groups of four (two in-set, one out-of-set). Two groups of listeners participated, where one group was familiar with one in-set speaker group, while both groups were unfamiliar with the remaining in- and out-of-set speaker groups. Pitch and speech tempo were continuously modified, such that the first 75% of words spoken were modified with percentages of modification beginning at 100% and decaying linearly to 0%. Pitch modifications began at <span><math><mo>±</mo></math></span> 600 cents, while speech tempo modifications started with word durations scaled 1:2 or 3:2. Participants evaluated a series of “go/no-go” task trials, where they were presented a modified speech recording with a face and tasked to respond as quickly as possible if they judged the stimuli to be continuous. The major findings revealed listeners overcame higher percentages of modification when presented familiar speaker stimuli. Familiar listeners outperformed unfamiliar listeners when evaluating continuously modified speech tempo stimuli, however, this effect was speaker-specific for pitch modified stimuli. Contrasting effects of modification direction were also observed. The findings suggest pitch is more useful to listeners when verifying familiar and unfamiliar voices.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103145"},"PeriodicalIF":2.4,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142527431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-10-09, DOI: 10.1016/j.specom.2024.103141
Abigail Anne Kressner, Kirsten Maria Jensen-Rico, Johannes Kizach, Brian Kai Loong Man, Anja Kofoed Pedersen, Lars Bramsløw, Lise Bruun Hansen, Laura Winther Balling, Brent Kirkwood, Tobias May
A typical speech-in-noise experiment in a research and development setting can easily contain as many as 20 conditions, or even more, and often requires at least two test points per condition. A sentence test with enough sentences to make this amount of testing possible without repetition does not yet exist in Danish. Thus, a new corpus has been developed to facilitate the creation of a sentence test that is large enough to address this need. The corpus itself is made up of audio and audio-visual recordings of 1200 linguistically balanced sentences, all of which are spoken by two female and two male talkers. The sentences were constructed using a novel, template-based method that facilitated control over both word frequency and sentence structure. The sentences were evaluated linguistically in terms of phonemic distributions, naturalness, and connotation, and thereafter, recorded, postprocessed, and rated on their audio, visual, and pronunciation qualities. This paper describes in detail the methodology employed to create and characterize this corpus.
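The template-based construction method is only described at a high level here. Purely to illustrate the general idea (fixed sentence frames controlling structure, with content slots filled from frequency-binned word lists), a toy sketch follows; the slot names, frequency bins, and English stand-in words are invented for illustration and bear no relation to the actual Danish material:

import random

# Invented, frequency-binned mini-lexicon (stand-ins; the corpus itself uses balanced Danish words).
LEXICON = {
    ("NOUN", "high"): ["house", "car", "day"],
    ("NOUN", "low"): ["heron", "ledger", "fjord"],
    ("VERB", "high"): ["sees", "takes", "finds"],
    ("ADJ", "high"): ["big", "small", "new"],
}

# One template: literal function words plus content slots tagged with word class and frequency bin.
TEMPLATE = [("The", None), ("ADJ", "high"), ("NOUN", "low"), ("VERB", "high"), ("a", None), ("NOUN", "high")]

def fill_template(template, rng=random):
    words = []
    for slot, freq_bin in template:
        if freq_bin is None:
            words.append(slot)                                   # literal word copied verbatim
        else:
            words.append(rng.choice(LEXICON[(slot, freq_bin)]))  # sample from the requested bin
    return " ".join(words)

print(fill_template(TEMPLATE))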
{"title":"A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments","authors":"Abigail Anne Kressner ,&nbsp;Kirsten Maria Jensen-Rico ,&nbsp;Johannes Kizach ,&nbsp;Brian Kai Loong Man ,&nbsp;Anja Kofoed Pedersen ,&nbsp;Lars Bramsløw ,&nbsp;Lise Bruun Hansen ,&nbsp;Laura Winther Balling ,&nbsp;Brent Kirkwood ,&nbsp;Tobias May","doi":"10.1016/j.specom.2024.103141","DOIUrl":"10.1016/j.specom.2024.103141","url":null,"abstract":"<div><div>A typical speech-in-noise experiment in a research and development setting can easily contain as many as 20 conditions, or even more, and often requires at least two test points per condition. A sentence test with enough sentences to make this amount of testing possible without repetition does not yet exist in Danish. Thus, a new corpus has been developed to facilitate the creation of a sentence test that is large enough to address this need. The corpus itself is made up of audio and audio-visual recordings of 1200 linguistically balanced sentences, all of which are spoken by two female and two male talkers. The sentences were constructed using a novel, template-based method that facilitated control over both word frequency and sentence structure. The sentences were evaluated linguistically in terms of phonemic distributions, naturalness, and connotation, and thereafter, recorded, postprocessed, and rated on their audio, visual, and pronunciation qualities. This paper describes in detail the methodology employed to create and characterize this corpus.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103141"},"PeriodicalIF":2.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Forms, factors and functions of phonetic convergence: Editorial
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-09-30, DOI: 10.1016/j.specom.2024.103142
Elisa Pellegrino, Volker Dellwo, Jennifer S. Pardo, Bernd Möbius
This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers’ acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory and perceptual evaluations of convergence, and combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.
{"title":"Forms, factors and functions of phonetic convergence: Editorial","authors":"Elisa Pellegrino ,&nbsp;Volker Dellwo ,&nbsp;Jennifer S. Pardo ,&nbsp;Bernd Möbius","doi":"10.1016/j.specom.2024.103142","DOIUrl":"10.1016/j.specom.2024.103142","url":null,"abstract":"<div><div>This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers’ acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory and perceptual evaluations of convergence, and combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103142"},"PeriodicalIF":2.4,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-09-28, DOI: 10.1016/j.specom.2024.103144
Shumit Saha, Keerthana Viswanathan, Anamika Saha, Azadeh Yadollahi
Assessment of upper airway dimensions has shown great promise for understanding the pathogenesis of obstructive sleep apnea (OSA). However, current OSA screening does not include an objective assessment of the upper airway. The upper airway can be assessed accurately with MRI or CT scans, but these are costly and not easily accessible. Acoustic pharyngometry and ultrasonography are less expensive technologies, but they require trained personnel, which limits their accessibility, especially when assessing the upper airway in a clinic environment or before surgery. In this study, we investigated the utility of vowel articulation for assessing upper airway dimensions during normal breathing. To that end, we measured the upper airway cross-sectional area (UA-XSA) by acoustic pharyngometry and then asked participants to produce 5 vowels for 3 s and recorded them with a microphone. We extracted 710 acoustic features from all vowels, compared these features with UA-XSA, and developed regression models to estimate UA-XSA. Mel frequency cepstral coefficients (MFCC) were the most dominant vowel features, accounting for 7 of the 9 features in the main feature set. Multiple regression analysis showed that combining the acoustic features with anthropometric features achieved an R² of 0.80 in estimating UA-XSA. An important advantage of acoustic analysis of vowel sounds is its simplicity, which allows easy implementation in wearable devices or mobile applications. Such acoustic-based technologies can be deployed in different clinical settings, such as the intensive care unit, and can be used in remote areas. These results could therefore support user-friendly applications that use acoustic features and demographic information to estimate UA-XSA.
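As a rough sketch of the kind of pipeline described (frame-level MFCCs summarized per vowel recording, combined with anthropometric covariates in a multiple regression), the code below uses librosa and scikit-learn; the feature summary, the covariate layout, and the synthetic data are assumptions for illustration and do not reproduce the authors' 710-feature set:

import numpy as np
import librosa
from sklearn.linear_model import LinearRegression

def vowel_mfcc_features(path, n_mfcc=13):
    # Mean and standard deviation of frame-level MFCCs for one sustained-vowel recording.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# In real use, the acoustic part of each row would come from vowel_mfcc_features(); here the
# design matrix is synthetic: 26 MFCC summaries plus 4 anthropometric covariates per participant.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 26 + 4))
y = 3.0 + X @ (rng.standard_normal(30) * 0.05) + rng.standard_normal(40) * 0.1  # stand-in UA-XSA (cm^2)

model = LinearRegression().fit(X, y)
print("R^2 on the training data:", round(model.score(X, y), 2))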
{"title":"Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study","authors":"Shumit Saha ,&nbsp;Keerthana Viswanathan ,&nbsp;Anamika Saha ,&nbsp;Azadeh Yadollahi","doi":"10.1016/j.specom.2024.103144","DOIUrl":"10.1016/j.specom.2024.103144","url":null,"abstract":"<div><div>Assessment of upper airway dimensions has shown great promise in understanding the pathogenesis of obstructive sleep apnea (OSA). However, the current screening system for OSA does not have an objective assessment of the upper airway. The assessment of the upper airway can accurately be performed by MRI or CT scans, which are costly and not easily accessible. Acoustic pharyngometry or Ultrasonography could be less expensive technologies, but these require trained personnel which makes these technologies not easily accessible, especially when assessing the upper airway in a clinic environment or before surgery. In this study, we aimed to investigate the utility of vowel articulation in assessing the upper airway dimension during normal breathing. To accomplish that, we measured the upper airway cross-sectional area (UA-XSA) by acoustic pharyngometry and then asked the participants to produce 5 vowels for 3 s and recorded them with a microphone. We extracted 710 acoustic features from all vowels and compared these features with UA-XSA and developed regression models to estimate the UA-XSA. Our results showed that Mel frequency cepstral coefficients (MFCC) were the most dominant features of vowels, as 7 out of 9 features were from MFCC in the main feature set. The multiple regression analysis showed that the combination of the acoustic features with the anthropometric features achieved an R<sup>2</sup> of 0.80 in estimating UA-XSA. The important advantage of acoustic analysis of vowel sounds is that it is simple and can be easily implemented in wearable devices or mobile applications. Such acoustic-based technologies can be accessible in different clinical settings such as the intensive care unit and can be used in remote areas. Thus, these results could be used to develop user-friendly applications to use the acoustic features and demographical information to estimate the UA-XSA.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103144"},"PeriodicalIF":2.4,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Zero-shot voice conversion based on feature disentanglement
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-09-27, DOI: 10.1016/j.specom.2024.103143
Na Guo, Jianguo Wei, Yongwei Li, Wenhuan Lu, Jianhua Tao
Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention because it can perform conversion for speakers who did not appear during training. Despite the significant progress of previous zero-shot VC methods, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information from the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. Experiments demonstrate that the proposed model outperforms several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of existing state-of-the-art models.
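The abstract does not spell out how mixed speaker layer normalization is formulated, so the sketch below only illustrates the broader, well-known idea of speaker-conditioned layer normalization: normalize the content features, then rescale and shift them with a gain and bias predicted from a speaker embedding. All dimensions, weights, and names are placeholders, not the paper's method:

import numpy as np

def speaker_conditioned_layer_norm(h, spk_emb, w_gamma, w_beta, eps=1e-5):
    # h: content features of shape (T, D); spk_emb: speaker embedding of shape (E,).
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)   # per-frame normalization of the content stream
    gamma = spk_emb @ w_gamma                # (D,) speaker-dependent gain
    beta = spk_emb @ w_beta                  # (D,) speaker-dependent bias
    return gamma * h_norm + beta

# Toy shapes: 100 content frames of dimension 256, conditioned on a 64-dimensional embedding.
T, D, E = 100, 256, 64
rng = np.random.default_rng(0)
out = speaker_conditioned_layer_norm(rng.standard_normal((T, D)), rng.standard_normal(E),
                                     rng.standard_normal((E, D)) * 0.01, rng.standard_normal((E, D)) * 0.01)
print(out.shape)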
{"title":"Zero-shot voice conversion based on feature disentanglement","authors":"Na Guo ,&nbsp;Jianguo Wei ,&nbsp;Yongwei Li ,&nbsp;Wenhuan Lu ,&nbsp;Jianhua Tao","doi":"10.1016/j.specom.2024.103143","DOIUrl":"10.1016/j.specom.2024.103143","url":null,"abstract":"<div><div>Voice conversion (VC) aims to convert the voice from a source speaker to a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention in the task of VC because it can achieve conversion for speakers who did not appear during the training stage. Despite the significant progress made by previous methods in zero-shot VC, there is still room for improvement in separating speaker information and content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder for extracting speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring a small number of parameters. The experiments demonstrate that performance of the proposed model is superior to several state-of-the-art models, achieving both high similarity with the target speaker and intelligibility. In addition, the decoding speed of our model is much higher than the existing state-of-the-art models.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103143"},"PeriodicalIF":2.4,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-09-12, DOI: 10.1016/j.specom.2024.103140
Minghao Guo, Jianguo Wei, Ruiteng Zhang, Yu Zhao, Qiang Fang

Silent speech recognition (SSR) is an essential task in human–computer interaction, aiming to recognize speech from non-acoustic modalities. A key challenge in SSR is inherent input ambiguity, because non-acoustic signals carry only partial speech information. This ambiguity gives rise to homophenes: words with similar inputs yet different pronunciations. Current approaches address this issue either by utilizing richer additional inputs or by training extra models for cross-modal embedding compensation. In this paper, we propose an effective multi-modal co-learning framework that promotes the discriminative ability of silent speech representations via multi-stage training. We first construct the SSR backbone using ultrasound tongue imaging (UTI) as the main modality and then introduce two auxiliary modalities: lip video and audio signals. Using modality dropout, the model learns shared and modality-specific features from all available streams, creating a shared semantic space that improves generalization of the UTI representation. Given cross-modal unbalanced optimization, we highlight the importance of hyperparameter settings and modulation strategies in enabling modality-specific co-learning for SSR. Experimental results show that the modality-agnostic models with single UTI input outperform state-of-the-art modality-specific models. Confusion analysis based on phonemes/articulatory features confirms that co-learned UTI representations contain valuable information for distinguishing homophenes. Additionally, our model performs well on two unseen test sets, achieving cross-modal generalization for the uni-modal SSR task.
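Modality dropout, as referred to above, is a general training trick: whole input streams are randomly zeroed out so the model cannot over-rely on any single modality. A minimal, framework-agnostic sketch follows; the stream names, drop probability, and keep-at-least-one policy are illustrative choices, not taken from the paper:

import numpy as np

def modality_dropout(streams, p_drop=0.3, rng=None):
    # streams: dict mapping modality name -> feature array of shape (T, D_m).
    rng = np.random.default_rng() if rng is None else rng
    names = list(streams)
    keep = rng.random(len(names)) >= p_drop
    if not keep.any():                               # never drop every modality at once
        keep[rng.integers(len(names))] = True
    return {name: feats if kept else np.zeros_like(feats)
            for (name, feats), kept in zip(streams.items(), keep)}

# Toy usage: ultrasound as the main modality, lip video and audio as auxiliary streams.
rng = np.random.default_rng(0)
batch = {"ultrasound": rng.standard_normal((80, 512)),
         "lip_video": rng.standard_normal((80, 256)),
         "audio": rng.standard_normal((80, 80))}
print({name: bool(np.abs(feats).sum() > 0) for name, feats in modality_dropout(batch, rng=rng).items()})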

{"title":"Multi-modal co-learning for silent speech recognition based on ultrasound tongue images","authors":"Minghao Guo ,&nbsp;Jianguo Wei ,&nbsp;Ruiteng Zhang ,&nbsp;Yu Zhao ,&nbsp;Qiang Fang","doi":"10.1016/j.specom.2024.103140","DOIUrl":"10.1016/j.specom.2024.103140","url":null,"abstract":"<div><p>Silent speech recognition (SSR) is an essential task in human–computer interaction, aiming to recognize speech from non-acoustic modalities. A key challenge in SSR is inherent input ambiguity due to partial speech information absence in non-acoustic signals. This ambiguity leads to homophones-words with similar inputs yet different pronunciations. Current approaches address this issue either by utilizing richer additional inputs or training extra models for cross-modal embedding compensation. In this paper, we propose an effective multi-modal co-learning framework promoting the discriminative ability of silent speech representations via multi-stage training. We first construct the backbone of SSR using ultrasound tongue imaging (UTI) as the main modality and then introduce two auxiliary modalities: lip video and audio signals. Utilizing modality dropout, the model learns shared/specific features from all available streams creating a same semantic space for better generalization of the UTI representation. Given cross-modal unbalanced optimization, we highlight the importance of hyperparameter settings and modulation strategies in enabling modality-specific co-learning for SSR. Experimental results show that the modality-agnostic models with single UTI input outperform state-of-the-art modality-specific models. Confusion analysis based on phonemes/articulatory features confirms that co-learned UTI representations contain valuable information for distinguishing homophenes. Additionally, our model can perform well on two unseen testing sets, achieving cross-modal generalization for the uni-modal SSR task.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103140"},"PeriodicalIF":2.4,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142239519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0