
Latest Interspeech Publications

Improving Spoken Language Understanding with Cross-Modal Contrastive Learning
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-658
Jingjing Dong, Jiayi Fu, P. Zhou, Hao Li, Xiaorui Wang
Spoken language understanding (SLU) is conventionally based on a pipeline architecture that suffers from error propagation. To mitigate this problem, end-to-end (E2E) models have been proposed to map speech input directly to the desired semantic outputs. Meanwhile, other work tries to leverage linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which utilizes cross-modal contrastive learning to learn better multi-modal representations. In particular, a two-stream multi-modal framework is designed, and a contrastive learning task is performed across speech and text representations. Moreover, CMCL combines a multi-modal shared classification task with the contrastive learning task to guide the learned representation and improve performance on intent classification. We also investigate the efficacy of employing cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on the FSC and Smartlights datasets, respectively, outperforming state-of-the-art comparative methods. Moreover, performance decreases by only 0.32% and 2.8%, respectively, when training on 10% and 1% of the FSC dataset, indicating its advantage in few-shot scenarios.
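The abstract gives no implementation details; as a rough, hypothetical sketch of the kind of cross-modal contrastive objective it describes, the snippet below computes a symmetric InfoNCE-style loss between paired speech and text embeddings from a two-stream model. The function name, temperature value, and embedding shapes are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired speech/text embeddings.

    speech_emb, text_emb: (batch, dim) tensors from the two streams.
    Matching rows are treated as positives, all other rows as negatives.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)               # speech -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)           # text -> speech direction
    return 0.5 * (loss_s2t + loss_t2s)

# Example with random embeddings standing in for the two streams:
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```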
Pages: 2693-2697
Citations: 2
Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-10077
Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li
Recent studies indicate the effectiveness of deep learning (DL)-based methods for acoustic echo cancellation (AEC) in background noise and nonlinear distortion scenarios. However, content and speaker variations degrade the performance of such DL-based AEC models. In this study, we propose an AEC model that takes phonetic and speaker identity features as auxiliary inputs, and present a complex dual-path convolutional transformer network (DPCTNet). Given an input signal, the phonetic and speaker identity features extracted by a contrastive predictive coding network (a self-supervised pre-trained model), together with the complex spectrum generated by the short-time Fourier transform, are treated as the spectrum pattern inputs for DPCTNet. In addition, DPCTNet applies an encoder-decoder architecture improved by inserting a dual-path transformer to effectively model the extracted inputs within a single frame and the dependence between consecutive frames. Comparative experimental results showed that AEC performance can be improved by explicitly considering phonetic and speaker identity features.
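As a loose illustration of the dual-path idea mentioned above — one transformer modeling structure within a single frame and a second modeling dependencies across consecutive frames — here is a minimal, hypothetical block built from standard transformer encoder layers. The dimensions and the use of `nn.TransformerEncoderLayer` are assumptions for illustration, not the DPCTNet architecture itself.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Toy dual-path block: one transformer models within-frame (frequency) structure,
    a second models dependencies across consecutive frames (time)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        x = self.intra(x.reshape(b * t, f, d)).reshape(b, t, f, d)   # intra-frame path
        x = x.permute(0, 2, 1, 3)                                     # (b, freq, time, dim)
        x = self.inter(x.reshape(b * f, t, d)).reshape(b, f, t, d)    # inter-frame path
        return x.permute(0, 2, 1, 3)

# Spectrum features concatenated with auxiliary phonetic/speaker embeddings would be
# projected to `dim` before entering the block; here random features stand in for them.
out = DualPathBlock()(torch.randn(2, 100, 32, 64))
```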
Pages: 2513-2517
Citations: 1
Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-10513
Sudarsana Reddy Kadiri, F. Javanmardi, P. Alku
Prior studies on the automatic classification of voice quality have mainly used support vector machine (SVM) classifiers with the acoustic speech signal as input. Recently, one voice quality classification study was published using neck surface accelerometer (NSA) and speech signals as inputs and SVMs with hand-crafted glottal source features. The present study examines simultaneously recorded NSA and speech signals in the classification of three voice qualities (breathy, modal, and pressed) using convolutional neural networks (CNNs) as classifiers. The study has two goals: (1) to investigate which of the two signals (NSA vs. speech) is more useful in the classification task, and (2) to compare whether deep learning-based CNN classifiers with spectrogram and mel-spectrogram features can improve classification accuracy compared to SVM classifiers using hand-crafted glottal source features. The results indicated that the NSA signal yielded better classification of the voice qualities than the speech signal, and that the CNN classifier outperformed the SVM classifiers by large margins. The best mean classification accuracy was achieved with the mel-spectrogram as input to the CNN classifier (93.8% for NSA and 90.6% for speech).
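For readers who want a concrete starting point, the sketch below shows a generic mel-spectrogram front-end feeding a small CNN with a three-way output (breathy, modal, pressed). The layer sizes and sample rate are assumptions; the paper's actual CNN configuration is not reproduced here.

```python
import torch
import torch.nn as nn
import torchaudio

# Hypothetical front-end + classifier; layer sizes are illustrative, not from the paper.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 3),              # 3 voice qualities: breathy, modal, pressed
)

waveform = torch.randn(1, 16000)                 # 1 second of (fake) NSA or speech signal
logits = classifier(mel(waveform).unsqueeze(0))  # add channel dim -> (1, 1, n_mels, frames)
```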
Pages: 5253-5257
Citations: 2
Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-888
C. Ahn, Chamara Kasun, S. Sivadas, Jagath Rajapakse
Pages: 744-748
Citations: 1
Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-11388
K. Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, S. Sagayama, H. Yamasue
Autism spectrum disorder (ASD) is a highly prevalent neurodevelopmental disorder characterized by deficits in communication and social interaction. Head-nodding, a kind of visual backchannel, is used to co-construct conversation and is crucial to smooth social interaction. In the present study, we quantitatively analyze how head-nodding relates to speech turn-taking and prosodic change in Japanese conversation. The results showed that nodding was observed less frequently in ASD participants, especially around speakers' turn transitions, whereas it was notable just before and after turn-taking in individuals with typical development (TD). Analysis using long sliding segments of 16 s revealed that synchronization between nod frequency and mean vocal intensity was higher in the TD group than in the ASD group. Classification by a support vector machine (SVM) using these proposed features achieved high performance, with an accuracy of 91.1% and an F-measure of 0.942. In addition, the results indicated an optimal way of nodding with respect to turn-ending and emphasis, which could provide standard responses for reference or feedback in social skill training for people with ASD. Furthermore, the natural timing of nodding implied by the results can also be applied to developing interactive responses in humanoid robots or computer graphics (CG) agents.
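A minimal, hypothetical version of the classification step could look like the following: an RBF-kernel SVM trained on per-participant nod features with cross-validation. The feature matrix and labels here are random placeholders, not the study's actual features or results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical feature matrix: one row per participant, columns such as nod rate near
# turn transitions and nod/intensity synchronization; values here are random stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = rng.integers(0, 2, size=40)                  # 0 = TD, 1 = ASD (labels for illustration)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)        # accuracy per fold
print(scores.mean())
```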
Pages: 1136-1140
Citations: 0
Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-11077
Yoshiaki Bando, T. Aizawa, Katsutoshi Itoyama, K. Nakadai
This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. Neural FCA, however, requires that the number of sound sources is fixed and known in advance. This requirement complicates its use in a DSR front-end for multispeaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA that handles a dynamically changing number of sound sources by taking the temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results on the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rate.
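The snippet below is not neural FCA itself; it is only a toy sketch of the auxiliary-information idea described above, i.e., conditioning a per-speaker mask estimator on each target speaker's voice activity so that the number of output sources follows the provided activity streams. All shapes, layer choices, and names are assumptions.

```python
import torch
import torch.nn as nn

class VADConditionedMaskNet(nn.Module):
    """Toy mask estimator: log-magnitude spectrogram plus per-speaker voice activity
    in, per-speaker time-frequency masks out. Not the FCA model itself."""
    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq + 1, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, spec, vad):                 # spec: (B, T, F), vad: (B, T, S)
        masks = []
        for s in range(vad.size(-1)):             # one forward pass per active speaker slot
            x = torch.cat([spec, vad[..., s:s + 1]], dim=-1)
            h, _ = self.rnn(x)
            masks.append(torch.sigmoid(self.out(h)))
        return torch.stack(masks, dim=1)          # (B, S, T, F)

net = VADConditionedMaskNet()
masks = net(torch.randn(2, 50, 257), (torch.rand(2, 50, 3) > 0.5).float())
```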
Pages: 3824-3828
Citations: 4
Native phonotactic interference in L2 vowel processing: Mouse-tracking reveals cognitive conflicts during identification
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-12
Yizhou Wang, R. Bundgaard-Nielsen, B. Baker, Olga Maxwell
Regularities of phoneme distribution in a listener’s native language (L1), i.e., L1 phonotactics, can at times induce interference in their perception of second language (L2) phonemes and phonemic strings. This paper presents a study examining phonological interference experienced by L1 Mandarin listeners in identifying the English /i/ vowel in three consonantal contexts /p, f, w/, which have different distributional patterns in Mandarin phonology: /pi/ is a licit sequence in Mandarin, */fi/ is illicit due to co-occurrence restrictions, and */wi/ is illicit due to Mandarin contextual allophony. L1 Mandarin listeners completed two versions of an identification experiment (keystroke and mouse-tracking), in which they identified vowels in different consonantal contexts. Analysis of error rates, response times, and hand motions in the tasks suggests that L1 co-occurrence restriction and contextual allophony induce different levels of phonological interference in L2 vowel perception compared to the licit control condition. In support of the dynamic theory of linguistic cognition, our results indicate that illicit phonotactic contexts can lead to more identification errors, longer decision processes, and spurious activation of a distractor category.
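The abstract does not specify which hand-motion measures were analyzed; as one common mouse-tracking statistic that such an analysis might use, the sketch below computes the maximum perpendicular deviation of a cursor trajectory from the straight start-to-end path. This is an illustrative assumption, not the authors' analysis code.

```python
import numpy as np

def max_deviation(trajectory):
    """Maximum perpendicular deviation of a 2-D mouse trajectory from the straight line
    between its start and end points -- a common curvature measure in mouse-tracking work."""
    traj = np.asarray(trajectory, dtype=float)
    start, end = traj[0], traj[-1]
    direction = end - start
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    # Perpendicular distance of every sample from the start->end line (2-D cross product).
    rel = traj - start
    dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0]) / norm
    return float(dist.max())

print(max_deviation([(0, 0), (0.3, 0.6), (0.5, 0.7), (1.0, 1.0)]))
```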
Pages: 5223-5227
Citations: 0
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-638
Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, H. Saruwatari
We propose a speech synthesis model that predicts appropriate voice styles on the basis of character-annotated text for audiobook speech synthesis. An audiobook is more engaging when the narrator uses distinctive voices for the story's characters. Our goal is to produce such distinctive voices within a speech synthesis framework. However, this kind of distinction has not been extensively investigated in audiobook speech synthesis. To enable the speech synthesis model to achieve character-dependent distinctive voices with minimal extra annotation, we propose a model that predicts character-appropriate voices from quotation-annotated text. Our proposed model involves character-acting-style extraction based on a vector quantized variational autoencoder and style prediction from quotation-annotated text, which enables us to automate audiobook creation with character-distinctive voices. To the best of our knowledge, this is the first attempt to model intra-speaker voice style depending on character acting for audiobook speech synthesis. We conducted subjective evaluations of our model, and the results indicate that it generated more distinctive character voices than models that do not use the explicit character-acting style, while maintaining the naturalness of the synthetic speech.
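As a rough, hypothetical illustration of the vector-quantization step underlying a VQVAE-style acting-style extractor, the snippet below maps continuous style vectors to their nearest codebook entries with a straight-through gradient. Codebook size and dimensionality are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: map each style vector to its nearest codebook entry and pass
    gradients through with the straight-through estimator."""
    def __init__(self, n_codes=64, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                         # z: (batch, dim) continuous style vectors
        d = torch.cdist(z, self.codebook.weight)  # distances to every code
        idx = d.argmin(dim=-1)                    # nearest acting-style code per utterance
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()              # straight-through gradient
        return z_q, idx

z_q, idx = VectorQuantizer()(torch.randn(4, 128))
```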
Pages: 4551-4555
Citations: 3
Speaker Trait Enhancement for Cochlear Implant Users: A Case Study for Speaker Emotion Perception
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-10951
Avamarie Brueggeman, J. Hansen
Despite significant progress in areas such as speech recognition, cochlear implant users still experience challenges in identifying various speaker traits such as gender, age, emotion, and accent. In this study, we focus on emotion as one such trait. We propose the use of emotion intensity conversion to perceptually enhance emotional speech, with the goal of improving speech emotion recognition for cochlear implant users. To this end, we utilize a parallel speech dataset containing emotion and intensity labels to perform conversion from normal- to high-intensity emotional speech. A non-negative matrix factorization method is integrated to perform emotion intensity conversion via spectral mapping. We evaluate our emotional speech enhancement using a support vector machine model for emotion recognition. In addition, we perform an emotion recognition listening experiment with normal-hearing listeners using vocoded audio. It is suggested that such enhancement will benefit speaker trait perception for cochlear implant users.
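As a hedged illustration of the non-negative matrix factorization building block behind such spectral mapping, the sketch below factorizes a toy magnitude spectrogram into a spectral basis and activations with scikit-learn's NMF. In exemplar-based conversion, activations estimated against a normal-intensity basis would be recombined with a paired high-intensity basis; that pairing is only noted in the comments, and the data here are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy magnitude spectrogram (freq bins x frames); real use would take an STFT of speech.
spec = np.abs(np.random.default_rng(0).normal(size=(257, 100)))

# Factorize into spectral basis W and activations H. In exemplar-based conversion, the
# activations found against a "normal-intensity" basis are reused with a paired
# "high-intensity" basis to synthesize the converted spectrum.
model = NMF(n_components=32, init="random", max_iter=300, random_state=0)
W = model.fit_transform(spec)                    # (257, 32) spectral basis
H = model.components_                            # (32, 100) activations
reconstructed = W @ H
```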
Pages: 2268-2272
Citations: 0
Semantically Meaningful Metrics for Norwegian ASR Systems
Pub Date: 2022-09-18 · DOI: 10.21437/interspeech.2022-817
J. Rugayan, T. Svendsen, G. Salvi
Evaluation metrics are important for quantifying the performance of Automatic Speech Recognition (ASR) systems. However, the widely used word error rate (WER) captures errors at the word level only and weighs each error equally, which makes it insufficient for discerning ASR system performance on downstream tasks such as Natural Language Understanding (NLU) or information retrieval. In this paper we explore a more robust and discriminative evaluation metric for Norwegian ASR systems through the use of semantic information modeled by a transformer-based language model. We propose Aligned Semantic Distance (ASD), which employs dynamic programming to quantify the similarity between the reference and hypothesis text. First, embedding vectors are generated using the NorBERT model. Afterwards, the minimum global distance of the optimal alignment between these vectors is obtained and normalized by the sequence length of the reference embedding vector. In addition, we present results using Semantic Distance (SemDist) and compare them with ASD. The results show that, for the same WER, ASD and SemDist values can vary significantly, exemplifying that not all recognition errors can be considered equally important. We investigate the resulting data and present examples which demonstrate the nuances of both metrics in evaluating various transcription errors.
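The abstract describes ASD as a dynamic-programming alignment over NorBERT embedding vectors, with the minimum global distance normalized by the reference length. One plausible reading of that description, shown below as an illustrative sketch rather than the authors' implementation, accumulates cosine distances along a monotone optimal alignment:

```python
import numpy as np

def aligned_semantic_distance(ref_emb, hyp_emb):
    """DP alignment between reference and hypothesis embedding sequences.
    Accumulates cosine distances along the optimal monotone alignment and
    normalizes by the reference length (one interpretation of the metric)."""
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    hyp = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    cost = 1.0 - ref @ hyp.T                      # pairwise cosine distances (R x H)
    R, H = cost.shape
    D = np.full((R + 1, H + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[R, H] / R                            # normalize by reference length

rng = np.random.default_rng(0)
print(aligned_semantic_distance(rng.normal(size=(5, 768)), rng.normal(size=(6, 768))))
```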
Pages: 2283-2287
Citations: 1