
Latest Interspeech Publications

Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-854
Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai
{"title":"Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese","authors":"Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai","doi":"10.21437/interspeech.2022-854","DOIUrl":"https://doi.org/10.21437/interspeech.2022-854","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3078-3082"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45598645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
The Prosody of Cheering in Sport Events
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10982
Marzena Żygis, Sarah Wesołek, Nina Hosseini-Kivanani, M. Krifka
Motivational speaking usually conveys a highly emotional message and its purpose is to invite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for single addressees in sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer on an individual marathon runner in a sporting event shown on video by producing his or her name (1-5 syllables long). For comparison, the participants also produced the same names in isolation and in carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners' names by dividing them into syllables, others pronounced the names as quickly as possible, putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not intensity that contributed most to the differences between the speaking styles (cheering vs. neutral), at least with the methods we were using. Rather, participants employed higher fundamental frequency and longer duration when cheering for marathon runners.
Citations: 0
Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-197
Vera Bernhard, Sandra Schwab, J. Goldman
We propose a system for automatic lexical stress detection in isolated English words. It is designed to be part of the computer-assisted pronunciation training application MIAPARLE (“https://miaparle.unige.ch”), which focuses specifically on the acquisition of stress contrasts. Training lexical stress cannot be disregarded in language education, as production accuracy strongly affects the intelligibility and perceived fluency of an L2 speaker. The pipeline automatically segments audio input into syllables, over which duration, intensity, pitch, and spectral information are calculated. Since the stress of a syllable is defined relative to its neighboring syllables, the values obtained for each syllable are complemented with differential values relative to the preceding and following syllables. The resulting feature vectors, retrieved from 1011 recordings of single words spoken by native English speakers, are used to train a Voting Classifier composed of four supervised classifiers, namely a Support Vector Machine, a Neural Net, a K-Nearest-Neighbors, and a Random Forest classifier. The approach labels the syllables of a single word as stressed or unstressed with an F1 score of 94% and an accuracy of 96%.
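As an illustration of the classifier combination described in the abstract, the sketch below (not the authors' code) builds a soft-voting ensemble from scikit-learn's SVM, multilayer perceptron (standing in for the "Neural Net"), k-nearest-neighbors, and random-forest classifiers over placeholder per-syllable feature vectors; the feature layout, scaling, and hyperparameters are assumptions.

```python
# Illustrative sketch: soft-voting ensemble over per-syllable prosodic features.
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each row: duration, intensity, pitch, and spectral measures for one syllable,
# plus differential values w.r.t. the preceding and following syllables
# (placeholder random data with an assumed 12-dimensional layout).
X = np.random.rand(1011, 12)
y = np.random.randint(0, 2, 1011)     # 1 = stressed, 0 = unstressed

clf = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("rf", RandomForestClassifier()),
    ],
    voting="soft",                    # average predicted probabilities
)
clf.fit(X, y)
print(clf.predict(X[:5]))             # stressed / unstressed per syllable
```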
Citations: 0
Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10315
Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang
In terms of subjective evaluation, speech quality has generally been described by a mean opinion score (MOS). In recent years, non-intrusive speech quality assessment has shown active progress by leveraging deep learning techniques. In this paper, we propose a new multi-task learning based model, termed the subband adaptive attention temporal convolutional neural network (SAA-TCN), to perform non-intrusive speech quality assessment with the help of a MOS value interval detector (VID) auxiliary task. Instead of using the fullband magnitude spectrogram, the proposed model takes subband magnitude spectrograms as input to reduce model parameters and prevent overfitting. To effectively utilize the energy distribution information along the subband frequency dimension, subband adaptive attention (SAA) is employed to enhance the TCN model. Experimental results reveal that the proposed method achieves superior performance in predicting MOS values. In the ConferencingSpeech 2022 Challenge, our method achieves a mean Pearson's correlation coefficient (PCC) of 0.763 and outperforms the challenge baseline by 0.233.
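Two ingredients of this abstract are easy to make concrete: the subband magnitude spectrogram used as model input and the PCC metric used for evaluation. The sketch below is only an illustration under assumed settings (16 kHz audio, 512-sample frames, four equal subbands, made-up MOS values), not the authors' pipeline.

```python
# Illustrative sketch: subband magnitude spectrogram + PCC scoring of MOS predictions.
import numpy as np
from scipy.signal import stft
from scipy.stats import pearsonr

SR = 16000  # assumed sampling rate

def subband_magnitudes(wave, n_bands=4):
    """Split the fullband magnitude spectrogram into equal frequency subbands."""
    _, _, spec = stft(wave, fs=SR, nperseg=512, noverlap=256)
    mag = np.abs(spec)                          # (freq_bins, frames)
    usable = mag.shape[0] - mag.shape[0] % n_bands
    return mag[:usable].reshape(n_bands, -1, mag.shape[1])

wave = np.random.randn(3 * SR)                  # placeholder 3-second signal
bands = subband_magnitudes(wave)
print(bands.shape)                              # (n_bands, bins_per_band, frames)

# Evaluation as in the challenge: PCC between predicted and reference MOS
# (numbers below are made up for illustration).
mos_true = np.array([3.2, 4.1, 2.5, 3.8])
mos_pred = np.array([3.0, 4.3, 2.7, 3.6])
pcc, _ = pearsonr(mos_true, mos_pred)
print(f"PCC = {pcc:.3f}")
```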
Citations: 1
On Breathing Pattern Information in Synthetic Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10271
Z. Mostaani, M. Magimai.-Doss
The respiratory system is an integral part of human speech production. As a consequence, there is a close relation between respiration and the speech signal, and the produced speech signal carries breathing-pattern-related information. Speech can also be generated by speech synthesis systems. In this paper, we investigate whether synthetic speech carries breathing-pattern-related information in the same way as natural human speech. We address this research question in the framework of logical-access presentation attack detection, using embeddings extracted from neural networks pre-trained for speech breathing pattern estimation. Our studies on the ASVspoof 2019 challenge data show that there is a clear distinction between the extracted breathing pattern embeddings of natural human speech and synthesized speech, indicating that speech synthesis systems tend not to carry breathing-pattern-related information in the same way as human speech. In contrast, this is not the case for voice conversion of natural human speech.
Citations: 2
Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11378
Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He
End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained, sequence-level text-to-audio knowledge transfer with a simple loss, and neglect the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross attention module to align text tokens with the frame features of speech, encouraging the model to attend to the salient acoustic features associated with each token while transferring semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning at the sentence level. Finally, we explore various data augmentation methods to mitigate the shortage of labelled data for training E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.
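The token-to-frame alignment idea can be pictured as ordinary cross attention with text tokens as queries and speech frames as keys and values. The PyTorch sketch below is a hedged illustration of that mechanism with assumed dimensions; it is not the authors' implementation, and the contrastive and multi-task components are omitted.

```python
# Illustrative sketch: cross attention aligning text tokens with speech frames.
import torch
import torch.nn as nn

class TokenFrameCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, token_emb, frame_feats):
        # token_emb:   (batch, n_tokens, dim)  text-side queries
        # frame_feats: (batch, n_frames, dim)  speech-side keys/values
        aligned, weights = self.attn(token_emb, frame_feats, frame_feats)
        return aligned, weights  # aligned: acoustic context gathered per token

module = TokenFrameCrossAttention()
tokens = torch.randn(2, 12, 256)     # 12 subword tokens (placeholder)
frames = torch.randn(2, 200, 256)    # 200 speech frames (placeholder)
aligned, attn_weights = module(tokens, frames)
print(aligned.shape, attn_weights.shape)   # (2, 12, 256) (2, 12, 200)
```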
Citations: 5
Neural correlates of acoustic and semantic cues during speech segmentation in French
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10986
Maria del Mar Cordero, Ambre Denis-Noël, E. Spinelli, F. Meunier
Natural speech is highly complex and variable. In particular, spoken language, in contrast to written language, has no clear word boundaries. Adult listeners can exploit different types of information, such as acoustic and semantic cues, to segment the continuous stream. However, the weight of these cues when they co-occur remains to be determined. Behavioural tasks are not conclusive on this point, as they focus participants' attention on certain sources of information, thus biasing the results. Here, we looked at the processing of homophonic utterances such as l'amie vs. la mie (both /lami/), which contain fine acoustic differences and whose meaning changes depending on segmentation. To examine the perceptual resolution of such ambiguities when semantic information is available, we measured the online processing of sentences containing such sequences in an ERP experiment involving no active task. In the congruent condition, the semantic information matched the acoustic signal of the word amie, while in the incongruent condition the semantic information carried by the sentence and the acoustic signal led to different lexical candidates. No clear neural markers of the use of acoustic cues were found. Our results suggest a preponderant weight of semantic information over acoustic information during natural spoken sentence processing.
Citations: 0
Complex sounds and cross-language influence: The case of ejectives in Omani Mehri
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10199
Rachid Ridouane, Philipp Buech
Ejective consonants are known to vary considerably both cross-linguistically and within individual languages. This variability is often considered a consequence of the complex articulatory strategies involved in their production. Because they are complex, they may be particularly prone to sound change, especially under cross-language influence. In this study, we consider the production of ejectives in Mehri, an endangered Semitic language spoken in Oman, where considerable influence from Arabic is expected. We provide acoustic data from seven speakers producing a list of items contrasting ejective and pulmonic alveolar and velar stops in word-initial (/#—/), word-medial (V—V), and word-final (V—#) positions. Different durational and non-durational correlates were examined. The relative importance of these correlates was quantified by computing D-prime values for each. The key empirical finding is that the parameters used to signal ejectivity differ mainly according to whether the stop is alveolar or velar. Specifically, ejective alveolar stops display characteristics of pharyngealization, similar to Arabic, whereas velars still maintain attributes of ejectivity in some word positions. We interpret these results as diagnostic of a sound change currently in progress, coupled with an ongoing context-dependent neutralization.
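For readers unfamiliar with the statistic, D-prime here can be read as the standardized separation between the ejective and pulmonic distributions of a given acoustic correlate. The short sketch below, with made-up example values, only illustrates that sensitivity index; it is an assumption about the computation, not the authors' analysis code.

```python
# Illustrative sketch: d' as standardized separation between two cue distributions.
import numpy as np

def d_prime(ejective_vals, pulmonic_vals):
    """d' = |mean difference| / pooled standard deviation of the two classes."""
    e = np.asarray(ejective_vals, dtype=float)
    p = np.asarray(pulmonic_vals, dtype=float)
    pooled_sd = np.sqrt((e.var(ddof=1) + p.var(ddof=1)) / 2.0)
    return abs(e.mean() - p.mean()) / pooled_sd

# e.g. an acoustic measure (ms) for ejective vs. pulmonic stops (made-up numbers)
print(round(d_prime([85, 92, 78, 88, 95], [60, 55, 70, 65, 58]), 2))
```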
Citations: 0
Syllable sequence of /a/+/ta/ can be heard as /atta/ in Japanese with visual or tactile cues
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10099
T. Arai, Miho Yamada, Megumi Okusawa
In our previous work, we reported that the word /atta/, with a geminate consonant, differs from the syllable sequence /a/+pause+/ta/ in Japanese; specifically, there are formant transitions at the end of the first syllable in /atta/ but not in /a/+pause+/ta/. We also showed that native Japanese speakers perceived /atta/ when a facial video of /atta/ was played synchronously with an audio signal of /a/+pause+/ta/. In that study, we used two video clips of the two utterances, in which the speaker was asked to control only the timing of the articulatory closure. In that case, there was no guarantee that the videos would be exactly the same except for the timing. Therefore, in the current study, we use a physical model of the human vocal tract with a miniature robot hand unit to produce the articulatory movements that serve as visual cues. We also provide tactile cues to the listener's finger, because we want to test, within the same framework, whether cues from another modality affect this perception. Our findings showed that when either visual or tactile cues were presented together with the audio stimulus, listeners responded more frequently that they heard /atta/ than with audio-only presentation.
Citations: 0
Cross-Lingual Transfer Learning Approach to Phoneme Error Detection via Latent Phonetic Representation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10228
Jovan M. Dalhouse, K. Itou
Extensive research has been conducted on CALL systems for pronunciation error detection to automate language improvement through self-evaluation. However, many previous approaches have relied on HMM-based or hybrid neural network models which, although proven effective, often utilize phonetically labelled L2 speech data that is expensive and often scarce. This paper discusses a "zero-shot" transfer learning approach to detecting phonetic errors in the L2 English speech of native Japanese speakers, using solely unaligned, phonetically labelled native-language speech. The proposed method introduces a simple base architecture built on the XLSR-Wav2Vec2.0 model pre-trained on unlabelled multilingual speech. The phoneme mapping for each language is determined based on differences in the articulation of similar phonemes. The method achieved a Phoneme Error Rate of 0.214 on erroneous L2 speech after fine-tuning on 70 hours of speech with low-resource automated phonetic labelling, and proved to additionally model the phonemes of the L2 speaker's native language effectively without the need for L2 speech fine-tuning.
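The reported Phoneme Error Rate is the usual edit-distance-based measure: the Levenshtein distance between the recognised and reference phoneme sequences, divided by the reference length. The snippet below is a minimal, self-contained illustration; the phoneme sequences in it are made up for the example.

```python
# Illustrative sketch: Phoneme Error Rate via Levenshtein edit distance.
def phoneme_error_rate(reference, hypothesis):
    r, h = reference, hypothesis
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                               # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                               # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

ref = ["s", "ih", "t", "ii", "z"]      # reference phonemes (made up)
hyp = ["s", "i", "t", "ii", "z", "u"]  # recognised phonemes (made up)
print(round(phoneme_error_rate(ref, hyp), 3))    # -> 0.4
```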
Citations: 1