
Latest Interspeech publications

Predicting Speech Intelligibility using the Spike Activity Mutual Information Index
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10488
F. Cardinale, W. Nogueira
The spike activity mutual information index (SAMII) is presented as a new intrusive objective metric to predict speech intelligibility. A target speech signal and a speech-in-noise signal are processed by a state-of-the-art computational model of the peripheral auditory system. It simulates the neural activity in a population of auditory nerve fibers (ANFs), which are grouped into critical bands covering the speech frequency range. The mutual information between the neural activity of both signals is calculated using analysis windows of 20 ms. The mutual information is then averaged across these analysis windows to obtain SAMII. SAMII is also extended to binaural scenarios by calculating the index for the left ear, right ear, and both ears, and choosing the best case for predicting intelligibility. SAMII was developed on the first Clarity Prediction Challenge training dataset and compared against the modified binaural short-time objective intelligibility (MBSTOI) as a baseline. Scores are reported as the root mean squared error (RMSE) between measured and predicted data on the Clarity Challenge test dataset. SAMII scored 35.16%, slightly better than MBSTOI, which obtained 36.52%. This work leads to the conclusion that SAMII is a reliable objective metric when “low-level” representations of speech, such as spike activity, are used.
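A minimal sketch of the windowed mutual-information computation described above: per-window mutual information between simulated ANF spike counts for the clean and noisy signals, averaged over windows. The auditory-model front end is not reproduced here; `spikes_clean` and `spikes_noisy` are hypothetical (n_frames, n_fibers) spike-count arrays assumed to come from such a model.

```python
# Sketch only: SAMII-like averaging of per-window mutual information.
import numpy as np
from sklearn.metrics import mutual_info_score

def samii_like_index(spikes_clean, spikes_noisy, frames_per_window):
    """Average per-window mutual information between two spike-count arrays."""
    n_windows = spikes_clean.shape[0] // frames_per_window
    mi_values = []
    for w in range(n_windows):
        sl = slice(w * frames_per_window, (w + 1) * frames_per_window)
        # Flatten the window and discretize counts so MI can be estimated
        a = spikes_clean[sl].ravel().astype(int)
        b = spikes_noisy[sl].ravel().astype(int)
        mi_values.append(mutual_info_score(a, b))
    return float(np.mean(mi_values)) if mi_values else 0.0

# Toy usage with random spike counts (illustration only)
rng = np.random.default_rng(0)
clean = rng.poisson(2.0, size=(1000, 32))
noisy = np.clip(clean + rng.poisson(0.5, size=(1000, 32)) - 1, 0, None)
print(samii_like_index(clean, noisy, frames_per_window=20))
```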
Citations: 1
Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11025
R. Corey, Manan Mittal, Kanad Sarkar, A. Singer
We consider the problem of separating speech from several talkers in background noise using a fixed microphone array and a set of wearable devices. Wearable devices can provide reliable information about speech from their wearers, but they typically cannot be used directly for multichannel source separation due to network delay, sample rate offsets, and relative motion. Instead, the wearable microphone signals are used to compute the speech presence probability (SPP) for each talker at each time-frequency index. Those parameters, which are robust against small sample rate offsets and relative motion, are used to track the second-order statistics of the speech sources and background noise. The fixed array then separates the speech signals using an adaptive linear time-varying multichannel Wiener filter. The proposed method is demonstrated using real-room recordings from three human talkers with binaural earbud microphones and an eight-microphone tabletop array. Although the wearable microphones cannot be used directly for separation, they are useful for distinguishing between different sources because of their known positions relative to the talkers. The proposed system uses the wearable devices to estimate SPP values, which are then used to learn the second-order statistics for each source at the microphones of the fixed array. The array separates the sources using an adaptive linear time-varying spatial filter suitable for real-time applications. This work combines the cooperative architecture of [19], the distributed SPP method of [18], and the motion-robust modeling of [15]. The system is implemented adaptively and demonstrated using live human talkers.
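A hedged sketch of the SPP-driven adaptive filter the abstract describes: wearable-derived speech presence probabilities weight recursive updates of speech and noise spatial covariances at the fixed array, and a linear multichannel Wiener filter extracts the target. Variable names and the forgetting factor are illustrative assumptions, not the authors' code.

```python
# Sketch: SPP-weighted covariance tracking and a multichannel Wiener filter.
import numpy as np

def update_covariances(R_s, R_n, x_tf, spp, alpha=0.95):
    """Recursively update speech and noise spatial covariances for one bin.

    x_tf : (n_mics,) complex STFT vector at one time-frequency index
    spp  : scalar speech presence probability supplied by the wearable device
    """
    outer = np.outer(x_tf, x_tf.conj())
    R_s = alpha * R_s + (1 - alpha) * spp * outer
    R_n = alpha * R_n + (1 - alpha) * (1 - spp) * outer
    return R_s, R_n

def mwf_weights(R_s, R_n, ref_mic=0):
    """Multichannel Wiener filter weights for the chosen reference channel."""
    R_x = R_s + R_n
    return np.linalg.solve(R_x, R_s[:, ref_mic])

# Toy usage for a single frequency bin over a few frames
rng = np.random.default_rng(1)
n_mics = 8
R_s = np.eye(n_mics, dtype=complex) * 1e-3
R_n = np.eye(n_mics, dtype=complex) * 1e-3
for _ in range(50):
    x = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)
    R_s, R_n = update_covariances(R_s, R_n, x, spp=0.7)
w = mwf_weights(R_s, R_n)
print(w.shape)  # (8,)
```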
Citations: 0
An Automatic Soundtracking System for Text-to-Speech Audiobooks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10236
Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin
Background music (BGM) plays an essential role in audiobooks: it can enhance the immersive experience of audiences and help them better understand the story. However, well-designed BGM still requires human effort in text-to-speech (TTS) audiobook production, which is quite time-consuming and costly. In this paper, we introduce an automatic soundtracking system for TTS-based audiobooks. The proposed system divides the soundtracking process into three tasks: plot partition, plot classification, and music selection. The experiments show that both our plot partition module and our plot classification module outperform the baselines by a large margin. Furthermore, TTS-based audiobooks produced with our proposed automatic soundtracking system achieve performance comparable to those produced with human soundtracking. To the best of our knowledge, this is the first work on an automatic soundtracking system for audiobooks. Demos are available at https://acst1223.github.io/interspeech2022/main.
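A minimal structural sketch of the three-stage pipeline named above (plot partition, plot classification, music selection). The stage implementations below are placeholder assumptions for illustration only; the paper's actual modules are trained models, not keyword rules.

```python
# Sketch: soundtracking pipeline skeleton with hypothetical stub stages.
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    mood: str = "neutral"

def partition_plots(paragraphs):
    """Assumption: treat each paragraph as one plot segment."""
    return [Segment(p) for p in paragraphs]

def classify_plot(segment):
    """Assumption: trivial keyword rule standing in for the trained classifier."""
    sad_words = ("cried", "alone", "lost")
    segment.mood = "sad" if any(w in segment.text.lower() for w in sad_words) else "neutral"
    return segment

def select_music(segment, library):
    """Pick the first track whose mood tag matches the predicted mood."""
    return next((t for t in library if t["mood"] == segment.mood), None)

library = [{"name": "calm_piano", "mood": "neutral"}, {"name": "slow_strings", "mood": "sad"}]
for seg in map(classify_plot, partition_plots(["She cried alone that night.", "Morning came."])):
    print(seg.mood, select_music(seg, library)["name"])
```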
Citations: 0
Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-576
Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani
Missing data automatic speech recognition (MD-ASR) can utilize the uncertainty of speech enhancement (SE) results without re-training of model parameters. Such uncertainty is represented by a probabilistic evidence model, and its design and expectation calculation are important. Two problems arise in applying the MD approach to utterance-wise ASR based on a neural encoder-decoder model: the high dimensionality of an utterance-wise evidence model, and the discontinuity among frames of the samples generated when approximating the expectation with the Monte-Carlo method. We propose new utterance-wise evidence models that use a latent variable, together with an empirical method for sampling from them. The space of our latent model is restricted by simpler conditional probability density functions (pdfs) given the latent variable, which enables us to generate samples from the low-dimensional space in a deterministic or stochastic way. Because the latent variable also works as a common smoothing parameter among the simple pdfs, the generated samples are continuous across frames, which improves ASR performance, unlike frame-wise models. The uncertainty from a neural SE is also used as a component in our mixture pdf models. Experiments showed that the character error rate of the enhanced speech was further improved by 2.5 points on average with our MD-ASR using a transformer model.
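A hedged illustration of the sampling idea described above: a single utterance-level latent variable acts as a shared smoothing parameter, and per-frame samples are drawn from simple conditional Gaussians given that variable, so consecutive frames stay coherent. Dimensions, mixing weights, and distributions are assumptions for illustration, not the paper's exact evidence model.

```python
# Sketch: utterance-level latent draw shared across frames for smooth samples.
import numpy as np

def sample_utterance(mean_feats, frame_var, n_samples=4, rng=None):
    """mean_feats: (T, D) enhanced features; frame_var: (T, D) SE uncertainty."""
    rng = rng or np.random.default_rng(0)
    T, D = mean_feats.shape
    samples = []
    for _ in range(n_samples):
        # One latent draw per utterance, shared by all frames (the smoothing role)
        z = rng.standard_normal(D)
        # Conditional per-frame pdf given z: mean shifted by a z-scaled term
        eps = rng.standard_normal((T, D))
        samples.append(mean_feats + np.sqrt(frame_var) * (0.7 * z + 0.3 * eps))
    return np.stack(samples)  # (n_samples, T, D)

feats = np.zeros((100, 40))
var = np.full((100, 40), 0.1)
print(sample_utterance(feats, var).shape)  # (4, 100, 40)
```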
Citations: 1
Isochronous is beautiful? Syllabic event detection in a neuro-inspired oscillatory model is facilitated by isochrony in speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10426
Mamady Nabe, J. Diard, J. Schwartz
Oscillation-based neuro-computational models of speech perception are grounded in the capacity of human brain oscillations to track the speech signal. Consequently, one would expect this tracking to be more efficient for more regular signals. In this paper, we address the question of the contribution of isochrony to event detection by neuro-computational models of speech perception. We consider a simple model of event detection proposed in the literature, based on oscillatory processes driven by the acoustic envelope, that was previously shown to efficiently detect syllabic events in various languages. We first evaluate its performance in the detection of syllabic events for French, and show that “perceptual centers” associated with vowel onsets are detected more robustly than syllable onsets. Then we show that isochrony in natural speech improves the performance of event detection in the oscillatory model. We also evaluate the model’s robustness to acoustic noise. Overall, these results show the importance of a bottom-up resonance mechanism for event detection; however, they suggest that bottom-up processing of the acoustic envelope is not able to perfectly detect the events relevant to speech temporal segmentation, highlighting the potential and complementary role of top-down, predictive knowledge.
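A much-simplified stand-in for the envelope-driven event detector discussed above: extract the wideband amplitude envelope, band-pass it in the syllabic (theta) range, and mark envelope peaks as candidate syllabic events. The paper's neuro-inspired oscillator is more elaborate; the filter choices and minimum-distance threshold here are assumptions.

```python
# Sketch: envelope-based syllabic event (P-center-like) detection.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, find_peaks

def syllabic_events(signal, fs, band=(2.0, 8.0)):
    envelope = np.abs(hilbert(signal))
    b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    slow_env = filtfilt(b, a, envelope)
    # Peaks of the theta-band envelope as candidate syllabic events
    peaks, _ = find_peaks(slow_env, distance=int(0.1 * fs))
    return peaks / fs  # event times in seconds

# Toy usage: a 150 Hz tone amplitude-modulated at a 4 Hz "syllable" rate
fs = 16000
t = np.arange(0, 2.0, 1 / fs)
toy = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
print(syllabic_events(toy, fs)[:5])
```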
Citations: 1
Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-854
Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai
{"title":"Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese","authors":"Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai","doi":"10.21437/interspeech.2022-854","DOIUrl":"https://doi.org/10.21437/interspeech.2022-854","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3078-3082"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45598645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
The Prosody of Cheering in Sport Events
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10982
Marzena Żygis, Sarah Wesołek, Nina Hosseini-Kivanani, M. Krifka
Motivational speaking usually conveys a highly emotional message, and its purpose is to invite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for single addressees in sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer on an individual marathon runner in a sporting event represented by video by producing his or her name (1-5 syllables long). For comparison, the participants also produced the same names in isolation and in carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners’ names by dividing them into syllables, others pronounced the names as quickly as possible, putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not intensity that contributed most to the differences between the speaking styles (cheering vs. neutral), at least with the methods we used. Rather, participants employed a higher fundamental frequency and longer durations when cheering for marathon runners.
Citations: 0
Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-197
Vera Bernhard, Sandra Schwab, J. Goldman
We propose a system for automatic lexical stress detection in isolated English words. It is designed to be part of the computer-assisted pronunciation training application MIAPARLE (“https://miaparle.unige.ch”) that specifically focuses on stress contrasts acquisition. Training lexical stress cannot be disregarded in language education as the accuracy in production highly affects the intelligibility and perceived fluency of an L2 speaker. The pipeline automatically segments audio input into syllables over which duration, intensity, pitch, and spectral information is calculated. Since the stress of a syllable is defined relative to its neighboring syllables, the values obtained over the syllables are complemented with differential values to the preceding and following syllables. The resulting feature vectors, retrieved from 1011 recordings of single words spoken by English natives, are used to train a Voting Classifier composed of four supervised classifiers, namely a Support Vector Machine, a Neural Net, a K Nearest Neighbor, and a Random Forest classifier. The approach determines syllables of a single word as stressed or unstressed with an F1 score of 94% and an accuracy of 96%.
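A hedged sketch of the classification stage described above: per-syllable prosodic features (with differences to neighboring syllables) fed to a soft voting ensemble over SVM, neural net, k-NN, and random forest classifiers. The syllable segmentation, feature extraction, and the 1011-recording dataset are not reproduced; X and y below are placeholder arrays.

```python
# Sketch: voting ensemble for stressed/unstressed syllable classification.
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: rows = syllables, columns = duration/intensity/pitch/spectral
# features plus differences to the previous and next syllable; labels 1 = stressed.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 12))
y = rng.integers(0, 2, 300)

voter = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=500))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("rf", RandomForestClassifier()),
    ],
    voting="soft",
)
voter.fit(X, y)
print(voter.predict(X[:5]))
```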
Citations: 0
Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10315
Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang
In terms of subjective evaluations, speech quality has generally been described by a mean opinion score (MOS). In recent years, non-intrusive speech quality assessment has shown active progress by leveraging deep learning techniques. In this paper, we propose a new multi-task learning based model, termed the subband adaptive attention temporal convolutional neural network (SAA-TCN), to perform non-intrusive speech quality assessment with the help of a MOS value interval detector (VID) auxiliary task. Instead of using the fullband magnitude spectrogram, the proposed model takes subband magnitude spectrograms as the input to reduce model parameters and prevent overfitting. To effectively utilize the energy distribution information along the subband frequency dimension, subband adaptive attention (SAA) is employed to enhance the TCN model. Experimental results reveal that the proposed method achieves superior performance in predicting MOS values. In the ConferencingSpeech 2022 Challenge, our method achieves a mean Pearson’s correlation coefficient (PCC) score of 0.763 and outperforms the challenge baseline method by 0.233.
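A hedged sketch of the subband adaptive attention idea: split the magnitude spectrogram into subbands, derive per-subband weights from their energy distribution with a small learned gate, and rescale the subbands before the TCN. The layer sizes and gating form are assumptions, not the paper's exact SAA block.

```python
# Sketch: energy-driven per-subband attention over a folded magnitude spectrogram.
import torch
import torch.nn as nn

class SubbandAttention(nn.Module):
    def __init__(self, n_subbands):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(n_subbands, n_subbands), nn.Sigmoid())

    def forward(self, spec):                      # spec: (batch, n_subbands, bins_per_band, frames)
        energy = spec.pow(2).mean(dim=(2, 3))     # (batch, n_subbands) energy per subband
        weights = self.gate(energy)               # adaptive attention weights in (0, 1)
        return spec * weights[:, :, None, None]   # rescaled subband spectrogram

# Toy usage: a 256-bin magnitude spectrogram folded into 8 subbands of 32 bins
spec = torch.rand(2, 8, 32, 100)
print(SubbandAttention(8)(spec).shape)  # torch.Size([2, 8, 32, 100])
```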
Citations: 1
On Breathing Pattern Information in Synthetic Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10271
Z. Mostaani, M. Magimai.-Doss
The respiratory system is an integral part of human speech production. As a consequence, there is a close relation between respiration and the speech signal, and the produced speech signal carries breathing-pattern-related information. Speech can also be generated using speech synthesis systems. In this paper, we investigate whether synthetic speech carries breathing-pattern-related information in the same way as natural human speech. We address this research question in the framework of logical-access presentation attack detection, using embeddings extracted from neural networks pre-trained for speech breathing pattern estimation. Our studies on ASVspoof 2019 challenge data show that there is a clear distinction between the extracted breathing pattern embeddings of natural human speech and synthesized speech, indicating that speech synthesis systems tend not to carry breathing-pattern-related information in the same way as human speech. In contrast, this is not the case for voice conversion of natural human speech.
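A hedged sketch of the detection framework described above: utterance-level embeddings from a pre-trained breathing-pattern estimator feed a simple classifier that separates bona fide from synthesized speech. The embedding network is not implemented here; `embed()` is a hypothetical stand-in and the data are random placeholders, not ASVspoof 2019 features.

```python
# Sketch: presentation attack detection on top of breathing-pattern embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(utterance_waveform):
    """Placeholder for the pre-trained breathing-pattern embedding network."""
    rng = np.random.default_rng(abs(hash(utterance_waveform.tobytes())) % (2**32))
    return rng.standard_normal(64)

# Placeholder corpus: 1 = bona fide speech, 0 = synthesized speech
rng = np.random.default_rng(0)
waveforms = [rng.standard_normal(16000) for _ in range(40)]
labels = rng.integers(0, 2, 40)

X = np.stack([embed(w) for w in waveforms])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))
```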
Citations: 2