
Interspeech: Latest Publications

Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-100
T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano
Some speakers have penetrating voices that pop out and can be heard clearly, even in loud noise or from a long distance. This study investigated the voice quality of penetrating voices using factor analysis. Eleven participants scored how much the voices of 124 speakers popped out from babble noise. Taking this score as an index of penetration, ten high-scored and ten low-scored speakers were selected for a rating experiment with a semantic differential method. Forty undergraduate students rated a Japanese sentence produced by these speakers using 14 bipolar 7-point scales concerning voice quality. A factor analysis was conducted using the data of 13 scales (i.e., excluding the scale of penetrating from the 14 scales). Three main factors were obtained: (1) powerful and metallic, (2) feminine, and (3) esthetic. The first factor (powerful and metallic) correlated highly with the ratings of penetrating. These results suggest that penetrating voices have multi-dimensional voice quality and that the characteristics of penetrating voices are related to the powerful and metallic aspects of voices.
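A rough sketch of the rating-then-factor-analysis step described above is given below. It is an illustrative reconstruction only: the use of scikit-learn, the varimax rotation, and the simulated ratings are assumptions, not the authors' code or data.

```python
# Illustrative sketch (not the authors' code): factor analysis of
# semantic-differential ratings, with simulated data in place of real listeners.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

n_raters, n_speakers, n_scales = 40, 20, 13   # 13 scales after excluding "penetrating"
# Simulated ratings on 7-point bipolar scales (1..7).
ratings = rng.integers(1, 8, size=(n_raters, n_speakers, n_scales)).astype(float)

# Average across raters so each speaker is described by a 13-dimensional profile.
speaker_profiles = ratings.mean(axis=0)              # shape: (n_speakers, n_scales)

# Three-factor model with varimax rotation (rotation support assumes scikit-learn >= 0.24).
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
factor_scores = fa.fit_transform(speaker_profiles)   # per-speaker factor scores

print("loadings shape:", fa.components_.shape)       # (3 factors, 13 scales)
print("scores shape:", factor_scores.shape)          # (20 speakers, 3 factors)

# Correlating the factor scores with the separately collected "penetrating"
# ratings would then show which factor tracks penetration.
```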
Interspeech, 2022, pp. 3063-3067
Citations: 0
CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11275
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, H. Meng
Interspeech, 2022, pp. 5533-5537
Citations: 4
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
Pub Date : 2022-09-18 DOI: 10.21437/Interspeech.2022-11412
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Q. Hong
This paper describes a spatial-aware speaker diarization system for multi-channel multi-party meetings. The diarization system obtains the direction information of each speaker from a microphone array. The speaker spatial embedding is generated from the x-vector and the s-vector derived from superdirective beamforming (SDB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named the discriminative multi-stream neural network (DMSNet), which consists of an attention superdirective beamforming (ASDB) block and a Conformer encoder. The proposed ASDB is a self-adapted channel-wise block that extracts the latent spatial features of array audio by modeling the interdependencies between channels. We explore DMSNet to address the overlapped speech problem on multi-channel audio and achieve 93.53% accuracy on the evaluation set. By adding a DMSNet-based overlapped speech detection (OSD) module, the diarization error rate (DER) of the cluster-based diarization system decreases significantly from 13.45% to 7.64%.
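The channel-wise attention idea behind the ASDB block can be pictured with the minimal PyTorch sketch below; the pooling strategy, layer sizes, and names are assumptions, and this is not the paper's implementation.

```python
# Minimal sketch of a channel-wise attention block over multi-channel audio
# features (an assumption-laden stand-in for ASDB, not the paper's architecture).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, n_channels: int, n_feats: int, hidden: int = 64):
        super().__init__()
        # Scores each microphone channel from its time-averaged feature vector.
        self.score = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, feats)
        pooled = x.mean(dim=2)                               # (batch, channels, feats)
        weights = torch.softmax(self.score(pooled), dim=1)   # (batch, channels, 1)
        return (x * weights.unsqueeze(2)).sum(dim=1)         # (batch, time, feats)

fused = ChannelAttention(n_channels=8, n_feats=40)(torch.randn(2, 8, 100, 40))
print(fused.shape)  # torch.Size([2, 100, 40])
```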
Interspeech, 2022, pp. 1491-1495
Citations: 6
Online Learning of Open-set Speaker Identification by Active User-registration
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-25
Eunkyung Yoo, H. Song, Taehyeong Kim, Chul Lee
Registering each user's identity for voice assistants is burdensome and complex in multi-user environments such as a household scenario. This is particularly true when the registration needs to happen on-the-fly with relatively minimal effort. Most prior works on speaker identification (SID) do not seamlessly allow the addition of new speakers, as they do not support online updates. To deal with this limitation, we introduce a novel online learning approach to open-set SID that can actively register unknown users in the household setting. Based on MPART (Message Passing Adaptive Resonance Theory), our method performs online active semi-supervised learning for open-set SID by using speaker embedding vectors to infer new speakers and request the user's identity. Our method progressively improves the overall SID performance without forgetting, making it attractive for many interactive real-world applications. We evaluate our model in the online learning setting of an open-set SID task where new speakers are added on-the-fly, demonstrating its superior performance.
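The on-the-fly registration loop can be illustrated with the generic sketch below, which matches each incoming speaker embedding against running speaker centroids by cosine similarity; this is a simplified stand-in for MPART, and the threshold and the embedding extractor are assumptions.

```python
# Simplified sketch of on-the-fly open-set speaker registration using cosine
# similarity to running speaker centroids. Generic stand-in, not MPART.
import numpy as np

class OnlineSpeakerRegistry:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold          # assumed similarity threshold
        self.centroids: dict[str, np.ndarray] = {}
        self.counts: dict[str, int] = {}

    def identify_or_register(self, emb: np.ndarray) -> str:
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            names = list(self.centroids)
            sims = [float(emb @ self.centroids[n]) for n in names]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                name = names[best]
                # Running update of the matched centroid (online learning step).
                c = self.counts[name]
                new = (self.centroids[name] * c + emb) / (c + 1)
                self.centroids[name] = new / np.linalg.norm(new)
                self.counts[name] = c + 1
                return name
        # No sufficiently similar speaker: actively register a new identity
        # (in a real assistant, this is where the user would be asked who they are).
        name = f"speaker_{len(self.centroids)}"
        self.centroids[name] = emb
        self.counts[name] = 1
        return name

registry = OnlineSpeakerRegistry()
print(registry.identify_or_register(np.random.randn(192)))  # -> "speaker_0"
```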
Interspeech, 2022, pp. 5065-5069
Citations: 1
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10466
Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu
Speaker diarization in real-world acoustic environments is a challenging task of increasing interest to both academia and industry. Although it has been widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Both audio and video are released. In particular, MSDWild contains a large portion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.
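Baselines on such a dataset are typically scored with the diarization error rate; a minimal scoring sketch using pyannote.metrics is shown below (the tooling choice is an assumption, and the segments are toy values rather than MSDWild annotations).

```python
# Minimal DER scoring sketch with pyannote.metrics (tooling is an assumption;
# the segments below are toy values, not MSDWild annotations).
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(10.0, 20.0)] = "spk_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 12.0)] = "spk_1"
hypothesis[Segment(12.0, 20.0)] = "spk_2"

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.2%}")
```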
Interspeech, 2022, pp. 1476-1480
Citations: 6
Zero-Shot Foreign Accent Conversion without a Native Reference
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10664
Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna
Previous approaches to foreign accent conversion (FAC) either need a reference utterance from a native (L1) speaker during synthesis, or are dedicated one-to-one systems that must be trained separately for each non-native (L2) speaker. To address both issues, we propose a new FAC system that can transform L2 speech directly from previously unseen speakers. The system consists of two independent modules: a translator and a synthesizer, which operate on bottleneck features derived from phonetic posteriorgrams. The translator is trained to map bottleneck features in L2 utterances into those from a parallel L1 utterance. The synthesizer is a many-to-many system that maps input bottleneck features into the corresponding Mel-spectrograms, conditioned on an embedding of the L2 speaker. During inference, both modules operate in sequence to take an unseen L2 utterance and generate a native-accented Mel-spectrogram. Perceptual experiments show that our system achieves a large reduction (67%) in non-native accentedness compared to a state-of-the-art reference-free system (28.9%) that builds a dedicated model for each L2 speaker. Moreover, 80% of the listeners rated the synthesized utterances as having the same voice identity as the L2 speaker.
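The two-module pipeline (a translator on bottleneck features followed by a speaker-conditioned synthesizer producing Mel-spectrograms) can be sketched schematically as follows; the layer types and dimensions are assumptions, not the paper's architecture.

```python
# Schematic sketch of the translator + synthesizer pipeline (dimensions and
# layers are assumptions; the real system is only described at a high level).
import torch
import torch.nn as nn

BNF_DIM, SPK_DIM, MEL_DIM = 256, 192, 80

class Translator(nn.Module):
    """Maps L2 bottleneck features toward native-like bottleneck features."""
    def __init__(self):
        super().__init__()
        self.net = nn.GRU(BNF_DIM, BNF_DIM, num_layers=2, batch_first=True)

    def forward(self, bnf_l2):                     # (batch, time, BNF_DIM)
        out, _ = self.net(bnf_l2)
        return out

class Synthesizer(nn.Module):
    """Maps bottleneck features to a Mel-spectrogram, conditioned on a speaker embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.GRU(BNF_DIM + SPK_DIM, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, MEL_DIM)

    def forward(self, bnf, spk_emb):               # spk_emb: (batch, SPK_DIM)
        cond = spk_emb.unsqueeze(1).expand(-1, bnf.size(1), -1)
        out, _ = self.net(torch.cat([bnf, cond], dim=-1))
        return self.proj(out)                       # (batch, time, MEL_DIM)

bnf = torch.randn(1, 120, BNF_DIM)                  # bottleneck features of an unseen L2 utterance
spk = torch.randn(1, SPK_DIM)                       # embedding of the L2 speaker
mel = Synthesizer()(Translator()(bnf), spk)
print(mel.shape)                                    # torch.Size([1, 120, 80])
```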
Interspeech, 2022, pp. 4920-4924
Citations: 5
Effects of laryngeal manipulations on voice gender perception
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10815
Zhaoyan Zhang, Jason Zhang, J. Kreiman
This study aims to identify laryngeal manipulations that would allow a male to approximate a female-sounding voice and that can be targeted in voice feminization surgery or therapy. Synthetic voices were generated using a three-dimensional vocal fold model with parametric variations in vocal fold geometry, stiffness, adduction, and subglottal pressure. The vocal tract was kept constant in order to focus on the contribution of laryngeal manipulations. Listeners were asked to judge whether a voice sounded male or female, or whether they were unsure. Results showed the expected large effect of fundamental frequency (F0) and a moderate effect of spectral shape on gender perception. A mismatch between F0 and spectral shape cues (e.g., low F0 paired with high H1-H2) contributed to ambiguity in gender perception, particularly for voices with F0 in the intermediate range between those of typical adult males and females. Physiologically, the results showed that a female-sounding voice can be produced by decreasing vocal fold thickness and increasing vocal fold transverse stiffness in the coronal plane, changes which modified both F0 and spectral shape. In contrast, laryngeal manipulations with limited impact on F0 or spectral shape were less effective in modifying gender perception.
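One way to quantify such cue effects is a logistic regression of the gender judgment on F0 and a spectral-shape measure such as H1-H2; the sketch below is purely illustrative, using synthetic data and scikit-learn rather than the study's stimuli or analysis method.

```python
# Illustrative only: logistic regression of perceived gender on F0 and H1-H2
# using synthetic data (not the study's stimuli or analysis).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
f0 = rng.uniform(100, 250, n)          # Hz, spanning a typical male-to-female range
h1_h2 = rng.uniform(-2, 10, n)         # dB, a crude spectral-shape proxy
# Simulated "female" responses driven mostly by F0, with a smaller H1-H2 effect.
logit = 0.08 * (f0 - 165) + 0.15 * (h1_h2 - 4)
female = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression().fit(np.column_stack([f0, h1_h2]), female)
print("coefficients (F0, H1-H2):", model.coef_[0])
```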
Interspeech, 2022, pp. 1856-1860
Citations: 0
Incremental learning for RNN-Transducer based speech recognition models
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10795
Deepak Baby, Pasquale D’Alterio, Valentin Mendelev
This paper investigates an incremental learning framework for a real-world voice assistant employing an RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with the changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model, spending only several hours of training time and without any degradation on old data. This paper explores multiple rounds of incremental updates of the ASR model with monthly training data. Results show that the proposed approach achieves a 5-6% relative WER improvement over models trained from scratch on the monthly evaluation datasets. In addition, we explore whether it is possible to improve the recognition of specific new words. We simulate multiple rounds of incremental updates with a handful of training utterances per word (both real and synthetic) and show that the recognition of the new words improves dramatically, but with a minor degradation on general data. Finally, we demonstrate that the observed degradation on general data can be mitigated by interleaving monthly updates with updates targeting specific words.
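The fine-tuning recipe (mix old and new data, then update the existing model for a few hours) can be sketched generically as follows; the 1:1 data mix, the optimizer, and the `transducer_loss` model interface are placeholder assumptions, not the production setup.

```python
# Generic sketch of incremental fine-tuning on a mix of old and new data.
# The model/loss interface and the mixing strategy are placeholder assumptions.
import random
import torch

def fine_tune_incrementally(asr_model, old_data, new_data, steps=1000, lr=1e-5):
    """Fine-tune an already trained model on a pool of old + new utterances."""
    optimizer = torch.optim.Adam(asr_model.parameters(), lr=lr)
    mixed = list(old_data) + list(new_data)     # keep old data to avoid forgetting
    for step in range(steps):
        batch = random.sample(mixed, k=8)
        loss = asr_model.transducer_loss(batch)  # hypothetical RNN-T loss interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return asr_model
```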
Interspeech, 2022, pp. 71-75
Citations: 5
Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-402
Parvaneh Janbakhshi, I. Kodrasi
Speech representations which are robust to pathology-unrelated cues such as speaker identity information have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique for learning speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker identity-invariant representations by exploiting a feature separation framework relying on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, it is shown that such representations result in a dysarthric speech classification performance similar to that of representations obtained using adversarial training, while the training procedure is more stable and less sensitive to training parameters.
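A minimal sketch of mutual-information minimization for feature separation is given below, using a MINE-style critic as a stand-in estimator; the paper's exact estimator, encoders, and loss weighting are not reproduced, and all sizes are assumptions.

```python
# Sketch of feature separation with mutual-information minimization, using a
# MINE-style critic as a stand-in estimator (not the paper's implementation).
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores (pathology, speaker) embedding pairs for a MINE-style MI estimate."""
    def __init__(self, d1=128, d2=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d1 + d2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=-1))

def mi_lower_bound(critic, a, b):
    joint = critic(a, b).mean()
    # Shuffle b to approximate samples from the product of marginals.
    marg = torch.exp(critic(a, b[torch.randperm(b.size(0))])).mean()
    return joint - torch.log(marg + 1e-8)

critic = Critic()
path_emb = torch.randn(32, 128)    # pathology-related branch output (assumed size)
spk_emb = torch.randn(32, 128)     # speaker-related branch output (assumed size)

# Training would alternate: maximize the bound w.r.t. the critic, then add
# lambda * mi_lower_bound(...) to the classification loss so the encoders are
# penalized for sharing speaker information.
print(float(mi_lower_bound(critic, path_emb, spk_emb)))
```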
Interspeech, 2022, pp. 2138-2142
Citations: 0
Knowledge distillation for In-memory keyword spotting model
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-633
Zeyang Song, Qi Liu, Qu Yang, Haizhou Li
We study a light-weight implementation of keyword spotting (KWS) for voice command and control that can be implemented on an in-memory computing (IMC) unit with the same accuracy as state-of-the-art methods at a lower computational cost. KWS is expected to be always-on for mobile devices with limited resources, and IMC represents one of the solutions. However, IMC only supports multiplication-accumulation and Boolean operations. We note that common feature extraction methods, such as MFCC and SincConv, are not supported by IMC because they depend on expensive logarithm computation. On the other hand, some neural network solutions to KWS involve a large number of parameters that are not feasible for mobile devices. In this work, we propose a knowledge distillation technique to replace a complex speech front-end such as MFCC or SincConv with a light-weight encoder without performance loss. Experiments show that the proposed model outperforms KWS models with MFCC and SincConv front-ends in terms of accuracy and computational cost.
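The front-end distillation idea can be sketched as follows: an MFCC teacher supervises a light-weight convolutional student operating on raw waveforms. The layer sizes, the MSE objective, and the hyper-parameters are assumptions, not the paper's exact setup.

```python
# Sketch of front-end distillation: a light-weight convolutional student is
# trained to mimic an MFCC teacher front-end (all design choices are assumptions).
import torch
import torch.nn as nn
import torchaudio

N_MFCC = 40
teacher = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=N_MFCC)

# Student: IMC-friendly (multiply-accumulate only) 1-D conv encoder on raw waveform.
student = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=400, stride=160, padding=200),  # ~25 ms window, 10 ms hop
    nn.ReLU(),
    nn.Conv1d(64, N_MFCC, kernel_size=3, padding=1),
)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
wave = torch.randn(8, 1, 16000)                  # a batch of 1-second dummy waveforms

with torch.no_grad():
    target = teacher(wave.squeeze(1))            # (batch, n_mfcc, frames)
pred = student(wave)                             # (batch, n_mfcc, frames')

# Frame counts of the two front-ends may differ slightly; trim to the shorter one.
frames = min(target.size(-1), pred.size(-1))
loss = nn.functional.mse_loss(pred[..., :frames], target[..., :frames])
loss.backward()
optimizer.step()
print(float(loss))
```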
Interspeech, 2022, pp. 4128-4132
Citations: 1