
The 9th International Symposium on Chinese Spoken Language Processing: Latest Publications

The modeling of tongue tip in Standard Chinese using MRI
Pub Date : 2017-02-21 DOI: 10.16511/J.CNKI.QHDXXB.2017.22.008
Wang Gaowu, Dang Jianwu, Kong Jiangping
Summary form only given. In this paper, the tongue tip was modeled based on articulatory data from MRI images of Standard Chinese. First, an MRI articulatory database of Standard Chinese, covering 9 vowels and 75 consonant variants, was established. Second, Principal Component Analysis (PCA) was performed on the tongue shapes to find articulatory factors; the results showed that the model becomes more precise and concise when the tongue is divided into the tongue tip and the tongue body and the two are modeled separately. Finally, based on this result, the tongue tip was modeled with two articulatory parameters, Tongue Tip Protrude and Tongue Tip Raise, which represent the protruding/advancing and raising/retroflexing movements of the tongue tip.
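As a rough sketch of the PCA step (not the authors' code; the contour sampling, the tip/body split point, and the component counts are all assumptions made for illustration), the split-and-model-separately idea might look like this in Python:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 84 MRI frames (9 vowels + 75 consonant variants),
# each tongue contour sampled at 30 (x, y) points and flattened to 60 values.
rng = np.random.default_rng(0)
contours = rng.normal(size=(84, 60))  # stand-in for real traced contours

# Fit PCA separately on the tongue-tip and tongue-body point ranges,
# mirroring the paper's finding that split models are more concise.
tip_cols = slice(0, 20)    # assumed: first 10 (x, y) points belong to the tip
body_cols = slice(20, 60)  # assumed: remaining 20 points are the body

tip_pca = PCA(n_components=2).fit(contours[:, tip_cols])
body_pca = PCA(n_components=3).fit(contours[:, body_cols])

print("tip variance explained:", tip_pca.explained_variance_ratio_.sum())
print("body variance explained:", body_pca.explained_variance_ratio_.sum())
```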
Citations: 1
A multi-channel/multi-speaker articulatory database in Mandarin for speech visualization
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936629
Dan Zhang, Xianqian Liu, N. Yan, Lan Wang, Yun Zhu, Hui Chen
The application of articulatory databases in speech production research and automatic speech recognition has been practiced for many years. The goal of this research was to build an articulatory database dedicated to Mandarin Chinese production and to investigate its efficacy in speech animation. A Carstens EMA AG501 device was used to capture acoustic and articulatory data, and a Microsoft Kinect camera was applied to capture face-tracking data as a supplement. Finally, we tried several methods to extract acoustic parameters and built a 3D talking-head model to verify the efficacy of the database.
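The abstract does not say which acoustic parameters were extracted; as one plausible, purely illustrative choice (an assumption, not the authors' pipeline), MFCCs could be computed from a database recording with librosa:

```python
import librosa

# Load one utterance from the database (the path is hypothetical).
audio, sr = librosa.load("mandarin_utterance_001.wav", sr=16000)

# 13 MFCCs per 10 ms frame, a common parameterization for
# driving a talking-head model from acoustics.
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    hop_length=int(0.010 * sr),   # 10 ms frame shift
    n_fft=int(0.025 * sr),        # 25 ms analysis window
)
print(mfcc.shape)  # (13, n_frames)
```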
Citations: 15
Labeling unsegmented sequence data with DNN-HMM and its application for speech recognition
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936622
Xiangang Li, Xihong Wu
Recently, the deep neural network (DNN) combined with the hidden Markov model (HMM) has turned out to be a superior sequence learning framework, on which significant improvements have been achieved in many application tasks, such as automatic speech recognition (ASR). However, training a DNN-HMM requires pre-segmented training data, which in ASR tasks can be generated using a Gaussian mixture model (GMM). Thus, many researchers have raised the questions: can we train the DNN-HMM without GMM seeding, and what does it imply if the answer is yes? In this research, we arrive at the `yes' answer by presenting a forward-backward learning algorithm for the DNN-HMM framework. In addition, a training procedure is proposed in which training the context-independent (CI) DNN-HMM is treated as pre-training for the context-dependent (CD) DNN-HMM. To evaluate the contribution of this work, experiments on an ASR task with the benchmark corpus TIMIT were performed, and the results demonstrate the effectiveness of the approach.
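Forward-backward training over DNN outputs rests on the standard HMM forward recursion; below is a minimal, illustrative sketch of that recursion in log space (assuming the DNN state posteriors have already been converted to scaled log-likelihoods, a standard hybrid-system step; this is not the paper's implementation):

```python
import numpy as np

def forward_log_prob(log_obs, log_trans, log_init):
    """Log-probability of an observation sequence under an HMM.

    log_obs:   (T, S) frame-level log-likelihoods, e.g. DNN posteriors
               divided by state priors (the usual hybrid-system scaling).
    log_trans: (S, S) log transition matrix.
    log_init:  (S,)   log initial-state distribution.
    """
    alpha = log_init + log_obs[0]
    for t in range(1, len(log_obs)):
        # logsumexp over previous states for each current state
        alpha = log_obs[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0
        )
    return np.logaddexp.reduce(alpha)

# Toy example: 3 states, 5 frames of random scaled log-likelihoods.
rng = np.random.default_rng(1)
T, S = 5, 3
log_obs = np.log(rng.dirichlet(np.ones(S), size=T))
log_trans = np.log(np.full((S, S), 1.0 / S))
log_init = np.log(np.full(S, 1.0 / S))
print(forward_log_prob(log_obs, log_trans, log_init))
```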
Citations: 11
Improving training time of deep neural network with asynchronous averaged stochastic gradient descent
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936596
Zhao You, Bo Xu
Deep neural network acoustic models have shown large improvements in performance over Gaussian mixture models (GMMs) in recent studies. Typically, stochastic gradient descent (SGD) is the most popular method for training deep neural networks. However, training a DNN with minibatch-based SGD is very slow, because it requires frequent serial updates and many passes over the whole training set before reaching the asymptotic region, making it difficult to scale to large datasets. Training time can generally be reduced in two ways: reducing the number of training epochs, and exploring distributed training algorithms. Several distributed training algorithms, such as L-BFGS, Hessian-free optimization and asynchronous SGD, have been shown to significantly reduce training time. To reduce it further, we explored a training algorithm with fast convergence and combined it with distributed training. Averaged stochastic gradient descent (ASGD) has proven simple and effective for one-pass online learning. This paper investigates an asynchronous ASGD algorithm for deep neural network training. We tested asynchronous ASGD on a Mandarin Chinese recorded-speech recognition task using deep neural networks. Experimental results show that the performance of one-pass asynchronous ASGD is very close to that of multi-pass asynchronous SGD, while reducing training time by a factor of 6.3.
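The heart of ASGD is Polyak-Ruppert averaging of the SGD iterates; here is a minimal single-worker sketch on a toy objective (the asynchronous multi-worker machinery and the learning-rate schedule are omitted, and all values are illustrative):

```python
import numpy as np

def asgd_step(w, w_avg, grad, lr, t):
    """One averaged-SGD update (Polyak-Ruppert averaging).

    w     : current parameter vector, updated by plain SGD
    w_avg : running average of all iterates, used at test time
    t     : 1-based step count
    """
    w = w - lr * grad
    w_avg = w_avg + (w - w_avg) / t   # incremental mean of iterates
    return w, w_avg

# Toy quadratic objective f(w) = ||w - target||^2 / 2 with noisy gradients.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
w = np.zeros(2)
w_avg = np.zeros(2)
for t in range(1, 2001):
    grad = (w - target) + 0.5 * rng.normal(size=2)  # noisy gradient
    w, w_avg = asgd_step(w, w_avg, grad, lr=0.05, t=t)

print("last iterate:", w)          # still noisy
print("averaged iterate:", w_avg)  # much closer to the target
```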
Citations: 8
Joint-character-POC N-gram language modeling for Chinese speech recognition
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936588
Bin Wang, Zhijian Ou, Jian Li, A. Kawamura
The state-of-the-art language models (LMs) for Chinese speech recognition are word n-gram models. However, in Chinese, characters are meaningful morphological units and words are not consistently defined. There has been recent interest in building character n-gram LMs and combining them with word n-gram LMs. In this paper, to exploit both character-level and word-level constraints, we propose the joint n-gram LM, an n-gram model over joint states, where each joint state is a pair of a character and its position-of-character (POC) tag. We point out the pitfalls in naively solving the smoothing and scoring problems for joint n-gram models, and provide corrected solutions. For experimental comparison, different LMs (including word 4-grams, character 6-grams and joint 6-grams) are tested for speech recognition, using a training corpus of 1.9 billion characters. The joint n-gram LM achieves performance improvements, especially in recognizing utterances containing OOV words.
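A minimal sketch of forming joint states and collecting their n-gram counts (the BMES position-of-character tag set and the toy corpus are assumptions made for illustration; the paper's tag set may differ):

```python
from collections import Counter

# Assumed POC tag set: B(egin), M(iddle), E(nd), S(ingle-character word).
def to_joint_states(words):
    """Map a word-segmented sentence to (character, POC-tag) joint states."""
    states = []
    for w in words:
        if len(w) == 1:
            states.append((w, "S"))
        else:
            states.append((w[0], "B"))
            states += [(c, "M") for c in w[1:-1]]
            states.append((w[-1], "E"))
    return states

# Toy corpus of two word-segmented sentences.
corpus = [["我", "喜欢", "语音识别"], ["语音", "识别", "很", "有趣"]]

bigrams = Counter()
for sent in corpus:
    states = [("<s>", "S")] + to_joint_states(sent) + [("</s>", "S")]
    bigrams.update(zip(states, states[1:]))

for (prev, cur), n in bigrams.most_common(3):
    print(prev, "->", cur, ":", n)
```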
Citations: 0
Research on truncated speech in speaker verification
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936671
Fanhu Bie, Dong Wang, T. Zheng
Summary form only given. Amplitude truncation of speech is a common problem in practical speaker recognition systems. When speech is truncated in amplitude, its spectrum is distorted, which degrades system performance. This paper describes observations and conclusions on the impact of truncated segments, studies why truncation hurts recognition performance, and presents methods for detecting truncated segments and reducing the performance loss. Simulations on NIST SRE08 show that performance drops sharply only when the amplitude truncation ratio is high (above 80% of the maximum amplitude); the traditional GMM-UBM system and the i-vector system behave similarly when the truncation level is low, while the i-vector system is more robust when it is high. The paper proposes a truncated-segment detection method based on subspace discriminant information, which is then used to discard truncated segments. Experiments show that this method detects truncated segments well. However, the results also show that truncated segments still carry speaker-discriminant information: when the amplitude truncation ratio is low, it is better to keep the data to sustain performance; otherwise, the speaker should make another recording to maintain system performance.
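The paper's detector uses subspace discriminant information; as a much cruder stand-in for intuition only, a frame can be flagged when its peak reaches the truncation level, e.g. 80% of the global maximum (all thresholds below are assumptions):

```python
import numpy as np

def clipped_frame_ratio(signal, frame_len=400, hop=160, level=0.8):
    """Fraction of frames whose peak reaches `level` of the global maximum.

    A simple heuristic for truncation detection; the paper's actual
    detector is based on subspace discriminant information, not this rule.
    """
    peak = np.abs(signal).max()
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    clipped = 0
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        if np.abs(frame).max() >= level * peak:
            clipped += 1
    return clipped / max(n_frames, 1)

# Toy signal: a 220 Hz sine wave hard-clipped at 80% of its amplitude.
t = np.linspace(0, 1, 16000, endpoint=False)
x = np.sin(2 * np.pi * 220 * t)
x_clipped = np.clip(x, -0.8, 0.8)
print(clipped_frame_ratio(x_clipped))
```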
Citations: 0
Phonotactic language recognition based on DNN-HMM acoustic model
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936704
Weiwei Liu, Meng Cai, Hua Yuan, Xiao-Bei Shi, Weiqiang Zhang, Jia Liu
The recently introduced deep neural network (DNN) has achieved unprecedented gains in many challenging automatic speech recognition (ASR) tasks. In this paper, a deep neural network hidden Markov model (DNN-HMM) acoustic model is introduced for phonotactic language recognition, and it outperforms both the artificial neural network hidden Markov model (ANN-HMM) and the Gaussian mixture model hidden Markov model (GMM-HMM) acoustic models. Experimental results confirm that the phonotactic language recognition system using the DNN-HMM acoustic model yields relative equal error rate reductions of 28.42%, 14.06% and 18.70% over the ANN-HMM approach, and 12.55%, 7.20% and 2.47% over the GMM-HMM approach, for the 30 s, 10 s and 3 s conditions respectively, on the National Institute of Standards and Technology language recognition evaluation (NIST LRE) 2009 tasks.
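In a phonotactic system, the acoustic model decodes each utterance into a phone sequence, and language models or classifiers are then built over phone n-gram statistics. A minimal sketch of the n-gram statistics step (the phone inventory and decoded sequences below are invented for illustration):

```python
from collections import Counter
from itertools import product

# Hypothetical phone sequences produced by a DNN-HMM phone recognizer.
utterances = [
    ["sil", "n", "i", "h", "ao", "sil"],
    ["sil", "h", "ao", "n", "i", "sil"],
]

phones = sorted({p for u in utterances for p in u})
bigram_index = {bg: i for i, bg in enumerate(product(phones, repeat=2))}

def phonotactic_vector(seq):
    """Normalized phone-bigram counts, a typical phonotactic feature."""
    counts = Counter(zip(seq, seq[1:]))
    total = sum(counts.values())
    vec = [0.0] * len(bigram_index)
    for bg, c in counts.items():
        vec[bigram_index[bg]] = c / total
    return vec

print(len(phonotactic_vector(utterances[0])))  # |phones|^2 dimensions
```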
Citations: 7
An iterative framework for unsupervised learning in the PLDA based speaker verification
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936726
Wenbo Liu, Zhiding Yu, Ming Li
We present an iterative, unsupervised learning approach for the speaker verification task. In conventional speaker verification, Probabilistic Linear Discriminant Analysis (PLDA) has been widely used as a supervised backend. However, PLDA requires fully labeled training data, which is often difficult to obtain in reality. To automatically retrieve the speaker labels of unlabeled training data, we propose to use Affinity Propagation (AP), a clustering method that takes pairwise data similarity as input, to generate the labels for PLDA modeling. We further propose an iterative refinement strategy that incrementally updates the similarity input of the AP clustering with the previous iteration's PLDA scoring outputs. Moreover, we evaluate the performance of different PLDA scoring methods for the multiple-enrollment task and show that generalized hypothesis testing achieves the best results. Experiments were conducted on the NIST SRE 2010 and the 2014 i-vector challenge databases. The results show that our proposed iterative, unsupervised PLDA model learning approach outperformed the cosine-similarity baseline by 35% relative.
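A skeleton of the iterative loop, using scikit-learn's AffinityPropagation with a precomputed similarity matrix (the stand-in i-vectors are random, and the PLDA rescoring step is left as a placeholder since the paper's PLDA model is not reproduced here):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in "i-vectors": 100 samples of dimension 50 (real systems use
# higher-dimensional i-vectors; this is purely illustrative).
rng = np.random.default_rng(0)
ivectors = rng.normal(size=(100, 50))

similarity = cosine_similarity(ivectors)  # initial pairwise similarity

for iteration in range(3):
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(similarity)

    # Placeholder for the paper's refinement: retrain PLDA on the
    # hypothesized labels, rescore all pairs with PLDA, and feed those
    # scores back in as the next iteration's similarity matrix.
    plda_scores = similarity  # replace with real PLDA pairwise scoring
    similarity = plda_scores

print("estimated number of speakers:", len(set(labels)))
```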
Citations: 8
Influences of vowels on perception of nasal codas in Mandarin for Japanese learners and Chinese
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936691
Zuyan Wang, Jin-Song Zhang
This paper studies the perceptual influence of vowel segments on judgments of alveolar/velar nasal codas by Chinese and Japanese subjects, through two experiments: a) perception of natural syllables; b) perception of synthesized syllables. The results show that: 1) the nasalized vowels play a dominant role in cueing Chinese subjects to judge which coda the nasal is, whereas they have little effect for Japanese subjects, especially for the discrimination between an and ang; 2) when the nasalized vowel portions are missing, the vowel nuclei have similar influences on perception for both Chinese and Japanese subjects. The larger the acoustic differences between the vowel nuclei of an alveolar/velar pair, the easier it is for both Chinese and Japanese listeners to distinguish them correctly. From these results, we suggest that the acoustic differences between vowel portions in alveolar/velar nasal pairs, and sensitivity to the nasalized vowels, should be highlighted when Japanese students learn Chinese as a second language.
Citations: 3
Improving segmental GMM based voice conversion method with target frame selection
Pub Date : 2014-10-27 DOI: 10.1109/ISCSLP.2014.6936633
H. Gu, Sung-Fung Tsai
In this paper, the voice conversion method based on segmental Gaussian mixture models (GMMs) is further improved by adding a target frame selection (TFS) module. Segmental GMMs replace a single GMM with a large number of mixture components by several voice-content-specific GMMs, each consisting of far fewer mixture components. In addition, TFS is used to find, from the target-speaker frame pool corresponding to the segment class to which the input frame belongs, a frame whose spectral features are close to the mapped feature vector. Both ideas are intended to alleviate the problem that converted spectral envelopes are often over-smoothed. To evaluate the two ideas, three voice conversion systems were constructed and used in listening tests. The test results show that the system using the two ideas together obtains much improved voice quality. In addition, the measured variance ratio (VR) values show that the system adopting both ideas also obtains the highest VR value.
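The abstract does not specify the distance used by TFS; a minimal sketch of selecting the nearest target-speaker frame (Euclidean distance and the feature dimensions are assumptions) is:

```python
import numpy as np

def select_target_frame(mapped_vec, frame_pool):
    """Pick the pool frame nearest (Euclidean) to the GMM-mapped vector.

    frame_pool: (N, D) spectral feature vectors of the target speaker,
    restricted to the segment class of the current input frame.
    """
    dists = np.linalg.norm(frame_pool - mapped_vec, axis=1)
    return frame_pool[np.argmin(dists)]

# Toy example: 200 pooled target frames of 24-dim cepstral features.
rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 24))
mapped = rng.normal(size=24)
selected = select_target_frame(mapped, pool)
print(np.linalg.norm(selected - mapped))  # smallest distance in the pool
```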
Citations: 2