
2012 8th International Symposium on Chinese Spoken Language Processing: Latest Papers

Statistical modification based post-filtering technique for HMM-based speech synthesis
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423456
Zhengqi Wen, J. Tao, Hao Che
Speech generated by hidden Markov model (HMM)-based speech synthesis systems (HTS) suffers from an over-smoothing problem caused by statistical modeling. This paper focuses on a post-filtering technique based on statistical modification of the generated speech parameters. The marginal statistics of the parameter trajectories, such as mean, variance, skewness, and kurtosis, are adjusted according to the values generated from the HTS system. This technique is compared with the global variance (GV)-based speech generation algorithm. The listening test showed that post-filtering that considers the mean and variance produces results almost equal to those of the GV model. When skewness and kurtosis are also modified, the quality of the generated speech improves further.
Citations: 0
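The mean and variance adjustment described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `match_mean_variance`, the one-dimensional trajectory, and the target statistics passed in by the caller are assumptions made for illustration, and the skewness and kurtosis steps are left out.

```python
from statistics import fmean, pstdev


def match_mean_variance(trajectory, target_mean, target_std):
    """Rescale a 1-D parameter trajectory to a target mean and standard deviation.

    Hypothetical helper: where the target statistics come from is not
    specified here and is left to the caller.
    """
    mean = fmean(trajectory)
    std = pstdev(trajectory)  # population standard deviation
    if std == 0.0:
        # A constant trajectory has no spread to rescale; only the mean is matched.
        return [target_mean for _ in trajectory]
    # Standardize, then stretch to the target spread and shift to the target mean.
    return [(x - mean) / std * target_std + target_mean for x in trajectory]


smoothed = [1.0, 2.0, 3.0, 4.0]
restored = match_mean_variance(smoothed, target_mean=10.0, target_std=2.0)
print(restored)
```

An over-smoothed trajectory typically has too little variance, so a target standard deviation larger than the generated one widens the trajectory around its mean.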
A study on cepstral sub-band normalization for robust ASR
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423484
Syu-Siang Wang, J. Hung, Yu Tsao
In this paper, we propose a cepstral subband normalization (CSN) approach for robust speech recognition. The CSN approach first applies the discrete wavelet transform (DWT) to decompose the original cepstral feature sequence into low and high frequency band (LFB and HFB) parts. Then, CSN normalizes the LFB components and zeros out the HFB components. Finally, an inverse DWT is applied to the LFB and HFB components to form the normalized cepstral features. When the Haar functions are used as the DWT bases, CSN can be computed efficiently, with a 50% reduction in the number of feature components. In addition, our experimental results on the Aurora-2 task show that CSN outperforms conventional cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), and histogram equalization (HEQ). We also integrate CSN with the advanced front-end (AFE) for feature extraction. Experimental results indicate that the integrated AFE+CSN achieves notable improvements over the original AFE. Its simple calculation, compact form, and effective noise robustness make CSN well suited to mobile applications.
Citations: 7
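The CSN steps in the abstract above (Haar DWT, normalization of the low band, zeroing of the high band, inverse DWT) might look roughly like the sketch below for a single cepstral coefficient tracked over time. The abstract does not say which normalization is applied to the LFB part, so mean and variance normalization is assumed here; the function names and the padding of odd-length sequences are also hypothetical choices, not taken from the paper.

```python
import math
from statistics import fmean, pstdev

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar DWT of an even-length sequence: (low band, high band)."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def csn_sketch(sequence):
    """Normalize the low band, zero the high band, and reconstruct."""
    seq = list(sequence)
    if len(seq) % 2:
        seq.append(seq[-1])  # assumption: pad odd-length input by repeating the last frame
    low, high = haar_dwt(seq)
    mean, std = fmean(low), pstdev(low)
    # Assumed normalization: zero mean and unit variance on the low band.
    low = [(a - mean) / std if std > 0 else 0.0 for a in low]
    high = [0.0] * len(high)  # discard the high band
    return haar_idwt(low, high)[: len(sequence)]


print(csn_sketch([1.0, 3.0, 2.0, 6.0, 4.0, 4.0]))
```

Because the high band is zeroed, each pair of adjacent output frames carries the same value, which is one way to read the 50% reduction in feature components mentioned in the abstract.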
Speaker-ensemble hidden Markov modeling for automatic speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423532
Guoli Ye, B. Mak
This paper proposes a new hidden Markov model (HMM), which we call the speaker-ensemble HMM (SE-HMM). An SE-HMM is a multi-path HMM in which each path is an HMM constructed from the training data of a different speaker. SE-HMM may be considered a form of template-based acoustic model in which speaker-specific acoustic templates are compressed statistically into speaker-specific HMMs. However, one has the flexibility of building SE-HMM at various levels of compression: SE-HMM may be built for a triphone state, a triphone, a whole utterance, or other convenient phonetic units. As a result, SE-HMM contains more detail than a conventional HMM but is much smaller than common template-based acoustic models. Furthermore, the construction of SE-HMM is simple, and since it is still an HMM, its construction and computation are well supported by common HMM toolkits such as HTK. The proposed SE-HMM was evaluated on the Resource Management and Wall Street Journal tasks, and it consistently gives better word recognition results than the conventional HMM.
Citations: 0
A cross-dialect comparison of vowel dispersion and vowel variability
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423458
Wai-Sum Lee
This study is a cross-dialect comparison of vowel systems with different inventory sizes across five Chinese dialects in terms of vowel dispersion and vowel variability. The dialects include Meixian Kejia (Hakka) with 5 vowels, Hong Kong Cantonese with 7 vowels, Fuzhou with 8 vowels, Ningbo with 10 vowels, and Wenling with 11 vowels. Formant frequencies were obtained through spectral analysis of speech data from 10 male and 10 female speakers of each dialect. The findings do not support the vowel dispersion theory, which predicts that (i) the larger the vowel inventory is, the more expanded the vowel space will be in the F1F2 plane, and (ii) variability in vowel formants is inversely related to vowel inventory size.
Citations: 4
Controlling the tradeoff property in a regularization framework for noise reduction
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423500
Xugang Lu, M. Unoki, Shigeki Matsuda, Chiori Hori, H. Kashioka
The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. We have proposed a regularization framework for noise reduction that takes this tradeoff into account. We regard speech estimation as a functional approximation problem in a reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to find an approximation function that gives a good tradeoff between the approximation accuracy and the complexity of the function. By using a regularization method, the approximation function can be estimated from noisy observations. In this paper, we further provide a theoretical analysis of the tradeoff property of the framework in noise reduction. We applied the framework to speech enhancement experiments in real applications. Compared with several classical noise reduction methods, the proposed framework showed promising advantages.
Citations: 0
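The accuracy versus complexity tradeoff in the abstract above can be illustrated with kernel ridge regression, a textbook form of regularized approximation in an RKHS. This is a generic stand-in and not the objective function of the paper: the Gaussian kernel, the parameter names `lam` and `width`, and all helper functions are assumptions made for this sketch.

```python
import math


def gaussian_kernel(a, b, width=1.0):
    """Gaussian (RBF) kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_kernel_ridge(times, noisy, lam, width=1.0):
    """Return coefficients alpha solving (K + lam * I) alpha = noisy."""
    n = len(times)
    gram = [
        [gaussian_kernel(times[i], times[j], width) + (lam if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(gram, noisy)


def predict(times, alphas, t, width=1.0):
    """Evaluate the fitted function f(t) = sum_i alpha_i * k(times_i, t)."""
    return sum(a * gaussian_kernel(ti, t, width) for ti, a in zip(times, alphas))


times = [0.0, 1.0, 2.0, 3.0]
noisy = [0.0, 1.0, 0.0, 1.0]
for lam in (1e-3, 10.0):
    alphas = fit_kernel_ridge(times, noisy, lam)
    print(lam, [round(predict(times, alphas, t), 3) for t in times])
```

In this sketch a small `lam` lets the fitted function follow the noisy observations closely, while a large `lam` favors a smoother, lower-norm function at the cost of approximation accuracy.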
Acoustic space partition based on broad phonetic class for ensemble acoustic modeling
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423501
Xugang Lu, Yu Tsao, Shigeki Matsuda, Chiori Hori, H. Kashioka
Ensemble acoustic modeling can be used to model the different factors that cause variability in the acoustic space, and to provide different combinations that improve the performance of automatic speech recognition (ASR). One of the main concerns is how to partition the training data set into several subsets on which the ensemble models are trained. In this study, we focus on ensemble acoustic modeling of the acoustic variability caused by gender and accent for Chinese large vocabulary continuous speech recognition (LVCSR). Considering that gender and accent information may be encoded in local acoustic realizations of a few specific phonetic classes rather than in a global acoustic distribution, we propose an acoustic space partition method based on broad phonetic class (BPC) modeling of speakers for ensemble acoustic modeling. Using principal component analysis (PCA) of the BPC-based speaker representation, we designed two-level hierarchical data partitions in the low-dimensional speaker factor space, concerned with gender and accent information. Ensemble acoustic models were trained on the partitioned data sets at both levels. Speech recognition results showed that acoustic models trained on the first-level and second-level partitions yielded relative character error rate reductions of 9.73% and 32.29%, respectively.
Citations: 0
Tones in whispered Mandarin
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423539
Bin Li, R. Rong
This paper examines and compares the characteristics of tones in a CV syllable in Mandarin under phonated and whispered speech. Formants of the vowel in various contexts are also compared across tone environments in the different phonation types, in order to assess whether and how tone environments and vowel production interact. The paper is also interested in whether the lack of fundamental frequency in whisper is compensated for by other phonetic means in a tonal language. Results suggest that temporal correlates are maintained to a certain extent, and that the vowel space is shifted significantly toward a higher frequency range.
Citations: 2
A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423506
Yao Qian, F. Soong
In human-machine speech communication, it is technically challenging to make the machine talk as naturally as a human so as to facilitate “frictionless” interactions, or to make a human user feel that the communication is as natural as human-to-human communication. We propose a trajectory tiling approach to high quality speech synthesis, in which the speech parameter trajectories, extracted from natural, processed, or synthesized speech, are used to guide the search for the best sequence of waveform segment “tiles” stored in a pre-recorded speech database. We test our approach in both TTS and cross-lingual voice transformation applications. Experimental results show that the proposed trajectory tiling approach can render speech that is both natural and highly intelligible. The perceived high quality of the speech is also confirmed in objective and subjective tests.
Citations: 0
Alternative hypothesis generation using a weighted kernel feature matrix for ASR substitution error correction
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423475
Chao-Hong Liu, Chung-Hsien Wu, David Sarwono
Although automatic speech recognition (ASR) has been used successfully in several applications, it is still non-robust and imprecise, especially in harsh environments where the input speech is of low quality. Robust error correction for ASR outputs thus becomes important in addition to improving recognition performance. In recent approaches to error correction, linguistic or domain information is used to generate alternative hypotheses for the ASR outputs, followed by selection of the most likely alternative. In this study, the distances between ASR outputs and the potentially correct alternatives are estimated based on a weighted context-dependent syllable cluster-based kernel feature matrix, followed by multidimensional scaling (MDS)-based distance rescaling. These distances are then used to construct an alternative syllable lattice, and dynamic programming is used to obtain the most likely correct output with respect to the original ASR results. Experiments on the MATBN Mandarin Chinese broadcast news corpus show that the proposed method achieved an improvement of about 1.95% in word error rate compared to the correction pair approach.
Citations: 0
A study of F0 modelling and generation with lyrics and shape characterization for singing voice synthesis
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423491
Siu Wa Lee, M. Dong, Haizhou Li
Natural pitch fluctuation is essential to the singing voice. Recently, we proposed a generalized F0 modelling method that models the expected F0 fluctuation under various contexts with note HMMs. Knowing that F0 contours close to those of professional human singing promote perceived quality, we are confronted with two requirements: (1) accurate estimation of F0 and (2) precise voiced/unvoiced decisions. In this paper, we introduce two techniques in these directions. The influence of lyric phonetics on singing F0 is considered in order to capture the F0 and voicing behaviour arising from different note-lyrics combinations. The generalized F0 modelling method is further extended to the frequency domain to study whether shape characterization in terms of sinusoids helps F0 estimation. Our experiments showed that the use of lyrics information leads to better F0 generation and improves the naturalness of synthesized singing. While the frequency-domain representation is viable, its performance is less competitive than that of the time-domain representation, which requires further study.
Citations: 4