Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423456
Zhengqi Wen, J. Tao, Hao Che
Speech generated by hidden Markov model (HMM)-based speech synthesis systems (HTS) suffers from an over-smoothing problem caused by statistical modeling. This paper focuses on a post-filtering technique based on statistical modification of the generated speech parameters. The marginal statistics of the parameter trajectories, such as the mean, variance, skewness, and kurtosis, are adjusted relative to the values generated by the HTS system. The technique is compared with the global variance (GV)-based speech generation algorithm. Listening tests showed that post-filtering using only the mean and variance produces results almost equal to those of the GV model; when the skewness and kurtosis are also modified, the quality of the generated speech improves further.
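As a rough illustration of the statistical-modification idea (not the paper's exact algorithm), the mean/variance part of such a post-filter can be sketched as a moment-matching rescaling; the function name `match_moments`, the target statistics, and the toy trajectory below are all hypothetical, and the paper's additional skewness/kurtosis adjustment is omitted:

```python
import numpy as np

def match_moments(traj, target_mean, target_std):
    """Rescale a generated parameter trajectory so its marginal mean and
    standard deviation match target statistics (e.g. from natural speech)."""
    z = (traj - traj.mean()) / (traj.std() + 1e-12)  # zero-mean, unit-variance
    return z * target_std + target_mean

# Toy example: an over-smoothed trajectory whose variance is too small.
rng = np.random.default_rng(0)
generated = 0.3 * rng.standard_normal(200) + 1.0   # std is only ~0.3
filtered = match_moments(generated, target_mean=1.0, target_std=0.8)
print(round(float(filtered.mean()), 2), round(float(filtered.std()), 2))  # 1.0 0.8
```

Matching higher-order moments (skewness, kurtosis) requires a nonlinear warping of the trajectory values and is not captured by this linear rescaling.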
Title: Statistical modification based post-filtering technique for HMM-based speech synthesis. Published in: 2012 8th International Symposium on Chinese Spoken Language Processing.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423484
Syu-Siang Wang, J. Hung, Yu Tsao
In this paper, we propose a cepstral subband normalization (CSN) approach for robust speech recognition. CSN first applies the discrete wavelet transform (DWT) to decompose the original cepstral feature sequence into low- and high-frequency band (LFB and HFB) parts. It then normalizes the LFB components and zeros out the HFB components. Finally, an inverse DWT is applied to the LFB and HFB components to form the normalized cepstral features. When Haar functions are used as the DWT bases, CSN can be computed efficiently, with a 50% reduction in the number of feature components. In addition, our experimental results on the Aurora-2 task show that CSN outperforms conventional cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), and histogram equalization (HEQ). We also integrate CSN with the advanced front-end (AFE) for feature extraction. Experimental results indicate that the integrated AFE+CSN achieves notable improvements over the original AFE. Its simple calculation, compact form, and effective noise robustness make CSN well suited for mobile applications.
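With the Haar basis, a one-level version of the pipeline described above (decompose, normalize the LFB, zero the HFB, invert) is easy to sketch. This is a minimal reading of the abstract, not the authors' implementation; the helper name `csn` and the use of mean/variance normalization on the LFB are assumptions:

```python
import numpy as np

def csn(c):
    """One-level Haar-DWT cepstral sub-band normalization (sketch).
    c: 1-D cepstral coefficient sequence over time (even length)."""
    even, odd = c[0::2], c[1::2]
    lfb = (even + odd) / np.sqrt(2)            # low-frequency band (approximation)
    # The HFB (detail) coefficients are zeroed, so they need not be computed.
    lfb = (lfb - lfb.mean()) / (lfb.std() + 1e-12)  # MVN applied to the LFB only
    out = np.empty_like(c, dtype=float)
    out[0::2] = lfb / np.sqrt(2)               # inverse Haar with HFB = 0
    out[1::2] = lfb / np.sqrt(2)
    return out

x = np.array([1.0, 3.0, 2.0, 6.0, 0.0, 4.0])
y = csn(x)
print(y[0] == y[1])  # True: zeroing the HFB makes each frame pair identical
```

Only the half-length LFB needs to be stored and normalized, which is one way to read the abstract's 50% reduction in feature components.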
Title: A study on cepstral sub-band normalization for robust ASR.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423532
Guoli Ye, B. Mak
This paper proposes a new hidden Markov model (HMM) which we call the speaker-ensemble HMM (SE-HMM). An SE-HMM is a multi-path HMM in which each path is an HMM constructed from the training data of a different speaker. The SE-HMM may be considered a form of template-based acoustic model in which speaker-specific acoustic templates are compressed statistically into speaker-specific HMMs. However, one has the flexibility of building SE-HMMs at various levels of compression: an SE-HMM may be built for a triphone state, a triphone, a whole utterance, or other convenient phonetic units. As a result, the SE-HMM captures more detail than a conventional HMM, yet is much smaller than common template-based acoustic models. Furthermore, the construction of an SE-HMM is simple, and since it is still an HMM, its construction and computation are well supported by common HMM toolkits such as HTK. The proposed SE-HMM was evaluated on the Resource Management and Wall Street Journal tasks, where it consistently gave better word recognition results than the conventional HMM.
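The multi-path idea can be caricatured with per-speaker Gaussians standing in for per-speaker HMM paths: the ensemble scores an observation sequence by the best path, rather than by one pooled model. Everything below (the function `se_score`, single-Gaussian "paths", the toy data) is a hypothetical sketch, far simpler than an actual SE-HMM:

```python
import numpy as np

def se_score(obs, speaker_models):
    """Toy speaker-ensemble score: each 'path' is a speaker-specific
    Gaussian (a stand-in for a speaker-specific HMM); the ensemble
    reports the best path's log-likelihood, as in a multi-path HMM."""
    best = -np.inf
    for mu, var in speaker_models:
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (obs - mu) ** 2 / var)
        best = max(best, ll)
    return best

models = [(0.0, 1.0), (5.0, 1.0)]   # two "speakers"
obs = np.array([4.8, 5.1, 5.3])     # matches the second speaker
print(se_score(obs, models) > se_score(obs, [(0.0, 1.0)]))  # True
```

A pooled single Gaussian covering both speakers would smear the two modes together; keeping one path per speaker preserves the detail, which is the intuition behind the SE-HMM.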
Title: Speaker-ensemble hidden Markov modeling for automatic speech recognition.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423458
Wai-Sum Lee
This study is a cross-dialect comparison of the vowel systems of five Chinese dialects with different inventory sizes, in terms of vowel dispersion and vowel variability. The dialects are Meixian Kejia (Hakka) with 5 vowels, Hong Kong Cantonese with 7 vowels, Fuzhou with 8 vowels, Ningbo with 10 vowels, and Wenling with 11 vowels. Formant frequencies were obtained through spectral analysis of speech data from 10 male and 10 female speakers of each dialect. The findings do not support the vowel dispersion theory, which predicts that (i) the larger the vowel inventory, the more expanded the vowel space in the F1-F2 plane, and (ii) variability in vowel formants is inversely related to vowel inventory size.
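One common way to quantify vowel dispersion, sketched below, is the mean distance of the vowel category means from the inventory centroid in the F1-F2 plane; the paper's exact metric may differ, and the five-vowel formant values used here are hypothetical illustration, not the study's data:

```python
import numpy as np

def vowel_dispersion(formants):
    """Mean Euclidean distance of vowel category means from the
    inventory centroid in the F1-F2 plane (one common dispersion metric)."""
    f = np.asarray(formants, dtype=float)   # rows: vowels; cols: (F1, F2) in Hz
    centroid = f.mean(axis=0)
    return float(np.mean(np.linalg.norm(f - centroid, axis=1)))

# Hypothetical 5-vowel inventory (i, e, a, o, u), mean formants in Hz.
five = [(280, 2250), (450, 1950), (750, 1300), (480, 880), (310, 750)]
print(round(vowel_dispersion(five), 1))
```

The dispersion-theory prediction (i) would then amount to this number growing with inventory size, which is what the study tests and fails to find.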
Title: A cross-dialect comparison of vowel dispersion and vowel variability.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423500
Xugang Lu, M. Unoki, Shigeki Matsuda, Chiori Hori, H. Kashioka
The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. We have previously proposed a regularization framework for noise reduction that takes this tradeoff into account. We regard speech estimation as a function approximation problem in a reproducing kernel Hilbert space (RKHS). The objective function is formulated to find an approximation function that gives a good tradeoff between approximation accuracy and the complexity of the function. Using a regularization method, the approximation function can be estimated from noisy observations. In this paper, we further provide a theoretical analysis of the tradeoff property of the framework in noise reduction. We applied the framework to speech enhancement experiments in real applications; compared with several classical noise reduction methods, the proposed framework showed promising advantages.
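A standard instance of RKHS regularization (not the paper's specific formulation) is kernel ridge regression, where the regularization weight directly controls the accuracy/complexity tradeoff the abstract describes; `kernel_ridge_denoise`, the RBF kernel width, and the synthetic signal are all assumptions for illustration:

```python
import numpy as np

def kernel_ridge_denoise(t, y, lam, width=0.05):
    """RKHS-style smoother (kernel ridge regression with an RBF kernel).
    Larger lam favors a simpler function (more noise reduction, more
    distortion); smaller lam tracks the noisy observations closely."""
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * width ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(t)), y)
    return K @ alpha

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + 0.3 * rng.standard_normal(100)
smooth = kernel_ridge_denoise(t, noisy, lam=1.0)
rough = kernel_ridge_denoise(t, noisy, lam=1e-6)
print(np.std(np.diff(smooth)) < np.std(np.diff(rough)))  # True: heavier regularization gives a smoother estimate
```

Sweeping `lam` traces out the tradeoff curve between residual noise and distortion of the underlying signal.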
Title: Controlling the tradeoff property in a regularization framework for noise reduction.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423501
Xugang Lu, Yu Tsao, Shigeki Matsuda, Chiori Hori, H. Kashioka
Ensemble acoustic modeling can be used to model the different factors that cause variability in the acoustic space, and combining the resulting models can improve the performance of automatic speech recognition (ASR). A main concern is how to partition the training data into the subsets on which the ensemble models are trained. In this study, we focus on ensemble acoustic modeling for the acoustic variability caused by gender and accent in Chinese large vocabulary continuous speech recognition (LVCSR). Considering that gender and accent information may be encoded in local acoustic realizations of a few specific phonetic classes rather than in the global acoustic distribution, we propose an acoustic space partition method based on broad phonetic class (BPC) modeling of speakers. Using principal component analysis (PCA) of the BPC-based speaker representation, we design a two-level hierarchical data partition in the low-dimensional speaker factor space, reflecting gender and accent information. Ensemble acoustic models were trained on the partitioned data sets at both levels. Speech recognition results showed that acoustic models trained on the first-level and second-level partitions obtained 9.73% and 32.29% relative improvements in character error reduction rate, respectively.
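A two-level hierarchical partition in a PCA factor space might look like the sketch below: split on the first principal component, then refine each half on the second. The function name, the random speaker vectors, and the sign-based splits are all hypothetical; the paper's partition is built from BPC-based speaker representations, which are not reproduced here:

```python
import numpy as np

def two_level_partition(speaker_vecs):
    """Two-level partition of speakers in a PCA factor space (sketch):
    level 1 splits on PC1 (e.g. a gender-like factor), level 2 refines
    each half on PC2 (e.g. an accent-like factor), giving four subsets."""
    X = speaker_vecs - speaker_vecs.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:2].T                      # coordinates on PC1 and PC2
    level1 = proj[:, 0] >= 0
    level2 = proj[:, 1] >= 0
    return (level1.astype(int) * 2 + level2.astype(int)).tolist()  # labels 0..3

rng = np.random.default_rng(2)
vecs = rng.standard_normal((20, 8))          # 20 speakers, 8-dim representation
labels = two_level_partition(vecs)
print(sorted(set(labels)))
```

Each label then indexes the training subset for one ensemble acoustic model; the first-level models use only the PC1 split, the second-level models the full four-way split.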
Title: Acoustic space partition based on broad phonetic class for ensemble acoustic modeling.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423539
Bin Li, R. Rong
This paper examines and compares the characteristics of Mandarin tones in CV syllables under phonated and whispered speech. Vowel formants in various contexts are also compared across tone environments in the two phonation types, in order to assess whether and how tone environment and vowel production interact. The paper also investigates whether the lack of fundamental frequency in whisper is compensated for by other phonetic means in a tonal language. Results suggest that temporal correlates of tone are maintained to a certain extent, and that the vowel space is shifted significantly towards a higher frequency range.
Title: Tones in whispered Mandarin.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423506
Yao Qian, F. Soong
In human-machine speech communication, it is technically challenging to make the machine talk as naturally as a human, so as to facilitate “frictionless” interactions and make a human user feel the communication is as natural as human-human conversation. We propose a trajectory tiling approach to high-quality speech synthesis, in which speech parameter trajectories, extracted from natural, processed, or synthesized speech, are used to guide the search for the best sequence of waveform segment “tiles” stored in a pre-recorded speech database. We test our approach in both TTS and cross-lingual voice transformation applications. Experimental results show that the proposed trajectory tiling approach can render speech that is both natural and highly intelligible. The perceived high quality of the speech is confirmed in objective and subjective tests.
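The guided tile search can be caricatured as a Viterbi-style dynamic program over scalar "tiles": each step pays a distance to the guide trajectory plus a join cost between consecutive tiles. This is a toy stand-in for the paper's waveform-segment search; `tile_search`, the scalar features, and the join-cost weight `w` are all assumptions:

```python
import numpy as np

def tile_search(target, tiles, w=0.1):
    """Viterbi-style search for the tile sequence that best follows a
    guide trajectory: cost = distance to the target at each step plus a
    weighted join cost between consecutive tiles (toy scalar version)."""
    n, m = len(target), len(tiles)
    cost = np.full((n, m), np.inf)
    back = np.zeros((n, m), dtype=int)
    cost[0] = np.abs(tiles - target[0])
    for i in range(1, n):
        for j in range(m):
            join = w * np.abs(tiles - tiles[j])   # join cost from each prev tile
            prev = cost[i - 1] + join
            back[i, j] = int(np.argmin(prev))
            cost[i, j] = prev[back[i, j]] + abs(tiles[j] - target[i])
    seq = [int(np.argmin(cost[-1]))]
    for i in range(n - 1, 0, -1):
        seq.append(int(back[i, seq[-1]]))
    return [float(tiles[j]) for j in reversed(seq)]

tiles = np.array([0.0, 0.5, 1.0, 1.5])
print(tile_search(np.array([0.1, 0.6, 1.1]), tiles))  # [0.0, 0.5, 1.0]
```

Raising `w` trades trajectory fidelity for smoother joins, the same tension a real unit-selection tiler must balance.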
Title: A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423475
Chao-Hong Liu, Chung-Hsien Wu, David Sarwono
Although automatic speech recognition (ASR) has been successfully used in several applications, it remains non-robust and imprecise, especially in harsh environments where the input speech is of low quality. Robust error correction for ASR outputs is therefore important in addition to improving recognition performance. In recent approaches to error correction, linguistic or domain information is used to generate alternative hypotheses for the ASR outputs, followed by selection of the most likely alternative. In this study, the distances between ASR outputs and potentially correct alternatives are estimated using a weighted context-dependent syllable cluster-based kernel feature matrix, followed by multidimensional scaling (MDS)-based distance rescaling. These distances are then used to construct an alternative syllable lattice, and dynamic programming is used to obtain the most likely correct output with respect to the original ASR results. Experiments show that the proposed method achieved about a 1.95% improvement in word error rate over the correction-pair approach on the MATBN Mandarin Chinese broadcast news corpus.
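The MDS step can be illustrated with classical MDS, which embeds items so that Euclidean distances approximate a given distance matrix via double centering and an eigendecomposition. This sketches only the generic rescaling machinery; the paper's weighted kernel feature matrix construction is not reproduced, and the three-point distance matrix below is a made-up example:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed n items in k dimensions so that pairwise
    Euclidean distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J               # double centering
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]             # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Distances between three points that lie on a line at 0, 1, and 3.
D = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
X = classical_mds(D, k=1)
print(round(float(abs(X[1, 0] - X[0, 0])), 2))  # 1.0: distances are recovered
```

In the paper's setting the rescaled coordinates feed the alternative syllable lattice, over which dynamic programming picks the final output.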
Title: Alternative hypothesis generation using a weighted kernel feature matrix for ASR substitution error correction.
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423491
Siu Wa Lee, M. Dong, Haizhou Li
Natural pitch fluctuation is essential to the singing voice. We have recently proposed a generalized F0 modelling method which models the expected F0 fluctuation under various contexts with note HMMs. Since F0 contours close to those of professional human singing promote perceived quality, we face two requirements: (1) accurate estimation of F0 and (2) precise voiced/unvoiced decisions. In this paper, we introduce two techniques in these directions. The influence of lyrics phonetics on singing F0 is considered, to capture the F0 and voicing behaviour arising from different note-lyrics combinations. The generalized F0 modelling method is further extended to the frequency domain to study whether shape characterization in terms of sinusoids helps F0 estimation. Our experiments showed that the use of lyrics information leads to better F0 generation and improves the naturalness of synthesized singing. While the frequency-domain representation is viable, its performance is less competitive than that of the time-domain representation, which requires further study.
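One simple frequency-domain shape characterization, in the spirit of the sinusoidal representation mentioned above, is to describe an F0 contour segment by its first few cosine-basis (DCT-II-like) coefficients. The functions, the coefficient count, and the toy contour below are illustrative assumptions, not the paper's representation:

```python
import numpy as np

def f0_shape_coeffs(f0, k=4):
    """First k cosine-basis (DCT-II-like) coefficients of an F0 segment:
    a compact frequency-domain description of the contour's shape."""
    n = len(f0)
    t = (np.arange(n) + 0.5) / n
    basis = np.cos(np.pi * np.outer(np.arange(k), t))
    return (2.0 / n) * (basis @ f0)

def f0_from_coeffs(coeffs, n):
    """Reconstruct a smooth contour from the shape coefficients."""
    t = (np.arange(n) + 0.5) / n
    basis = np.cos(np.pi * np.outer(np.arange(len(coeffs)), t))
    w = coeffs.copy()
    w[0] *= 0.5                               # DCT-II convention for the DC term
    return w @ basis

n = 50
t = (np.arange(n) + 0.5) / n
f0 = 200 + 10 * np.cos(np.pi * t) + 5 * np.cos(2 * np.pi * t)  # toy contour (Hz)
recon = f0_from_coeffs(f0_shape_coeffs(f0, k=4), n)
print(np.max(np.abs(f0 - recon)) < 1e-9)  # True: 4 coefficients capture this shape
```

Real sung F0 has vibrato and overshoot that a 4-coefficient fit smooths away, which hints at why a frequency-domain shape model can lose ground to a time-domain one.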
Title: A study of F0 modelling and generation with lyrics and shape characterization for singing voice synthesis.