Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423512
Jian Zhang, Risheng Xia, Zhonghua Fu, Junfeng Li, Yonghong Yan
As an indispensable instrument in daily life, the mobile phone is used in diverse environments and suffers from speech quality degradation caused by background noise. In this paper, we propose a novel two-microphone noise reduction system based on the power level ratio (PLR) of the observed signals. In the system, a primary microphone is placed close to the talker's mouth and an auxiliary microphone is placed farther away. The proposed noise reduction algorithm first calculates the ratio of the powers of the signals observed at the two microphones, and then derives a spectral gain function from this power level ratio using a sigmoid function. Experimental results demonstrate that the proposed algorithm yields much higher speech quality than state-of-the-art noise reduction algorithms and, more importantly, incurs much lower computational cost, which makes it feasible for mobile phones.
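The per-bin PLR-to-gain mapping described in this abstract can be sketched as follows. This is a minimal illustration: the sigmoid slope (`alpha`) and midpoint (`plr0`) are hypothetical tuning constants, not values taken from the paper.

```python
import math

def plr_gain(primary_power, auxiliary_power, alpha=4.0, plr0=1.0, eps=1e-12):
    """Spectral gain from the power level ratio (PLR) of two microphones.

    A high primary-to-auxiliary power ratio suggests near-field speech
    (gain -> 1); a ratio near 1 suggests diffuse background noise
    (gain -> 0).  alpha and plr0 are illustrative constants only.
    """
    plr = primary_power / (auxiliary_power + eps)
    # Sigmoid mapping of the log-PLR to a gain in (0, 1).
    return 1.0 / (1.0 + math.exp(-alpha * (math.log(plr + eps) - math.log(plr0))))

def enhance_frame(primary_spec_power, auxiliary_spec_power):
    """Per-frequency-bin gains for one analysis frame."""
    return [plr_gain(p, a) for p, a in zip(primary_spec_power, auxiliary_spec_power)]

# Speech-dominated bin (primary much stronger) vs. noise-dominated bin.
gains = enhance_frame([10.0, 1.0], [1.0, 1.0])
```

The gains would then multiply the primary microphone's spectrum before resynthesis; the per-bin computation involves only a ratio and a sigmoid, which is consistent with the low computational cost the abstract claims.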
Title: A fast two-microphone noise reduction algorithm based on power level ratio for mobile phone
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423493
Chun Xing Li, Zhiyong Wu, Fanbo Meng, H. Meng, Lianhong Cai
This paper addresses the automatic detection of contrastive word pairs and their acoustic realization through emphasis for expressive text-to-speech (TTS) synthesis in English. Support vector machines (SVMs) are used to automatically detect contrastive word pairs from lexical features, syntactic dependencies and semantic relations; markedly better performance is achieved by adding accent ratio and word identity features. Hidden Markov model (HMM) based speech synthesis is then used to generate emphatic speech by placing emphasis on the detected contrastive word pairs. Subjective experiments show that most listeners find emphasis on contrastive word pairs more acceptable than emphasis on non-contrastive word pairs, which indicates the importance of accurate detection of contrastive word pairs.
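The accent ratio feature mentioned above (roughly, how often a word carries a pitch accent in a corpus) can be sketched as follows; the corpus counts and the fallback value for unseen words are illustrative assumptions, not the paper's exact estimator.

```python
from collections import Counter

def accent_ratio(word, accented_counts, total_counts, default=0.5):
    """Fraction of a word's corpus occurrences that carried a pitch accent.

    accented_counts / total_counts are corpus statistics; words unseen in
    the corpus fall back to a neutral default.  (A sketch of the 'accent
    ratio' feature, not the paper's exact formulation.)
    """
    if total_counts.get(word, 0) == 0:
        return default
    return accented_counts.get(word, 0) / total_counts[word]

# Toy corpus statistics (hypothetical numbers).
total = Counter({"not": 100, "the": 200})
accented = Counter({"not": 90, "the": 10})

features = [accent_ratio(w, accented, total) for w in ["not", "the", "unseen"]]
```

Such per-word scalars would be appended to the lexical, syntactic and semantic features before SVM training.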
Title: Detection and emphatic realization of contrastive word pairs for expressive text-to-speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423527
Wei Rao, M. Mak
This paper investigates the small sample-size problem in i-vector based speaker verification systems. The idea of i-vectors is to represent the characteristics of speakers in the factors of a factor analyzer. Because the factor loading matrix defines the possible speaker and channel variability of i-vectors, it is important to suppress the unwanted channel variability. Linear discriminant analysis (LDA), within-class covariance normalization (WCCN), and probabilistic LDA are commonly used for this purpose. These methods, however, require training data comprising many speakers, each providing sufficient recording sessions, to achieve good performance. Performance suffers when the number of speakers and/or the number of sessions per speaker is too small. This paper compares four approaches to addressing this small sample-size problem: (1) preprocessing the i-vectors by PCA before applying LDA (PCA+LDA), (2) replacing the matrix inverse in LDA with the pseudo-inverse, (3) applying multi-way LDA by exploiting the microphone and speaker labels of the training data, and (4) increasing the matrix rank in LDA by generating more i-vectors through utterance partitioning. Results based on the NIST 2010 SRE suggest that utterance partitioning performs best, followed by multi-way LDA and PCA+LDA.
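Approach (4), utterance partitioning, can be sketched as follows: one utterance's frame sequence is split into several partitions, and each partition is later mapped to its own i-vector, multiplying the vectors per speaker and raising the rank of the scatter matrices used by LDA. This is a schematic sketch; real systems typically also randomize the frame order before partitioning.

```python
def partition_utterance(frames, num_parts):
    """Split one utterance's frame sequence into roughly equal partitions.

    Each partition would later yield its own i-vector.  Sketch of the
    utterance-partitioning idea only; frame shuffling is omitted.
    """
    n = len(frames)
    base, extra = divmod(n, num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        parts.append(frames[start:start + size])
        start += size
    return parts

# 10 frames split three ways: every frame lands in exactly one partition.
parts = partition_utterance(list(range(10)), 3)
```

Three partitions from one utterance turn one training i-vector into three, at the cost of estimating each from fewer frames.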
Title: Alleviating the small sample-size problem in i-vector based speaker verification
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423524
Xian-Jun Xia, Zhenhua Ling, Chen-Yu Yang, Lirong Dai
This paper presents an improved unit selection and waveform concatenation speech synthesis method that gathers and utilizes human feedback on synthetic speech. First, a set of texts is synthesized by the baseline unit selection system. Each prosodic word within the synthetic speech is then judged as natural or unnatural by listeners. In our proposed method, the natural synthetic segments are treated as virtual candidate units that extend the original speech corpus for unit selection, and a new speech synthesis system is constructed using this extended corpus. A synthetic error detector based on an SVM classifier is also built from the natural and unnatural synthetic speech. At synthesis time, the input text is synthesized by the baseline system and the extended system simultaneously, and the two unit selection results are evaluated by the trained synthetic error detector to determine the better one. Experimental results prove the effectiveness of the proposed method in improving the naturalness of synthetic speech on a place-name synthesis task.
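The dual-synthesis selection step can be sketched as follows; the synthesis functions and the error detector here are toy stand-ins for the baseline/extended systems and the SVM-based detector described in the abstract.

```python
def select_output(text, systems, error_detector):
    """Synthesize with each system and keep the candidate whose detected
    synthesis-error count is lowest.

    `systems` maps a name to a synthesis function returning a candidate;
    `error_detector` returns the number of segments judged unnatural
    (a stand-in for the SVM-based detector).
    """
    candidates = {name: synth(text) for name, synth in systems.items()}
    best = min(candidates, key=lambda name: error_detector(candidates[name]))
    return best, candidates[best]

# Toy stand-ins: here the "extended" system produces fewer detected errors.
systems = {"baseline": lambda t: t + " [base]", "extended": lambda t: t + " [ext]"}
detector = lambda candidate: 2 if "[base]" in candidate else 1
name, output = select_output("Beijing", systems, detector)
```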
Title: Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423540
T. Zou, Jinsong Zhang, Wen Cao
This paper investigated the Mandarin Tone 2-Tone 3 perceptual space in isolated syllables and disyllables for native speakers and Japanese learners. In two experiments, we examined listeners' use of pitch height and the position of the turning point as cues to tone identity. The results showed that, in isolated syllables, Chinese listeners perceived the two tones in a categorical fashion, with pitch height a more important cue than the turning point; within a certain range of pitch height, the two variables were in a complementary relationship. The perceptual results of the Japanese subjects showed no apparent categorical pattern. In disyllables, for the Chinese subjects, the contextual influence on the boundary position in the Tone 2-half Tone 3 continuum was not significant, but the boundary positions in pitch height and turning point in the Tone 2-Tone 3 continuum shifted significantly across tonal contexts. Compared to the Chinese subjects, the Japanese subjects' perceptual ranges of Tone 3 in isolated syllables and disyllables were narrower, and it was more difficult for them to identify the two tones in disyllables.
Title: A comparative study of perception of tone 2 and tone 3 in Mandarin by native speakers and Japanese learners
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423518
Jinfu Ni, Y. Shiga, H. Kawai, H. Kashioka
In order to build web-based voicefonts, an unsupervised method is needed to automate the extraction of the acoustic and linguistic properties of speech. This paper addresses the impact of automatic speech transcription on statistical parametric speech synthesis based on a single speaker's 100-hour speech corpus, focusing on two factors affecting speech quality: transcript accuracy and the size of the training dataset. Experimental results indicate that for an unsupervised method to achieve fair (MOS 3) voice quality, 1.5 hours of speech are necessary when phone accuracy is above 80%, and 3.5 hours when phone accuracy falls to 65%. The improvement in MOS quality is not significant when more than 4 hours of speech are used. The use of automatic transcripts does lead to voice degradation; one mechanism behind this is that transcript errors cause mismatches between speech segments and phone labels, which significantly distort the structures of the decision trees in the resulting HMM-based voices.
Title: Experiments on unsupervised statistical parametric speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423494
Yan Li, Si Li, Weiran Xu, Jun Guo
The aim of term semantic orientation analysis is to mine the sentiment polarity of words and phrases from their contexts. This paper presents a novel algorithm, called Affinity Propagation, to analyze the semantic orientations of terms. Specifically, we build an informative graph from a text corpus using an efficient Word Activation Force model, regarding each term as a node in the graph. We then propagate opinionated information over the whole graph using only a small number of seed terms. Finally, we use affinity vectors rather than context vectors to detect term polarities and construct polarity lexicons. Evaluations of the proposed algorithm show its advantages over state-of-the-art algorithms, and further improvements can be obtained by combining Affinity Propagation with Pointwise Mutual Information.
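The seed-based propagation idea can be illustrated with a generic label-propagation sketch over a weighted term graph. This stands in for, but is not, the paper's Affinity Propagation over the Word Activation Force graph; the decay factor and the toy graph are invented for illustration.

```python
def propagate_polarity(graph, seeds, iterations=20, decay=0.8):
    """Spread seed polarity scores (+1/-1) over a weighted term graph.

    Each round, a non-seed term's score becomes the decay-weighted
    average of its neighbours' scores; seed terms stay clamped.
    Generic label-propagation sketch, not the paper's exact algorithm.
    """
    scores = {t: seeds.get(t, 0.0) for t in graph}
    for _ in range(iterations):
        new = {}
        for term, nbrs in graph.items():
            if term in seeds:
                new[term] = seeds[term]  # clamp seed polarities
            elif not nbrs:
                new[term] = 0.0
            else:
                total_w = sum(nbrs.values())
                new[term] = decay * sum(w * scores[n] for n, w in nbrs.items()) / total_w
        scores = new
    return scores

# Toy graph: "great" is a positive seed, "awful" a negative seed.
graph = {
    "great": {"good": 1.0},
    "good": {"great": 1.0},
    "awful": {"bad": 1.0},
    "bad": {"awful": 1.0},
}
scores = propagate_polarity(graph, {"great": 1.0, "awful": -1.0})
```

Terms linked to positive seeds end with positive scores and vice versa, from which a polarity lexicon can be thresholded.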
Title: Analyzing semantic orientation of terms using Affinity Propagation
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423487
Guo Li, P. Mok
Previous studies of the interlanguage speech intelligibility benefit (ISIB) have focused on the influence of subjects' native language (L1) on phonetic production and perception in their second language (L2). However, no research so far has examined the effect of listeners' exposure to and training in a second language (L2) on their understanding of L2-accented speech in their native language (L1). This paper addresses this issue with subjects whose L1 is English and whose L2 is Mandarin. Characteristics of Mandarin-accented English include the devoicing of word-final consonants and insufficient distinction of the vowel pairs /i:/ - /i/ and /ε/ - /æ/; these features can negatively affect listeners' understanding of contrastive word pairs. In this study, 9 native Mandarin listeners, 9 monolingual English listeners and 9 English-Mandarin bilinguals were asked to listen to recordings of Mandarin-accented English and identify minimal pairs involving the above consonant and vowel contrasts. Results show that, among the three groups, the native Mandarin listeners scored the highest accuracy, while the English listeners with Mandarin training and the monolingual English listeners had similar scores. These findings support the existence of an ISIB for Mandarin and call for further study of bilingual L2 learners.
Title: Preliminary study on the interlanguage speech intelligibility benefit for English-Mandarin bilingual L2 learners
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423536
Yi-Chin Huang, Chung-Hsien Wu, Sz-Ting Weng
In this paper, a novel hierarchical prosodic unit selection method based on pitch contour pattern retrieval is proposed, in order to obtain natural pitch contours for a personalized synthetic voice. In this framework, a hierarchical prosodic unit based on the Fujisaki model is used to take both the local pitch contour variation and the global intonation of an utterance into account. Furthermore, novel ways of integrating the pitch contour patterns of prosodic units into the prosodic model are introduced to improve the selection of an appropriate pitch contour. A prosodic unit selection method based on sentence retrieval is proposed, which uses not only the traditional linguistic cues as selection criteria but also the shape of the pitch contour. In addition, codewords of the pitch patterns in the training corpus and the synthesized corpus are constructed by the proposed method and used to map the relation between them. Finally, a language model of pitch patterns is adopted to find a proper pitch pattern sequence for the input text. The evaluation results demonstrate that the proposed prosodic model substantially improves the naturalness of the intonation of the synthesized speech compared to a model-based method.
Title: Hierarchical prosodic pattern selection based on Fujisaki model for natural Mandarin speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423510
Cuiling Zhang
Change of pitch is a common type of disguise adopted by criminals; in forensic voice comparison it introduces substantial variance in acoustic properties and degrades speaker recognition performance. This paper investigates the acoustic properties of voices disguised with raised and lowered pitch from 11 Chinese male speakers. Parameters including fundamental frequency, syllable duration, intensity, vowel formant frequencies, and the long-term average spectrum (LTAS) were measured and statistically compared with those of the normal voices. The effect of voice disguise on speaker recognition by both humans and machines is also evaluated. The results show that speakers differ in their ability to adjust pitch. Pitch change leads to corresponding changes in the other parameters and degrades speaker recognition by parameter discrimination, auditory perception and automatic speaker recognition, but some systematic parameter changes provide clues for forensic voice comparison.
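The kind of parameter comparison described (normal-voice vs. disguised-voice values of a parameter such as F0) can be sketched as follows; the F0 values are invented illustrations, not measurements from the paper.

```python
from statistics import mean, stdev

def mean_shift(normal, disguised):
    """Mean change and a simple effect size (shift over the normal-voice
    standard deviation) for one acoustic parameter, e.g. F0 in Hz.

    A schematic comparison only; the paper's statistical tests are
    not reproduced here.
    """
    shift = mean(disguised) - mean(normal)
    effect = shift / stdev(normal)
    return shift, effect

# Hypothetical per-utterance mean F0 values (Hz) for one speaker.
normal_f0 = [118.0, 122.0, 120.0, 121.0, 119.0]
raised_f0 = [168.0, 175.0, 171.0, 170.0, 169.0]
shift, effect = mean_shift(normal_f0, raised_f0)
```

A large, consistent shift in one parameter, paired with systematic changes in others, is the kind of pattern the abstract suggests can serve as a clue in forensic voice comparison.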
Title: Acoustic analysis of disguised voices with raised and lowered pitch