Improving the informativeness of verbose queries using summarization techniques for spoken document retrieval
Pub Date: 2010-12-01, DOI: 10.1109/ISCSLP.2010.5684847
Shih-Hsiang Lin, Berlin Chen, E. Jan
Query-by-example information retrieval aims to help users find relevant documents accurately when they provide specific query exemplars describing what they are interested in. The query exemplars are usually long, taking the form of a partial or even a full document. However, they may contain extraneous terms (or off-topic information) that have a negative impact on retrieval performance. In this paper, we propose to integrate extractive summarization techniques into the retrieval process so as to improve the informativeness of a verbose query exemplar. The original query exemplar is first divided into several sub-queries, or sentences. Summarization techniques are then employed to select a salient subset of these sub-queries, which together form a new, more concise query exemplar. Experiments on the TDT Chinese collection show that the proposed approach is effective and promising.
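As an illustration of the general idea only (not the authors' summarization models), the sketch below splits a verbose exemplar into sentences and keeps the ones closest to the exemplar's TF-IDF centroid as the new query; the sentence-splitting heuristic, the centroid criterion, and the `keep` parameter are illustrative assumptions.

```python
# Minimal sketch: shorten a verbose query exemplar by keeping only its most
# "central" sentences, then use the concatenation as the new query.
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def condense_query(exemplar: str, keep: int = 3) -> str:
    """Select the `keep` sentences most similar to the exemplar's centroid."""
    sentences = [s.strip() for s in re.split(r"[.!?]\s+", exemplar) if s.strip()]
    if len(sentences) <= keep:
        return exemplar
    tfidf = TfidfVectorizer().fit_transform(sentences)    # one row per sentence
    centroid = np.asarray(tfidf.mean(axis=0))             # exemplar centroid
    scores = cosine_similarity(tfidf, centroid).ravel()   # salience of each sentence
    top = sorted(np.argsort(scores)[-keep:])              # keep original sentence order
    return " ".join(sentences[i] for i in top)
```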
{"title":"Improving the informativeness of verbose queries using summarization techniques for spoken document retrieval","authors":"Shih-Hsiang Lin, Berlin Chen, E. Jan","doi":"10.1109/ISCSLP.2010.5684847","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684847","url":null,"abstract":"Query-by-example information retrieval aims at helping users to find relevant documents accurately when users provide specific query exemplars describing what they are interested in. The query exemplars are usually long and in the form of either a partial or even a full document. However, they may contain extraneous terms (or off-topic information) that would have a negative impact on the retrieval performance. In this paper, we propose to integrate extractive summarization techniques into the retrieval process so as to improve the informativeness of a verbose query exemplar. The original query exemplar is first divided into several sub-queries or sentences. To construct a new concise query exemplar, summarization techniques are then employed to select a salient subset of sub-queries. Experiments on the TDT Chinese collection show that the proposed approach is indeed effective and promising.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125354569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frame selection of interview channel for NIST speaker recognition evaluation
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684886
Hanwu Sun, B. Ma, Haizhou Li
In this paper, we study a front-end frame selection approach for interview-channel speaker recognition. The approach keeps high-quality speech frames and removes noisy and irrelevant frames before speaker modeling. For robust voice activity detection (VAD) across the different types of microphones located in the interview room, we adopt a spectral subtraction algorithm for noise reduction. An energy-based frame selection algorithm is first applied to indicate speech activity at the frame level. To overcome the summed-channel effects of the interview condition, we then study how to extract the relevant speaker's speech frames based on the VAD tags and ASR transcript tags provided by NIST. An eigenchannel-based GMM-SVM speaker recognition system is used to evaluate the proposed method. Experiments are conducted on the interview-interview conditions of the NIST 2008 and NIST 2010 Speaker Recognition Evaluations. The results demonstrate that the approach provides an efficient way to select high-quality speech frames and the relevant speaker's voice in the interview environment for speaker recognition.
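A minimal sketch of this kind of front end is given below, assuming a single-channel signal whose leading frames contain only noise; the frame sizes, spectral floor, and energy threshold are illustrative choices, not the values used in the paper.

```python
# Minimal sketch: spectral subtraction for noise reduction, then an energy
# gate that keeps only high-energy (likely speech) frames.
import numpy as np


def select_speech_frames(x: np.ndarray, frame_len: int = 256, hop: int = 128,
                         noise_frames: int = 10, thresh_db: float = 12.0):
    """Return indices of frames kept after denoising and energy gating."""
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

    noise = mag[:noise_frames].mean(axis=0)           # assume leading frames are noise
    clean_mag = np.maximum(mag - noise, 0.1 * noise)  # spectral subtraction with a floor

    energy_db = 10 * np.log10(np.sum(clean_mag ** 2, axis=1) + 1e-10)
    keep = energy_db > energy_db.min() + thresh_db    # simple adaptive energy gate
    return np.where(keep)[0]
```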
{"title":"Frame selection of interview channel for NIST speaker recognition evaluation","authors":"Hanwu Sun, B. Ma, Haizhou Li","doi":"10.1109/ISCSLP.2010.5684886","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684886","url":null,"abstract":"In this paper, we study a front-end frame selection approach for the interview channel speaker recognition system. This new approach keeps the high quality speech frames and removes noisy and irrelevant speech frames for speaker modeling. For robust voice activity detection (VAD) under the different types of microphones located in the interview room, we adopt the spectral subtraction algorithm for noise reduction. An energy based frame selection algorithm is first applied to indicate the speech activity at the frame level. To overcome the summed channel effects in the interview condition, a study is conducted to effectively extract the relevant speaker's speech frames based on VAD Tags and ASR transcript Tags provided by NIST. The eigenchannel based GMM-SVM speaker recognition system is used to evaluate the proposed method. The experiments are conducted on the NIST 2008 and NIST 2010 Speaker Recognition Evaluation interview-interview conditions. It demonstrates that the approach provides an efficient way to select high quality speech frames and the relevant speaker's voice in the interview environment for speaker recognition.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114679470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sentence Decomplexification using holistic aspect-based clause detection for long sentence understanding
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684897
Chao-Hong Liu, Chung-Hsien Wu
Long sentences pose significant challenges for many natural language processing (NLP) tasks, such as machine translation and language understanding, because even state-of-the-art parsers still find them very difficult to analyze. In this paper, we identify the Sentence Decomplexification (SD) problem and propose models for SD to help understand long sentences. Given a complex sentence, SD returns two sentences: the main clause and the subordinate clause. Together, these two clauses contain all the information of the original sentence. Since identifying subordinate clauses is more difficult than traditional chunking, we also propose a holistic aspect-based detection (HAD) method for clause detection, which reduces the overhead of the sentence-similarity computation required by SD. We provide the formalisms of SD and show that HAD improves the efficiency of this task. The SD system was used to improve the performance of a long-sentence understanding system. Experimental results show that SD achieves 78.7% accuracy using the Chinese Gigaword Corpus as the sentence-comparison corpus. For long-sentence understanding, the proposed method improves accuracy from 70.7% to 75.5% compared with the same system without SD.
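To make the SD output format concrete, here is a toy sketch that only reconstructs the two output sentences from a subordinate-clause span; the HAD detector itself is not reproduced, and the example sentence and span are hypothetical.

```python
# Toy sketch of the SD output only: given the character span of a detected
# subordinate clause, return (main clause, subordinate clause).
from typing import Tuple


def decomplexify(sentence: str, sub_span: Tuple[int, int]) -> Tuple[str, str]:
    start, end = sub_span
    subordinate = sentence[start:end].strip(" ,")
    main = " ".join((sentence[:start] + " " + sentence[end:]).split()).strip(" ,")
    return main, subordinate


# Hypothetical span, as if produced by a clause detector:
main, sub = decomplexify(
    "The senator, who had opposed the bill for months, finally voted for it.",
    (11, 49))
print(main)  # "The senator finally voted for it."
print(sub)   # "who had opposed the bill for months"
```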
{"title":"Sentence Decomplexification using holistic aspect-based clause detection for long sentence understanding","authors":"Chao-Hong Liu, Chung-Hsien Wu","doi":"10.1109/ISCSLP.2010.5684897","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684897","url":null,"abstract":"Long sentences have posed significant challenges for many natural language processing (NLP) tasks such as machine translation and language understanding, because it is still very difficult for the state-of-the-art parsers to analyze them. In this paper, we identify the Sentence Decomplexification (SD) problem and propose models for SD to help understand long sentences. Given a complex sentence, SD seeks to return two sentences, one main clause and the other subordinate clause. These two clauses together include all the information of the original sentence. Since identifying subordinate clauses is a more difficult task than traditional chunking, we also propose a holistic aspect-based detection (HAD) method for clause detection to reduce the overhead required for SD sentence similarity computation. We provide the formalisms of SD and show that HAD can be used for efficiency purposes to this task. The SD system was used to improve the performance of a long sentence understanding system. Experimental results show that the task of SD achieves 78.7% accuracy using Chinese Gigaword Corpus as sentence comparison corpus. For the performance of long sentence understanding, the proposed method reports an improvement of accuracy from 70.7% to 75.5% as compared to that without using SD.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114946063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Perception and analysis of linearly approximated F0 contours in Cantonese speech
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684486
Yujia Li, Tan Lee
Our previous study revealed that F0 variations in Cantonese speech can be adequately represented by linear approximations of the observed F0 contours. That finding, however, was obtained with test materials having relatively limited lexical and segmental variation. In the present work, the generalizability of the linear approximation is examined with a large corpus of polysyllabic Cantonese words. Perceptual results clearly validate the effectiveness of linearly approximated F0 contours. The generated linear approximations are then analyzed to characterize the properties of linear F0 movements in continuous Cantonese speech, particularly in association with different tones. Lastly, two objective measures of the modified F0 contours, RMS error and contour correlation, are compared with the actual perceptual results. Neither of these objective measures is found to give a reliable prediction of perceived speech naturalness.
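A minimal sketch of the linear approximation and the two objective measures is shown below; the least-squares line fit and the example contour are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch: fit a straight line to an observed F0 contour and compute
# RMS error and contour correlation against the original contour.
import numpy as np


def linear_f0(f0: np.ndarray):
    """Least-squares line fit to a voiced F0 contour (Hz or semitones)."""
    t = np.arange(len(f0))
    slope, intercept = np.polyfit(t, f0, deg=1)
    approx = slope * t + intercept
    rms_error = float(np.sqrt(np.mean((approx - f0) ** 2)))
    correlation = float(np.corrcoef(approx, f0)[0, 1])
    return approx, rms_error, correlation


f0 = np.array([210.0, 205.0, 199.0, 190.0, 184.0, 178.0])  # an example falling contour
_, rms, corr = linear_f0(f0)
print(f"RMS error = {rms:.2f} Hz, correlation = {corr:.3f}")
```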
{"title":"Perception and analysis of linearly approximated F0 contours in Cantonese speech","authors":"Yujia Li, Tan Lee","doi":"10.1109/ISCSLP.2010.5684486","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684486","url":null,"abstract":"Our previous study revealed that F0 variations in Cantonese speech can be sufficiently represented by linear approximations of the observed F0 contours. This was observed with test materials that have relatively limited lexical and segmental variations. In the present work, the generalizability of linear approximation is examined with a large corpus of polysyllabic Cantonese words. Perceptual results clearly validate the effectiveness of linearly approximated F0 contours. Subsequently analysis of the amount of generated linear approximations is carried out. The properties of linear F0 movements in continuous Cantonese speech are learned, particularly in association with different tones. Lastly, two objective evaluations of the modified F0 contours, RMS error and contour correlation are compared with the true perceptual performance. It is found that neither of these objective measurements gives reliable prediction on perceived speech naturalness.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116462768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A speedup method for the separation of speech signals in frequency domain
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684892
Shih-Hsun Chen, Hsiao-Chuan Wang
Independent component analysis (ICA) is a commonly used method for finding the demixing matrix in blind source separation (BSS). For speech signals, BSS must be solved under a convolutive mixing model; that is, the ICA technique is extended to the frequency domain, and cross-spectral density matrices are computed for each frequency bin instead of covariance matrices in the time domain. The joint approximate diagonalization (JADIAG) algorithm proposed by D. T. Pham has proved effective for this convolutive mixing problem. This paper presents a method to speed up the JADIAG computation in two phases. First, the critical-band property of the human auditory system is exploited so that a set of selected demixing matrices is shared within each critical band, reducing the total number of demixing matrices. Second, an efficient estimation of the transformation matrix is proposed, reducing the number of iterations needed to find the demixing matrices in the JADIAG algorithm. Experiments show that about 71% of the computation time can be saved.
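The sketch below illustrates only the first speedup idea in simplified form: grouping frequency bins into Bark-scale critical bands and sharing one demixing matrix per band. `estimate_demixing` is a whitening placeholder standing in for the JADIAG joint-diagonalization step, which is not reproduced here.

```python
# Sketch of critical-band sharing of demixing matrices (JADIAG itself omitted).
import numpy as np


def bark_band(freq_hz: float) -> int:
    """Traunmüller's approximation of the Bark critical-band index."""
    return int(26.81 * freq_hz / (1960.0 + freq_hz) - 0.53)


def estimate_demixing(csd_matrices: np.ndarray) -> np.ndarray:
    """Placeholder: whiten with the band-averaged cross-spectral density."""
    mean_csd = csd_matrices.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(mean_csd)
    return np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12))) @ eigvec.conj().T


def banded_demixing(stft: np.ndarray, sr: int) -> dict:
    """stft: (channels, bins, frames) complex STFT.  Returns {bin: demixing matrix}."""
    n_ch, n_bins, n_frames = stft.shape
    freqs = np.linspace(0, sr / 2, n_bins)
    bands = {}
    for k in range(n_bins):
        bands.setdefault(bark_band(freqs[k]), []).append(k)
    demix = {}
    for bins in bands.values():
        # per-bin cross-spectral density matrices, averaged over frames
        csd = np.stack([stft[:, k, :] @ stft[:, k, :].conj().T / n_frames for k in bins])
        W = estimate_demixing(csd)   # one matrix shared by the whole critical band
        for k in bins:
            demix[k] = W
    return demix
```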
{"title":"A speedup method for the separation of speech signals in frequency domain","authors":"Shih-Hsun Chen, Hsiao-Chuan Wang","doi":"10.1109/ISCSLP.2010.5684892","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684892","url":null,"abstract":"The independent component analysis (ICA) is a commonly used method to find the demixing matrix for the blind source separation (BSS). For speech signals, we should solve BSS problems in the convolutive mixing model, i.e., ICA technique is extended to the frequency domain. The cross-spectral density matrices are computed for each frequency bin instead of covariance matrices in time domain. The joint approximate diagonalization (JADIAG) algorithm proposed by D. T. Pham has been proved to be effective in dealing with the convolutive mixing problem. This paper presents a method to speed up the JADIAG computation in two phases. First, the critical band property of human auditory system is applied so that a set of selected demixing matrices is shared in a critical band to reduce the number of demixing matrices. Second, an efficient estimation of transformation matrix is proposed so that the iterations for finding the demixing matrices in JADIAG algorithm are reduced. The experiment shows that about 71% of computation time can be reduced.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114513620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An initial investigation of L1 and L2 discourse speech planning in English
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684851
Chiu-yu Tseng, Zhao-yu Su, Chi-Feng Huang, T. Visceglia
A perceptually-based hierarchy of prosodic phrase group (HPG) framework was used in this study to investigate similarities and differences in the size and strategy of discourse-level speech planning across L1 and L2 English speaker groups. While both groups appear to produce similar configurations of acoustic contrasts to signal discourse boundaries, L1 speakers were found to produce these cues more robustly in English. Differences were also found between the L1 English and L1 Taiwan Mandarin speaker groups in the distribution of prosodic break levels and break locations. These differences in the L1 and L2 organization of discourse prosody in English can largely be attributed to between-group differences in speech planning and chunking strategies, whereby L2 speakers use more intermediate chunking units and fewer larger-scale planning units in their prosodic discourse organization. With a better understanding of prosody transfer, we believe that technology developed for L1 Mandarin spoken language processing may be applied, with little modification, to L2 English produced by the same speaker population.
{"title":"An initial investigation of L1 and L2 discourse speech planning in English","authors":"Chiu-yu Tseng, Zhao-yu Su, Chi-Feng Huang, T. Visceglia","doi":"10.1109/ISCSLP.2010.5684851","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684851","url":null,"abstract":"A perceptually-based hierarchy of prosodic phrase group (HPG) framework was used in this study to investigate similarities and differences in the size and strategy of discourse-level speech planning across L1 and L2 English speaker groups. While both groups appear to produce similar configurations of acoustic contrasts to signal discourse boundaries, L1 speakers were found to produce these cues more robustly in English. Differences were also found between L1 English and L1 Taiwan Mandarin speaker groups with respect to the distribution of prosodic break levels and break locations. These differences in L1 and L2 organization of discourse speech prosody in English can be largely attributed to between-group differences in speech planning and chunking strategies whereby L2 speakers use more intermediate chunking units and fewer larger-scale planning units in their prosodic discourse organization. Through more understanding of prosody transfer, we believe that technology developed on the basis of L1 Mandarin spoken language processing may be applied to L2 English produced by the same speaker population, with little modification.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122166853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-negative matrix factorization based discriminative features for speaker verification
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684891
Yanhua Long, Lirong Dai, Eryu Wang, B. Ma, Wu Guo
Discovering a discriminative feature representation together with a suitable distance measure is key to a successful speaker recognition system. In this paper, we propose a new approach to automatic speaker verification. The main contributions are the extraction of discriminative speaker features using non-negative matrix factorization (NMF) decomposition in the GMM mean space, and the use of a cosine-distance measure for speaker classification. With this decomposition, the speaker space is represented by the pattern components, while each speaker is characterized by a coefficient vector representing a specific location in that space. We validate the proposed approach on the 10-second training and 10-second testing condition constructed from the 863 Putonghua (Mandarin) corpus. Relative improvements of 10.57% and 26.11% over the conventional GMM-UBM system are achieved for female and male trials, respectively.
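A minimal sketch of the scoring pipeline, under the assumption that non-negative utterance supervectors are already available (random stand-in data is used here), might look like the following; the NMF rank and library choices are illustrative.

```python
# Minimal sketch: NMF coefficient vectors as speaker features, scored with cosine similarity.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
supervectors = np.abs(rng.normal(size=(200, 1024)))   # 200 utterances, stand-in data

nmf = NMF(n_components=64, init="nndsvd", max_iter=500)
coeffs = nmf.fit_transform(supervectors)              # one coefficient vector per utterance

enroll, test = coeffs[0:1], coeffs[1:2]               # pretend these form one trial
score = cosine_similarity(enroll, test)[0, 0]         # accept if score exceeds a threshold
print(f"cosine score = {score:.3f}")
```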
{"title":"Non-negative matrix factorization based discriminative features for speaker verification","authors":"Yanhua Long, Lirong Dai, Eryu Wang, B. Ma, Wu Guo","doi":"10.1109/ISCSLP.2010.5684891","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684891","url":null,"abstract":"Discovering a discriminative feature representative together with a suitable distance measure is the key for a successful speaker recognition system. In this paper, we propose a new approach for automatic speaker verification. The main contribution of the paper is the extraction of discriminative speaker features using non-negative matrix factorization (NMF) decomposition in the GMM mean space, and the use of cosine-distance measure for speaker classification. With the decomposition, the speaker space is represented by the pattern components while a speaker can be characterized by a coefficient vector representing a specific localization in the space. We validate the proposed approach on the 10-second training and 10-second testing condition constructed from 863 Putonghua (Mandarin) corpus. Relative 10.57% and 26.11% improvements compared to the conventional GMM-UBM system have been achieved for female and male trials respectively.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128675506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effects of F0 dimensions in perception of Mandarin tones
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684878
Bin Li, Caicai Zhang
This study focuses on the perception of two synthesized Mandarin tones: the high level tone (Tone 1) and the high falling tone (Tone 4), which have been reported to be difficult for Cantonese learners of Mandarin [15]. Because the two tones differ in F0 direction and also vary in F0 onset, it is worth investigating why Cantonese listeners find them perceptually indistinguishable. We aim to determine which F0 cues Cantonese listeners rely on in perceiving these two Mandarin tones by modifying the F0 curves along two dimensions: F0 onset and F0 slope. Results show that Mandarin listeners can identify the two tones from F0 slope irrespective of F0 onset, whereas Cantonese listeners appear more sensitive to variation in F0 onset.
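For illustration, the short sketch below generates the kind of two-dimensional stimulus grid described above, varying F0 onset and F0 slope over a linear contour; the specific onset and slope values are invented, not the paper's.

```python
# Sketch of an onset x slope grid of linear F0 contours for tone-perception stimuli.
import numpy as np

duration_s, frame_rate = 0.4, 100                  # 400 ms contour, 100 frames per second
t = np.linspace(0, duration_s, int(duration_s * frame_rate))

onsets_hz = [180, 200, 220]                        # varied F0 onset
slopes_hz_per_s = [0, -50, -100, -150]             # level vs. increasingly falling

stimuli = {(f0, k): f0 + k * t for f0 in onsets_hz for k in slopes_hz_per_s}
for (f0, k), contour in sorted(stimuli.items()):
    print(f"onset {f0:3d} Hz, slope {k:5d} Hz/s -> offset {contour[-1]:.0f} Hz")
```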
{"title":"Effects of F0 dimensions in perception of Mandarin tones","authors":"Bin Li, Caicai Zhang","doi":"10.1109/ISCSLP.2010.5684878","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684878","url":null,"abstract":"This study focuses on the perception of two synthesized Mandarin tones: the high level tone (Tone 1) and the high falling tone (Tone 4), which have been reported difficult for Cantonese learners of Mandarin [15]. As the two tones are distinctive in F0 directions and also vary in F0 onsets, it is worth investigating why Cantonese listeners find them perceptually indistinguishable. We aim to find out what F0 cues Cantonese listeners rely on in perceiving these two Mandarin tones by modifying the F0 curves along two dimensions: F0 onset and F0 slope. Results show that Mandarin listeners are able to identify the two pitches based on F0 slope irrespective of F0 onsets, whereas Cantonese listeners seem more sensitive towards the variation of F0 onsets.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130540044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantics-based language modeling for Cantonese-English code-mixing speech recognition
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684900
Houwei Cao, P. Ching, Tan Lee, Y. Yeung
This paper addresses language modeling for LVCSR of Cantonese-English code-mixing utterances spoken in daily communication. In the absence of a sufficient amount of code-mixing text data, translation-based and semantics-based mappings are applied to n-grams to better estimate the probabilities of low-frequency and unseen mixed-language n-gram events. In the translation-based mapping scheme, a Cantonese-to-English translation dictionary is used to transcribe monolingual Cantonese n-grams into mixed-language n-grams. In the semantics-based mapping scheme, n-gram mapping is based on the meaning and syntactic function of the English words in the lexicon. Different semantics-based language models are trained with different mapping schemes and evaluated in terms of perplexity and in an LVCSR task. Experimental results confirm that the more mixed-language n-grams are observed after mapping, the better the language-model perplexity and the recognition performance. The proposed language models show significant improvements in the recognition of embedded English words compared with the baseline 3-gram LM. The best recognition accuracies attained are 63.9% for English words and 74.7% for Cantonese characters in the code-mixing utterances.
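As a toy illustration of translation-based mapping only (not the paper's full scheme), the sketch below copies the count of a monolingual Cantonese bigram to the mixed-language bigrams obtained by substituting one word with its dictionary translation; the dictionary entries and counts are made up, and the mapped counts would then feed into n-gram estimation.

```python
# Toy sketch of translation-based n-gram mapping for unseen mixed-language bigrams.
from collections import Counter

cantonese_to_english = {"電腦": ["computer"], "檢查": ["check"]}   # toy C2E dictionary

cantonese_bigrams = Counter({("用", "電腦"): 12, ("檢查", "郵件"): 7})

mixed_bigrams = Counter()
for (w1, w2), count in cantonese_bigrams.items():
    for e in cantonese_to_english.get(w1, []):
        mixed_bigrams[(e, w2)] += count        # e.g. ("check", "郵件")
    for e in cantonese_to_english.get(w2, []):
        mixed_bigrams[(w1, e)] += count        # e.g. ("用", "computer")

print(mixed_bigrams)   # mapped counts can then be interpolated into the 3-gram LM
```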
{"title":"Semantics-based language modeling for Cantonese-English code-mixing speech recognition","authors":"Houwei Cao, P. Ching, Tan Lee, Y. Yeung","doi":"10.1109/ISCSLP.2010.5684900","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684900","url":null,"abstract":"This paper addresses the problem of language modeling for LVCSR of Cantonese-English code-mixing utterances spoken in daily communications. In the absence of sufficient amount of code-mixing text data, translation-based and semantics-based mapping are applied on n-grams to better estimate the probability of low-frequency and unseen mixed-language n-grams events. In translation-based mapping scheme, the Cantonese-to-English translation dictionary is adopted to transcribe monolingual Cantonese n-grams to mixed-language n-grams. In semantics-based mapping scheme, n-gram mapping is based on the meaning and syntactic function of the English words in the lexicon. Different semantics-based language models are trained with different mapping schemes. They are evaluated in terms of perplexity and in the task of LVCSR. Experimental results confirm that, the more the observed mixed-language n-grams after mapping, the better the language model perplexity as well as the recognition performance. The proposed language models show significant improvement on recognition performance on embedded English words when they are compared with the baseline 3-gram LM. The best recognition accuracy attained is 63.9% and 74.7% respectively for the English words and Cantonese characters in code-mixing utterances.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121664358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Problems of modeling phone deletion in conversational speech for speech recognition
Pub Date: 2010-11-01, DOI: 10.1109/ISCSLP.2010.5684839
B. Mak, Tom Ko
Recently we proposed a novel method to explicitly model the phone deletion phenomenon in speech and introduced the context-dependent fragmented word model (CD-FWM). An evaluation on the WSJ1 Hub2 5K task showed that, even in read speech, CD-FWM could reduce the word error rate (WER) by a relative 10.3%. Since phone deletion is generally expected to be more pronounced in conversational and spontaneous speech than in read speech, in this paper we extend our investigation of modeling phone deletion in conversation using CD-FWM on the SVitchboard 500-word task. To our surprise, a much smaller recognition gain is obtained. Through a series of analyses, we present some plausible explanations for why phone deletion modeling is more successful in read speech than in conversational speech, and we suggest future directions for improving CD-FWM for recognizing conversational speech.
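The fragment below does not reproduce CD-FWM (which operates on context-dependent HMMs); it merely illustrates the phenomenon being modeled at the lexicon level by enumerating pronunciation variants in which non-initial phones may be deleted. The phone set and the `max_del` limit are illustrative assumptions.

```python
# Toy sketch: enumerate pronunciation variants with up to `max_del` deleted phones,
# as one might add to a lexicon to cover reduced conversational pronunciations.
from itertools import combinations


def deletion_variants(phones, max_del: int = 1):
    variants = {tuple(phones)}
    for k in range(1, max_del + 1):
        for drop in combinations(range(1, len(phones)), k):   # keep the first phone
            variants.add(tuple(p for i, p in enumerate(phones) if i not in drop))
    return sorted(variants)


for v in deletion_variants(["p", "r", "aa", "b", "l", "ah", "m"], max_del=1):
    print(" ".join(v))   # e.g. "problem" with the schwa deleted: p r aa b l m
```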
{"title":"Problems of modeling phone deletion in conversational speech for speech recognition","authors":"B. Mak, Tom Ko","doi":"10.1109/ISCSLP.2010.5684839","DOIUrl":"https://doi.org/10.1109/ISCSLP.2010.5684839","url":null,"abstract":"Recently we proposed a novel method to explicitly model the phone deletion phenomenon in speech, and introduced the context-dependent fragmented word model (CD-FWM). An evaluation on the WSJ1 Hub2 5K task shows that even in read speech, CD-FWM could reduce word error rate (WER) by a relative 10.3%. Since it is generally expected that the phone deletion phenomenon is more pronounced in conversational and spontaneous speech than in read speech, we extend our investigation of modeling phone deletion in conversation using CD-FWM on the SVitchboard 500-word task in this paper. To our surprise, much smaller recognition gain is obtained. Through a series of analyses, we present some plausible explanations for why phone deletion modeling is more successful in read speech than in conversational speech, and suggest future directions in improving CD-FWM for recognizing conversational speech.","PeriodicalId":226730,"journal":{"name":"2010 7th International Symposium on Chinese Spoken Language Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127783581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}