The segmentation of text and speech into topics and subtopics is an important step in document interpretation. For text, formatting information, such as headings and paragraphing, is available to aid in this endeavor, although this information is by no means sufficient. For speech, the task is even more difficult. We present results of the application of machine learning techniques to the automatic identification of intonational phrases beginning and ending 'topics' determined independently by annotators for two corpora: the Boston Directions Corpus and the Broadcast News (HUB-4) DARPA/NIST database.
{"title":"Acoustic indicators of topic segmentation","authors":"Julia Hirschberg, C. H. Nakatani","doi":"10.21437/ICSLP.1998-582","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-582","url":null,"abstract":"The segmentation of text and speech into topics and subtopics is an important step in document interpretation. For text, formatting information, such as headings and paragraphing, is available to aid in this endeavor, although this information is by no means sufficient. For speech, the task is even more difficult. We present results of the application of machine learning techniques to the automatic identification of intonational phrases beginning and ending 'topics' determined independently by annotators for two corpora: the Boston Directions Corpus and the Broadcast News (HUB-4) DARPA/NIST database.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128117617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this study, we investigate the effectiveness of an unknown word processing (UWP) algorithm, incorporated into an N-gram language model based speech recognition system, for dealing with filled pauses and out-of-vocabulary (OOV) words. We have previously investigated the effect of the UWP algorithm, which utilizes a simple subword sequence decoder, in a spoken dialog system using a context-free grammar (CFG) as a language model. Here, the effect of the UWP algorithm was investigated using an N-gram based continuous speech recognition system on both a small dialog task and a large-vocabulary read speech dictation task. The experimental results showed that the UWP improves recognition accuracy, and that an N-gram based system with the UWP can improve understanding performance compared with a CFG-based system.
{"title":"Dealing with out-of-vocabulary words and speech disfluencies in an n-gram based speech understanding system","authors":"A. Kai, Y. Hirose, S. Nakagawa","doi":"10.21437/ICSLP.1998-648","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-648","url":null,"abstract":"In this study, we investigate the effectiveness of an unknown word processing (UWP) algorithm, incorporated into an N-gram language model based speech recognition system, for dealing with filled pauses and out-of-vocabulary (OOV) words. We have previously investigated the effect of the UWP algorithm, which utilizes a simple subword sequence decoder, in a spoken dialog system using a context-free grammar (CFG) as a language model. Here, the effect of the UWP algorithm was investigated using an N-gram based continuous speech recognition system on both a small dialog task and a large-vocabulary read speech dictation task. The experimental results showed that the UWP improves recognition accuracy, and that an N-gram based system with the UWP can improve understanding performance compared with a CFG-based system.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125718849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel method for modeling phonetic context using linear context transforms. Initial investigations have shown the feasibility of synthesising context-dependent models from context-independent models through weighted interpolation of the peripheral states of a given hidden Markov model with its adjacent model. This idea can be further extended to maximum likelihood estimation of not only single weights, but a matrix of weights, i.e. a transform. This paper outlines the application of Maximum Likelihood Linear Regression (MLLR) as a means of modeling context dependency in continuous density hidden Markov models (HMMs).
{"title":"Context dependent tree based transforms for phonetic speech recognition","authors":"Bernard Doherty, S. Vaseghi, P. McCourt","doi":"10.21437/ICSLP.1998-645","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-645","url":null,"abstract":"This paper presents a novel method for modeling phonetic context using linear context transforms. Initial investigations have shown the feasibility of synthesising context-dependent models from context-independent models through weighted interpolation of the peripheral states of a given hidden Markov model with its adjacent model. This idea can be further extended to maximum likelihood estimation of not only single weights, but a matrix of weights, i.e. a transform. This paper outlines the application of Maximum Likelihood Linear Regression (MLLR) as a means of modeling context dependency in continuous density hidden Markov models (HMMs).","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127910160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
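The core operation in the abstract above — estimating a model's Gaussian mean through a linear (MLLR-style) transform, using the standard extended-mean formulation — can be sketched as follows. This is a minimal numerical illustration, not the authors' implementation; the dimensions and transform values are purely hypothetical.

```python
import numpy as np

def apply_mllr_transform(mu, W):
    """Apply an MLLR-style affine transform to a Gaussian mean vector.

    mu : original mean vector, shape (d,)
    W  : transform matrix, shape (d, d+1); the last column is the bias term
    Returns the transformed mean, shape (d,).
    """
    xi = np.append(mu, 1.0)  # extended mean vector [mu; 1]
    return W @ xi

# Hypothetical 3-dimensional example: identity rotation plus a small bias,
# standing in for a maximum-likelihood-estimated context transform.
mu_ci = np.array([0.5, -1.2, 2.0])                       # context-independent mean
W = np.hstack([np.eye(3), np.array([[0.1], [0.0], [-0.3]])])
mu_cd = apply_mllr_transform(mu_ci, W)                   # synthesized context-dependent mean
```

In actual MLLR, `W` is estimated by maximizing the likelihood of adaptation (or, as proposed above, context-specific) data; here it is fixed by hand purely to show the mechanics.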
The present paper examines what kinds of sandhi Shanghai disyllabic lexical tones undergo, and especially in what sense and to what extent a disyllabic tone can be claimed to result from rightward spreading of the corresponding citation tone. It will be shown that F0 spreading occurs in the Long tone domains, while Contour element spreading occurs mainly in the Short tone domains.
{"title":"What spreads, and how? tonal rightward spreading on shanghai disyllabic compounds","authors":"X. Zhu","doi":"10.21437/ICSLP.1998-145","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-145","url":null,"abstract":"The present paper examines what kinds of sandhi Shanghai disyllabic lexical tones undergo, and especially in what sense and to what extent a disyllabic tone can be claimed to result from rightward spreading of the corresponding citation tone. It will be shown that F0 spreading occurs in the Long tone domains, while Contour element spreading occurs mainly in the Short tone domains.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128192571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we compare two different methods for phonetically labeling a speech database. The first approach is based on the alignment of the speech signal on a high-quality synthetic speech pattern, and the second one uses a hybrid HMM/ANN system. Both systems have been evaluated on manually segmented French read utterances from a speaker never seen in the training stage of the HMM/ANN system. This study outlines the advantages and drawbacks of both methods. The high-quality speech synthesis system has the great advantage that no training stage is needed, while the classical HMM/ANN system easily allows multiple phonetic transcriptions. From this we derive a method for the automatic construction of phonetically labeled speech databases, based on using the synthetic speech segmentation tool to bootstrap the training process of our hybrid HMM/ANN system. Such segmentation tools will be a key point for the development of improved speech synthesis and recognition systems.
{"title":"Phonetic alignment: speech synthesis based vs. hybrid HMM/ANN","authors":"F. Malfrère, O. Deroo, T. Dutoit","doi":"10.21437/ICSLP.1998-595","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-595","url":null,"abstract":"In this paper we compare two different methods for phonetically labeling a speech database. The first approach is based on the alignment of the speech signal on a high-quality synthetic speech pattern, and the second one uses a hybrid HMM/ANN system. Both systems have been evaluated on manually segmented French read utterances from a speaker never seen in the training stage of the HMM/ANN system. This study outlines the advantages and drawbacks of both methods. The high-quality speech synthesis system has the great advantage that no training stage is needed, while the classical HMM/ANN system easily allows multiple phonetic transcriptions. From this we derive a method for the automatic construction of phonetically labeled speech databases, based on using the synthetic speech segmentation tool to bootstrap the training process of our hybrid HMM/ANN system. Such segmentation tools will be a key point for the development of improved speech synthesis and recognition systems.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115831748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We investigated adult Japanese speakers' deficiencies in English spoken word recognition. We found that accurate recognition of the first syllable, or the initial portion, of each word played an important role in recognizing the word correctly. The study suggests that recognition performance can be enhanced by speech processing methods such as time-scale expansion and/or dynamic range compression. Although approximately 85 percent of English words begin with strong syllables [1], many of them do not carry a sentence stress and are not pronounced as clearly as isolated words. Moreover, the duration of a word, especially one at the beginning of an utterance, is often so short that the listener cannot recognize it correctly. Two experiments were administered in an anechoic room. In the first experiment, subjects listened to extracted words and corresponding isolated words of English, including words without primary stress on the first syllable. We found that subjects had difficulty recognizing both the isolated words and the extracted words, especially when a word did not begin with a strong syllable and therefore sounded somewhat unclear; this is quite frequent in normal English speech. We confirmed that subjects had difficulty recognizing words beginning with weak syllables, and we conclude that the first syllable plays an important role in word recognition, at least for Japanese speakers. In the second experiment, the extracted words and the corresponding time-scale expanded words (henceforth, expanded words) were presented. The results indicated that the expanded words were better recognized. We found that time-scale modification (henceforth, TSM) of the extracted words did not reduce intelligibility even at ratios around 2.00, as was clear from the improved recognition.
{"title":"The importance of the first syllable in English spoken word recognition by adult Japanese speakers","authors":"Kazuo Nakayama, Kaoru Tomita-Nakayama","doi":"10.21437/ICSLP.1998-764","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-764","url":null,"abstract":"We investigated adult Japanese speakers' deficiencies in English spoken word recognition. We found that accurate recognition of the first syllable, or the initial portion, of each word played an important role in recognizing the word correctly. The study suggests that recognition performance can be enhanced by speech processing methods such as time-scale expansion and/or dynamic range compression. Although approximately 85 percent of English words begin with strong syllables [1], many of them do not carry a sentence stress and are not pronounced as clearly as isolated words. Moreover, the duration of a word, especially one at the beginning of an utterance, is often so short that the listener cannot recognize it correctly. Two experiments were administered in an anechoic room. In the first experiment, subjects listened to extracted words and corresponding isolated words of English, including words without primary stress on the first syllable. We found that subjects had difficulty recognizing both the isolated words and the extracted words, especially when a word did not begin with a strong syllable and therefore sounded somewhat unclear; this is quite frequent in normal English speech. We confirmed that subjects had difficulty recognizing words beginning with weak syllables, and we conclude that the first syllable plays an important role in word recognition, at least for Japanese speakers. In the second experiment, the extracted words and the corresponding time-scale expanded words (henceforth, expanded words) were presented. The results indicated that the expanded words were better recognized. We found that time-scale modification (henceforth, TSM) of the extracted words did not reduce intelligibility even at ratios around 2.00, as was clear from the improved recognition.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115896967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programs for testing and training of difficult vowel distinctions in American English were created for subjects to access via the Internet using a web browser. The testing and training data include many likely vowel confusions for speakers of different L1s. The training program focuses on one distinction at a time, and adjusts to concentrate on particular contexts or exemplars that are difficult for the individual subject. In the current study, 52 subjects participated in testing and 2 subjects participated in training. In the testing portion, results indicate that the L1 and the fluency level in English, as well as individual variability, have an effect on perceptual ability. In the training portion, subjects showed significant improvement on the contrasts on which they trained. Because these programs make extensive data collection over large populations and large distances easy, this method of research will facilitate further investigation of questions regarding second language acquisition.
{"title":"Computer-mediated input and the acquisition of L2 vowels","authors":"M. Fox","doi":"10.21437/ICSLP.1998-844","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-844","url":null,"abstract":"Programs for testing and training of difficult vowel distinctions in American English were created for subjects to access via the Internet using a web browser. The testing and training data include many likely vowel confusions for speakers of different L1s. The training program focuses on one distinction at a time, and adjusts to concentrate on particular contexts or exemplars that are difficult for the individual subject. In the current study, 52 subjects participated in testing and 2 subjects participated in training. In the testing portion, results indicate that the L1 and the fluency level in English, as well as individual variability, have an effect on perceptual ability. In the training portion, subjects showed significant improvement on the contrasts on which they trained. Because these programs make extensive data collection over large populations and large distances easy, this method of research will facilitate further investigation of questions regarding second language acquisition.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132051206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a speaker adaptation technique for a phonetic vocoder based on HMMs. In the vocoder, the encoder performs phoneme recognition and transmits phoneme indexes and state durations to the decoder, and the decoder synthesizes speech using an HMM-based speech synthesis technique. One of the main problems of this vocoder is that the voice characteristics of the synthetic speech depend on the HMMs used in the decoder, and are therefore fixed regardless of the input speaker. To overcome this problem, we adapt the HMMs to the input speech by transmitting transfer vectors, which represent the mismatch between the input speech and the HMMs. The results of subjective tests show that the performance of the proposed vocoder without quantization of the transfer vectors is comparable to that of a speaker-dependent vocoder.
{"title":"A very low bit rate speech coder using HMM with speaker adaptation","authors":"T. Masuko, K. Tokuda, Takao Kobayashi","doi":"10.21437/ICSLP.1998-375","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-375","url":null,"abstract":"This paper describes a speaker adaptation technique for a phonetic vocoder based on HMMs. In the vocoder, the encoder performs phoneme recognition and transmits phoneme indexes and state durations to the decoder, and the decoder synthesizes speech using an HMM-based speech synthesis technique. One of the main problems of this vocoder is that the voice characteristics of the synthetic speech depend on the HMMs used in the decoder, and are therefore fixed regardless of the input speaker. To overcome this problem, we adapt the HMMs to the input speech by transmitting transfer vectors, which represent the mismatch between the input speech and the HMMs. The results of subjective tests show that the performance of the proposed vocoder without quantization of the transfer vectors is comparable to that of a speaker-dependent vocoder.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132208532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
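The transfer-vector idea in the abstract above — the encoder measures per-state mean mismatch between the input speech and the HMMs, and the decoder shifts its model means by those vectors — can be sketched in its simplest mean-shift form. This is a hedged illustration with hypothetical shapes and values; the actual vocoder additionally quantizes and codes the vectors, which is omitted here.

```python
import numpy as np

def estimate_transfer_vectors(frames, state_alignment, means):
    """Encoder side: for each HMM state, the difference between the mean of
    the input frames aligned to that state and the model's mean vector.

    frames          : (num_frames, dim) input feature vectors
    state_alignment : (num_frames,) state index assigned to each frame
    means           : (num_states, dim) decoder-side Gaussian means
    """
    tv = np.zeros_like(means)
    for s in range(means.shape[0]):
        seg = frames[state_alignment == s]
        if len(seg):
            tv[s] = seg.mean(axis=0) - means[s]
    return tv

def adapt_means(means, transfer_vectors):
    """Decoder side: shift the synthesis model's means by the received vectors."""
    return means + transfer_vectors

# Tiny hypothetical example: 3 frames, 2 states, 2-dimensional features.
frames = np.array([[1.0, 1.0], [3.0, 3.0], [0.0, 4.0]])
alignment = np.array([0, 0, 1])
means = np.array([[1.0, 1.0], [0.0, 0.0]])
tv = estimate_transfer_vectors(frames, alignment, means)
adapted = adapt_means(means, tv)
```

After adaptation, each state's mean matches the mean of the input frames aligned to it, which is the sense in which the decoder-side HMMs track the input speaker.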
Ruxin Chen, Miyuki Tanaka, Duanpei Wu, L. Olorenshaw, Mariscela Amador
This paper reports on a large vocabulary speaker independent isolated word recognizer targeting 50,000 words. The system supports a unique four-layer sharing structure for either continuous HMMs or discrete HMMs. Evaluation is performed using a dictionary of 5000 US city names, a dictionary of the 5000 most frequent English words, a dictionary of 50,000 English words, and the 110,000-word CMU English dictionary. For these dictionaries, recognition accuracy ranges from 90% to 93% for the top 3 results.
{"title":"A four layer sharing HMM system for very large vocabulary isolated word recognition","authors":"Ruxin Chen, Miyuki Tanaka, Duanpei Wu, L. Olorenshaw, Mariscela Amador","doi":"10.21437/ICSLP.1998-284","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-284","url":null,"abstract":"This paper reports on a large vocabulary speaker independent isolated word recognizer targeting 50,000 words. The system supports a unique four-layer sharing structure for either continuous HMMs or discrete HMMs. Evaluation is performed using a dictionary of 5000 US city names, a dictionary of the 5000 most frequent English words, a dictionary of 50,000 English words, and the 110,000-word CMU English dictionary. For these dictionaries, recognition accuracy ranges from 90% to 93% for the top 3 results.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132355017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a simple and efficient time domain technique to estimate an all-pole model on a mel-frequency axis (Mel-LPC). This method requires only twice the computational cost of conventional linear prediction analysis. The recognition performance of mel-cepstral parameters obtained by Mel-LPC analysis is compared with that of conventional LP mel-cepstra and mel-frequency cepstrum coefficients (MFCC) through gender-dependent phoneme and word recognition tests. The results show that the Mel-LPC cepstrum attains a significant improvement in recognition accuracy over the conventional LP mel-cepstrum, and gives slightly higher accuracy for male speakers and slightly lower accuracy for female speakers than MFCC.
{"title":"An efficient mel-LPC analysis method for speech recognition","authors":"H. Matsumoto, Y. Nakatoh, Y. Furuhata","doi":"10.21437/ICSLP.1998-536","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-536","url":null,"abstract":"This paper proposes a simple and efficient time domain technique to estimate an all-pole model on a mel-frequency axis (Mel-LPC). This method requires only twice the computational cost of conventional linear prediction analysis. The recognition performance of mel-cepstral parameters obtained by Mel-LPC analysis is compared with that of conventional LP mel-cepstra and mel-frequency cepstrum coefficients (MFCC) through gender-dependent phoneme and word recognition tests. The results show that the Mel-LPC cepstrum attains a significant improvement in recognition accuracy over the conventional LP mel-cepstrum, and gives slightly higher accuracy for male speakers and slightly lower accuracy for female speakers than MFCC.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"239 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132403918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
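A mel-frequency axis of the kind used in the analysis above is commonly realized through the phase response of a first-order all-pass (bilinear) warping. The following sketch shows that frequency mapping under the assumption that the standard bilinear warp applies; the α values in the comment are typical textbook choices, not taken from the paper.

```python
import math

def warp_frequency(omega, alpha):
    """Map a normalized frequency omega (radians, 0..pi) through the phase
    response of a first-order all-pass filter, giving the warped frequency.

    alpha controls how strongly low frequencies are stretched; values around
    0.35 (8 kHz sampling) or 0.42 (16 kHz) roughly approximate the mel scale.
    """
    return omega + 2.0 * math.atan(
        alpha * math.sin(omega) / (1.0 - alpha * math.cos(omega))
    )

# With alpha > 0, mid-range frequencies are pushed upward on the warped
# axis, so more modeling resolution is spent on the perceptually
# important low-frequency region; the endpoints 0 and pi stay fixed.
low = warp_frequency(0.5, 0.42)
mid = warp_frequency(1.0, 0.42)
```

The all-pole fit itself is then performed on this warped axis; the efficiency claim in the abstract refers to doing so directly in the time domain.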