In a pilot study of phonetic modification of function words in 2 spontaneous speech dialogues, 99 utterances of the syllable /tu/ corresponding to to, two, too, -to and to- included the pronunciation variants [tʰu, tʰ, D, d, n, s, s, tʰ, , tʰ]. Factors influencing phonetic modification included phonetic context, prosody, part of speech, adjacent disfluency and individual speaker. 11% of the acoustic landmarks defining /t/ closure, /t/ release and vowel jaw-opening maximum were not detectable in hand labelling. In a separate corpus, 59% of recognition errors involved grammatical or function words such as conjunctions, articles, prepositions, pronouns and auxiliary verbs, and of 17 tokens of /tu/, half were misrecognized. Implications of these preliminary results for linguistic theory, cognitive modelling of speech processing and automatic speech recognition are discussed.
{"title":"Phonetic modification of the syllable /tu/ in two spontaneous american English dialogues","authors":"N. Veilleux, S. Shattuck-Hufnagel","doi":"10.21437/ICSLP.1998-673","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-673","url":null,"abstract":"In a pilot study of phonetic modification of function words in 2 spontaneous speech dialogues, 99 utterances of the syllable /tu/ corresponding to to, two, too, -to and to- included the pronunciation variants [tʰu, tʰ, D, d, n, s, s, tʰ, , tʰ]. Factors influencing phonetic modification included phonetic context, prosody, part of speech, adjacent disfluency and individual speaker. 11% of the acoustic landmarks defining /t/ closure, /t/ release and vowel jaw-opening maximum were not detectable in hand labelling. In a separate corpus, 59% of recognition errors involved grammatical or function words such as conjunctions, articles, prepositions, pronouns and auxiliary verbs, and of 17 tokens of /tu/, half were misrecognized. Implications of these preliminary results for linguistic theory, cognitive modelling of speech processing and automatic speech recognition are discussed.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116011815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Warnke, E. Nöth, J. Buckow, S. Harbeck, H. Niemann
In this paper, we present a bootstrap training approach for language model (LM) classifiers. By training class-dependent LMs and running them in parallel, LMs can serve as classifiers for any kind of symbol sequence, e.g., word or phoneme sequences for tasks like topic spotting or language identification (LID). Irrespective of the particular symbol sequence used by a LM classifier, each LM is trained on a manually labeled training set for its class, obtained from not necessarily cooperative speakers. We therefore have to face some erroneous labels and deviations from the originally intended class specification; both can worsen classification. It might therefore be better not to use all utterances for training but to automatically select those utterances that improve recognition accuracy; this can be done by a bootstrap procedure. We present the results achieved with our best approach on the VERBMOBIL corpus for the tasks of dialog act classification and LID.
{"title":"A bootstrap training approach for language model classifiers","authors":"V. Warnke, E. Nöth, J. Buckow, S. Harbeck, H. Niemann","doi":"10.21437/ICSLP.1998-770","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-770","url":null,"abstract":"In this paper, we present a bootstrap training approach for language model (LM) classifiers. Training class dependent LM and running them in parallel, LM can serve as classifiers with any kind of symbol sequence, e.g., word or phoneme sequences for tasks like topic spotting or language identification (LID). Irrespective of the special symbol sequence used for a LM classifier, the training of a LM is done with a manually labeled training set for each class obtained from not necessarily cooperative speakers. Therefore, we have to face some erroneous labels and deviations from the originally intended class specification. Both facts can worsen classification. It might therefore be better not to use all utterances for training but to automatically select those utterances that improve recognition accuracy; this can be done by a bootstrap procedure. We present the results achieved with our best approach on the VERBMOBIL corpus for the tasks dialog act classification and LID.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"483 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116389478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
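The selection idea above (train class-dependent LMs, score held-out data, drop training utterances whose removal helps) can be sketched with toy add-one-smoothed unigram models. The function names and the leave-one-out selection criterion are illustrative assumptions, not the paper's exact procedure:

```python
import math
from collections import Counter

def train_unigram_lm(utterances):
    """Train an add-one-smoothed unigram LM from a list of token lists."""
    counts = Counter(tok for utt in utterances for tok in utt)
    total = sum(counts.values())
    vocab_size = len(counts)
    def logprob(utt):
        # +1 in the denominator reserves probability mass for unseen symbols
        return sum(math.log((counts.get(t, 0) + 1) / (total + vocab_size + 1))
                   for t in utt)
    return logprob

def classify(utt, lms):
    """Run the class-dependent LMs in parallel and pick the best-scoring class."""
    return max(lms, key=lambda c: lms[c](utt))

def accuracy(lms, data):
    """Classification accuracy over (label, utterance) pairs."""
    return sum(classify(utt, lms) == c for c, utt in data) / len(data)

def bootstrap_select(labeled, held_out, rounds=1):
    """Drop any training utterance whose removal improves held-out accuracy."""
    selected = {c: list(utts) for c, utts in labeled.items()}
    for _ in range(rounds):
        lms = {c: train_unigram_lm(u) for c, u in selected.items()}
        base = accuracy(lms, held_out)
        for c in selected:
            kept = []
            for i, utt in enumerate(selected[c]):
                trial = {**selected, c: selected[c][:i] + selected[c][i + 1:]}
                trial_lms = {k: train_unigram_lm(u) for k, u in trial.items()}
                if accuracy(trial_lms, held_out) <= base:
                    kept.append(utt)  # keep: removing it does not help
            selected[c] = kept
    return selected
```

The same scaffolding carries over to n-gram models; only `train_unigram_lm` would change.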
In this paper, we investigate a number of robust indexing and retrieval methods in an effort to improve spoken document retrieval performance in the presence of speech recognition errors. In particular, we examine expanding the original query representation to include confusible terms; developing a new document-query retrieval measure based on approximate matching that is less sensitive to recognition errors; expanding the document representation to include multiple recognition hypotheses; modifying the original query using automatic relevance feedback to include new terms found in the top ranked documents; and combining information from multiple subword unit representations. We study the different methods individually and then explore the effects of combining them. Experiments on radio broadcast news data show that using a combination of these methods can improve retrieval performance by over 20%.
{"title":"Towards robust methods for spoken document retrieval","authors":"Kenney Ng","doi":"10.21437/ICSLP.1998-480","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-480","url":null,"abstract":"In this paper, we investigate a number of robust indexing and retrieval methods in an effort to improve spoken document retrieval performance in the presence of speech recognition errors. In particular, we examine expanding the original query representation to include confusible terms; developing a new document-query retrieval measure based on approximate matching that is less sensitive to recognition errors; expanding the document representation to include multiple recognition hypotheses; modifying the original query using automatic relevance feedback to include new terms found in the top ranked documents; and combining information from multiple subword unit representations. We study the different methods individually and then explore the effects of combining them. Experiments on radio broadcast news data show that using a combination of these methods can improve retrieval performance by over 20%.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116445824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
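One of the listed ideas, subword-unit representations that still match when the recognizer garbles a word, can be illustrated with character n-gram indexing. The scoring function below is a simplified stand-in for the paper's retrieval measures, not its actual formulation:

```python
def char_ngrams(word, n=3):
    """Decompose a word into overlapping character n-grams, with boundary markers."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def subword_score(query, document, n=3):
    """Fraction of query n-grams that occur anywhere in the document."""
    doc_grams = {g for word in document.split() for g in char_ngrams(word, n)}
    q_grams = [g for word in query.split() for g in char_ngrams(word, n)]
    return sum(g in doc_grams for g in q_grams) / len(q_grams) if q_grams else 0.0
```

A hypothetical recognition error such as "marathin" for "marathon" still shares five of the eight query trigrams, so the document is not lost to an exact-match failure.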
This paper examines the F0 variations during tone sandhi due to various prosodic factors such as phonation type, length, stress and pitch height. It will be shown that the F0 height and shape of the second syllable (S2) in disyllabic words are determined by the interaction of four conditions: the intervocalic consonant (C2) voicing, the S2 truncation, the F0 height of S1, and stress assignment. Using disyllabic words without C2, smoothly flowing F0 for Shanghai tonal rightward spreading is demonstrated. These contours can be taken to reflect the underlying tension of the vocal cords without influence from supraglottal effects, and represent a baseline in terms of which the perturbations observed on items with C2 can be understood.
{"title":"The microprosodics of tone sandhi in shanghai disyllabic compounds","authors":"X. Zhu","doi":"10.21437/ICSLP.1998-143","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-143","url":null,"abstract":"This paper examines the F0 variations during tone sandhi due to various prosodic factors such as phonation type, length, stress and pitch height. It will be shown that the F0 height and shape of the second syllable (S2) in disyllabic words are determined by the interaction of four conditions: the intervocalic consonant (C2) voicing, the S2 truncation, the F0 height of S1, and stress assignment. Using disyllabic words without C2, smoothly flowing F0 for Shanghai tonal rightward spreading is demonstrated. These contours can be taken to reflect the underlying tension of the vocal cords without influence from supraglottal effects, and represent a baseline in terms of which the perturbations observed on items with C2 can be understood.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116461279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a Bayesian constrained frequency warping technique. The Bayesian approach provides for inclusion of the prior information of the frequency warping parameter and for adjusting the search range in order to obtain the best warping factor dependent on HMMs. We introduce novel frequency warping (FWP) HMMs which are different warped versions of HMMs. Instead of frequency warping of the input speech we warp the spectrum of the HMMs. This is equivalent to HMMs which have both time and frequency warping capabilities. Experimentally FWP HMMs outperform the conventional constrained frequency warping approach. Furthermore, the best warping factor is estimated in two stages, a coarse stage followed by a fine stage. This method efficiently gauges the optimal warping factor and normalises the FWP HMMs.
{"title":"Bayesian constrained frequency warping HMMS for speaker normalisation","authors":"Ching-Hsiang Ho, S. Vaseghi, Aimin Chen","doi":"10.21437/ICSLP.1998-426","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-426","url":null,"abstract":"This paper presents a Bayesian constrained frequency warping technique. The Bayesian approach provides for inclusion of the prior information of the frequency warping parameter and for adjusting the search range in order to obtain the best warping factor dependent on HMMs. We introduce novel frequency warping (FWP) HMMs which are different warped versions of HMMs. Instead of frequency warping of the input speech we warp the spectrum of the HMMs. This is equivalent to HMMs which have both time and frequency warping capabilities. Experimentally FWP HMMs outperform the conventional constrained frequency warping approach. Furthermore, the best warping factor is estimated in two stages, a coarse stage followed by a fine stage. This method efficiently gauges the optimal warping factor and normalises the FWP HMMs.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122381448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
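The two-stage coarse-then-fine estimation of the warping factor can be sketched generically. Here `score_fn` stands in for the likelihood of the warped models given the data, and the grid bounds and step counts are illustrative assumptions, not values from the paper:

```python
def best_warp_factor(score_fn, lo=0.8, hi=1.2, coarse_steps=9, fine_steps=9):
    """Two-stage search for the warping factor: a coarse grid over [lo, hi],
    then a fine grid around the coarse optimum."""
    grid = lambda a, b, n: [a + (b - a) * i / (n - 1) for i in range(n)]
    coarse_best = max(grid(lo, hi, coarse_steps), key=score_fn)
    step = (hi - lo) / (coarse_steps - 1)
    # refine within one coarse step on either side of the coarse optimum
    return max(grid(max(lo, coarse_best - step),
                    min(hi, coarse_best + step), fine_steps), key=score_fn)
```

The two-stage scheme needs roughly `coarse_steps + fine_steps` likelihood evaluations instead of a single dense grid of comparable resolution.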
Lori S. Levin, Ann E. Thymé-Gobbel, A. Lavie, K. Ries, K. Zechner
This paper describes a 3-level manual discourse coding scheme that we have devised for manual tagging of the CallHome Spanish (CHS) and CallFriend Spanish (CFS) databases used in the CLARITY project. The goal of CLARITY is to explore the use of discourse structure in understanding conversational speech. The project combines empirical methods for dialogue processing with state-of-the-art LVCSR (using the JANUS recognizer). The three levels of the coding scheme are (1) a speech act level consisting of a tag set extended from DAMSL and Switchboard; (2) a dialogue game level defined by initiative and intention; and (3) an activity level defined within topic units. The manually tagged dialogues are used to train automatic classifiers. We present preliminary results for statement categorization, and give an in-progress report of automatic speech act classification and topic boundary identification.
{"title":"A discourse coding scheme for conversational Spanish","authors":"Lori S. Levin, Ann E. Thymé-Gobbel, A. Lavie, K. Ries, K. Zechner","doi":"10.21437/ICSLP.1998-492","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-492","url":null,"abstract":"This paper describes a 3-level manual discourse coding scheme that we have devised for manual tagging of the CallHome Spanish (CHS) and CallFriend Spanish (CFS) databases used in the CLARITY project. The goal of CLARITY is to explore the use of discourse structure in understanding conversational speech. The project combines empirical methods for dialogue processing with state-of-the-art LVCSR (using the JANUS recognizer). The three levels of the coding scheme are (1) a speech act level consisting of a tag set extended from DAMSL and Switchboard; (2) a dialogue game level defined by initiative and intention; and (3) an activity level defined within topic units. The manually tagged dialogues are used to train automatic classifiers. We present preliminary results for statement categorization, and give an in-progress report of automatic speech act classification and topic boundary identification.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122408828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work we set out to investigate the fundamental frequency boundaries of perception of the Taiwanese long tones. We are interested in how the variations in fundamental frequency affect the perception of linguistic tones in Taiwanese speech. Our investigation is adapted from similar studies of tones in Mandarin speech. As opposed to Mandarin tones, which can be perceived with little difficulty, the seven Taiwanese tones have a more subtle structure and are consequently harder to perceive successfully. The experimental results in this paper allow us to quantify these perceptual boundaries. The experiments consisted of a perception test involving over 150 Taiwanese subjects, where the task involved identifying the tone of the words played back in a random sequence. The stimuli consisted of a set of tone pairs and a selection of intermediate tone words obtained by linearly interpolating between the words of the tone pairs.
{"title":"Boundaries of perception of long tones in taiwanese speech","authors":"Fran H. L. Jian","doi":"10.21437/ICSLP.1998-454","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-454","url":null,"abstract":"In this work we set out to investigate the fundamental frequency boundaries of perception of the Taiwanese long tones. We are interested in how the variations in fundamental frequency affect the perception of linguistic tones in Taiwanese speech. Our investigation is adopted from similar studies of tones in Mandarin speech. As opposed to Mandarin tones that can be perceived with little difficulty the seven Taiwanese tones have a more subtle structure and are consequently harder to perceive successfully. The experimental results in this paper allow us to quantify these perceptual boundaries. The experiments consisted of a perception test involving over 150 Taiwanese subjects where the task involved identifying the tone of the words played back in a random sequence. The stimuli consisted of a set of tone pairs and a selection of intermediate tone words obtained by linearly interpolating between the words of the tone pairs.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122798343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
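The stimulus construction described above, linearly interpolating between the members of a tone pair, can be sketched as follows; the contour representation and step count are illustrative choices, not details from the paper:

```python
def interpolate_contours(f0_a, f0_b, n_steps):
    """Generate n_steps intermediate F0 contours between two equal-length
    contours, with weights spaced evenly strictly between 0 and 1."""
    assert len(f0_a) == len(f0_b), "contours must be time-aligned to the same length"
    return [[(1 - w) * a + w * b for a, b in zip(f0_a, f0_b)]
            for w in (k / (n_steps + 1) for k in range(1, n_steps + 1))]
```

Each intermediate contour can then be resynthesized onto the original syllable to produce one stimulus on the continuum between the two tones.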
The problem of removing channel effects from speech has generally been attacked by attempting to recover a time-varying filter which inverts the entire channel impulse response. We show that human listeners are insensitive to many channel conditions and that the human ear seems to respond primarily to discontinuities of the channel. As a result of these observations, a partial equalization is proposed in which the channel effects to which the ear is sensitive may be removed, without full inversion of the channel. In addition, it is shown that it is possible to build filters of arbitrary length which do not reduce speech intelligibility and do not produce annoying artifacts.
{"title":"A model for speech reverberation and intelligibility restoring filters","authors":"O. Kenny, D. Nelson","doi":"10.21437/ICSLP.1998-310","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-310","url":null,"abstract":"The problem of removing channel effects from speech has generally been attacked by attempting to recover a time-varying filter which inverts the entire channel impulse response. We show that human listeners are insensitive to many channel conditions and that the human ear seems to respond primarily to discontinuities of the channel. As a result of these observations, a partial equalization is proposed in which the channel effects to which the ear is sensitive may be removed, without full inversion of the channel. In addition, it is shown that it is possible to build filters of arbitrary length which do not reduce speech intelligibility and do not produce annoying artifacts.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114366566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Radio and television broadcasts consist of a continuous stream of data comprised of segments of different linguistic and acoustic natures, which poses challenges for transcription. In this paper we report on our recent work in transcribing broadcast news data [2, 4], including the problem of partitioning the data into homogeneous segments prior to word recognition. Gaussian mixture models are used to identify speech and non-speech segments. A maximum-likelihood segmentation/clustering process is then applied to the speech segments using GMMs and an agglomerative clustering algorithm. The clustered segments are then labeled according to bandwidth and gender. The recognizer is a continuous mixture density, tied-state cross-word context-dependent HMM system with a 65k trigram language model. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error on the Nov’97 unpartitioned evaluation test data was 18.5%.
{"title":"Partitioning and transcription of broadcast news data","authors":"J. Gauvain, L. Lamel, G. Adda","doi":"10.21437/ICSLP.1998-618","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-618","url":null,"abstract":"Radio and television broadcasts consist of a continuous stream of data comprised of segments of different linguistic and acoustic natures, which poses challenges for transcription. In this paper we report on our recent work in transcribing broadcast news data [2, 4], including the problem of partitioning the data into homogeneous segments prior to word recognition. Gaussian mixture models are used to identify speech and non-speech segments. A maximum-likelihood segmentation/clustering process is then applied to the speech segments using GMMs and an agglomerative clustering algorithm. The clustered segments are then labeled according to bandwidth and gender. The recognizer is a continuous mixture density, tied-state cross-word context-dependent HMM system with a 65k trigram language model. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error on the Nov’97 unpartitioned evaluation test data was 18.5%.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122102940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
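The speech/non-speech decision can be illustrated with a stripped-down version of the model-based labeling: a single diagonal-covariance Gaussian per class stands in for the paper's Gaussian mixture models, and the segmentation, clustering, and MLLR stages are omitted entirely:

```python
import numpy as np

def fit_diag_gaussian(frames):
    """Fit a diagonal-covariance Gaussian to feature frames (one row per frame)."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6  # variance floor

def loglik(frames, mu, var):
    """Per-frame log-likelihood under a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (frames - mu) ** 2 / var).sum(axis=1)

def label_frames(frames, speech_model, nonspeech_model):
    """Label each frame 1 (speech) or 0 (non-speech) by comparing log-likelihoods."""
    return (loglik(frames, *speech_model) > loglik(frames, *nonspeech_model)).astype(int)
```

A real system would replace each single Gaussian with a mixture trained by EM and smooth the frame-level decisions into segments before recognition.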
Topic adaptation for language modeling is concerned with adjusting the probabilities in a language model to better reflect the expected frequencies of topical words for a new document. The language model to be adapted is usually built from large amounts of training text and is considered representative of the current domain. In order to adapt this model for a new document, the topic (or topics) of the new document are identified. Then, the probabilities of words that are more likely to occur in the identified topic(s) than in general are boosted, and the probabilities of words that are unlikely for the identified topic(s) are suppressed. We present a novel technique for adapting a language model to the topic of a document, using a nonlinear interpolation of n-gram language models. A three-way, mutually exclusive division of the vocabulary into general, on-topic and off-topic word classes is used to combine word predictions from a topic-specific and a general language model. We achieve a slight decrease in perplexity and speech recognition word error rate on a Broadcast News test set using these techniques. Our results are compared to results obtained through linear interpolation of topic models.
{"title":"Nonlinear interpolation of topic models for language model adaptation","authors":"K. Seymore, Stanley F. Chen, R. Rosenfeld","doi":"10.21437/ICSLP.1998-667","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-667","url":null,"abstract":"Topic adaptation for language modeling is concerned with adjusting the probabilities in a language model to better reflect the expected frequencies of topical words for a new document. The language model to be adapted is usually built from large amounts of training text and is considered representative of the current domain. In order to adapt this model for a new document, the topic (or topics) of the new document are identified. Then, the probabilities of words that are more likely to occur in the identified topic(s) than in general are boosted, and the probabilities of words that are unlikely for the identified topic(s) are suppressed. We present a novel technique for adapting a language model to the topic of a document, using a nonlinear interpolation of n-gram language models. A three-way, mutually exclusive division of the vocabulary into general, on-topic and off-topic word classes is used to combine word predictions from a topic-specific and a general language model. We achieve a slight decrease in perplexity and speech recognition word error rate on a Broadcast News test set using these techniques. Our results are compared to results obtained through linear interpolation of topic models.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117310002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
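The three-way vocabulary split can be sketched on unigram probabilities as follows. The suppression factor and the rule of copying on-topic probabilities from the topic model are illustrative assumptions rather than the paper's exact nonlinear interpolation formula; the result is renormalized into a proper distribution:

```python
def adapt_topic_lm(general, topic, on_topic, off_topic, suppress=0.1):
    """Combine a general and a topic-specific unigram LM over a three-way
    vocabulary split, then renormalize into a proper distribution."""
    raw = {}
    for w, p in general.items():
        if w in on_topic:
            raw[w] = topic.get(w, p)    # boost: trust the topic model here
        elif w in off_topic:
            raw[w] = suppress * p       # suppress words unlikely for the topic
        else:
            raw[w] = p                  # general words are left unchanged
    z = sum(raw.values())
    return {w: p / z for w, p in raw.items()}
```

Because the combination applies different rules to different word classes rather than mixing the two models with fixed weights, it is nonlinear in the sense contrasted with linear interpolation above.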