"Speech recognition based on the distance calculation between intermediate phonetic code sequences in symbolic domain," Kazuyo Tanaka, Hiroaki Kojima. doi:10.21437/ICSLP.1998-297

This paper proposes a speech recognition method as an alternative to conventional sample-based statistical methods, which are characterized by the need for large amounts of training speech data. To reduce this heavy processing, the proposed method employs an intermediate phonetic code system and calculates distances between phonetic code sequences in the symbolic domain. It achieves high efficiency compared with direct processing of acoustic correlates, although some deterioration in recognition scores is to be expected. We first describe the distance calculation method and then present specific procedures for obtaining the intermediate code sequence from input utterances and for spotting words using the symbolic-domain distance calculation. Preliminary experiments were carried out on isolated word recognition and phrase spotting in continuous speech. The word recognition results indicate that the scores obtained by the proposed method are comparable to those of ordinary phone-HMM-based speech recognition.
"Enhanced ASR by acoustic feature filtering," C. Wellekens. doi:10.21437/ICSLP.1998-194

Several recent results demonstrate improved recognition scores when FIR filtering is applied to the trajectories of feature vectors. This paper presents a new approach in which the filter characteristics are trained together with the HMM parameters, resulting in recognition improvements in first tests. Reestimation formulas are derived for the cut-off frequencies of ideal low-pass filters as well as for the impulse response coefficients of a general FIR low-pass filter.
"Restoration of hyperbaric speech by correction of the formants and the pitch," Laure Charonnat, M. Guitton, J. Crestel, G. Allée. doi:10.21437/ICSLP.1998-519

This paper describes a hyperbaric speech processing algorithm that combines restoration of the formant positions with correction of the pitch. The pitch is corrected using a time-scale modification algorithm combined with an oversampling module. This operation not only shifts the fundamental frequency but also shifts the other frequencies of the signal. This shift, as well as the formant shift due to the hyperbaric environment, is corrected by the formant restoration module, which is based on the linear speech production model.
"Speech recognition via phonetically featured syllables," Simon King, T. A. Stephenson, S. Isard, P. Taylor, Alex Strachan. doi:10.21437/ICSLP.1998-531
Speech can be naturally described by phonetic features, such as a set of acoustic phonetic features or a set of articulatory features. This thesis establishes the effectiveness of using phonetic features in phoneme recognition by comparing a recogniser based on them to a recogniser using an established parametrisation as a baseline. The usefulness of phonetic features serves as the foundation for the subsequent modelling of syllables. Syllables are subject to fewer of the context-sensitivity effects that hamper phone-based speech recognition. I investigate the different questions involved in creating syllable models. After training a feature-based syllable recogniser, I compare the feature-based syllables against a baseline. To conclude, the feature-based syllable models are compared against the baseline phoneme models in word recognition. With the resultant feature-syllable models performing well in word recognition, the feature-syllables show their future potential for large vocabulary automatic speech recognition.
{"title":"Speech recognition via phonetically featured syllables","authors":"Simon King, T. A. Stephenson, S. Isard, P. Taylor, Alex Strachan","doi":"10.21437/ICSLP.1998-531","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-531","url":null,"abstract":"Speech can be naturally described by phonetic features, such as a set of acoustic phonetic features or a set of articulatory features. This thesis establi shes the effectiveness of using phonetic features in phoneme recognition by comparing a recogniser based on them to a recogniser using an established parametrisation as a baseline. The usefulness of phonetic features serves as the foundation for the subsequent modelling of syllables. Syllables are subject to fewer of the context-sensitivity effects that hamper phone-based speech recognition. I investigate the different questions involved in creating syllable models. After training a feature-based syllable recogniser, I compare the feature based syllables against a baseline. To conclude, the feature based syllable models are compared against the baseline phoneme models in word recognition. With the resultant feature-syllable models performing well in word recognition, the featuresyllables show their future potential for large vocabulary automatic speech recognition.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"52 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114025574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"On the use of F0 features in automatic segmentation for speech synthesis," Takashi Saito. doi:10.21437/ICSLP.1998-56

This paper focuses on a method for automatically dividing speech utterances into phonemic segments, which are used for constructing synthesis unit inventories for speech synthesis. We propose a new segmentation parameter called "dynamics of fundamental frequency (DF0)." In the fine structure of F0 contours, there are phonemic events that appear as local dips at phonemic transition regions, especially around voiced consonants. We apply this observation to speech segmentation: the DF0 parameter is used in the final stage of the segmentation procedure to refine the phonemic boundaries obtained roughly by DP alignment. We conduct experiments on the proposed automatic segmentation with a speech database prepared for unit inventory construction, and compare the obtained boundaries with those of manual segmentation to show the effectiveness of the proposed method. We also discuss the effects of the boundary refinement on the synthesized speech and summarize the results obtained here.
"Combination of confidence measures in isolated word recognition," Hans J. G. A. Dolfing, A. Wendemuth. doi:10.21437/ICSLP.1998-815

In the context of command-and-control applications, we exploit confidence measures in order to classify single-word utterances into two categories: utterances within the vocabulary which are recognized correctly, and other utterances, namely out-of-vocabulary (OOV) or misrecognized utterances. To this end, we investigate the classification error rate (CER) of several classes of confidence measures and transformations. In particular, we employed data-independent and data-dependent measures. The transformations we investigated include mapping to single confidence measures, LDA-transformed measures, and other linear combinations of these measures. These combinations are computed by means of neural networks trained with Bayes-optimal and with Gardner-Derrida-optimal criteria. Compared with a recognition system without confidence measures, the selection of (various combinations of) confidence measures and of suitable neural network architectures and training methods continuously improves the CER. Additionally, we found that a linear perceptron generalizes better than a non-linear backpropagation network.
"A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition," Chao Wang, S. Seneff. doi:10.21437/ICSLP.1998-140

Prosodic cues (namely fundamental frequency, energy, and duration) provide important information for speech. For a tonal language such as Chinese, fundamental frequency (F0) also plays a critical role in characterizing tone, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel F0 extraction algorithm is developed, and parameters based on an orthonormal decomposition of the F0 contour are extracted for tone recognition. Context dependency is expressed by “tri-tone” models clustered into broader classes. A 20.0% error rate is achieved for four-tone classification. Over a baseline recognition performance of 5.1% word error rate, we achieve 31.4% error reduction with duration models, 23.5% error reduction with tone models, and 39.2% error reduction with duration and tone models combined.
"Voice onset time patterns in 7-, 9- and 11-year old children," S. Whiteside, Jeni Marshall. doi:10.21437/ICSLP.1998-771

Voice onset time (VOT) is a key temporal feature in spoken language. There is some evidence to suggest that there are sex differences in VOT patterns. These sex differences could be attributed to sexual dimorphism of the vocal apparatus, and there is also some evidence to suggest that phonetic sex differences could be attributed to learned stylistic and linguistic factors. This study reports on an investigation of the VOT patterns for /p b t d/ in a group of thirty children aged 7 (n=10), 9 (n=10) and 11 (n=10) years, with equal numbers of girls (n=5) and boys (n=5) in each age group. The VOT data were examined for age and sex differences. Effects of age and sex, and age-by-sex interactions, were found. The results are presented and discussed.
"Cultural similarities and differences in the recognition of audio-visual speech stimuli," S. Shigeno. doi:10.21437/ICSLP.1998-268

Cultural similarities and differences in the recognition of emotion were compared between Japanese and North American subjects. Seven native Japanese subjects and five native North American subjects (four Americans and one Canadian) participated in the experiments. The materials were five meaningful words or short sentences in Japanese and American English. Japanese and American actors produced vocal and facial expressions intended to convey six basic emotions: happiness, surprise, anger, disgust, fear, and sadness. Three presentation conditions were used: auditory, visual, and audio-visual. The audio-visual stimuli were made by dubbing the auditory stimuli onto the visual stimuli. The results show that: (1) subjects can more easily recognize the vocal expression of a speaker who belongs to their own culture; (2) Japanese subjects are not good at recognizing “fear” in either the auditory-alone or the visual-alone condition; and (3) both Japanese and American subjects identify audio-visually incongruent stimuli more often with a visual label than with an auditory label. These results suggest that it is difficult to identify the emotion of a speaker from a different culture and that people predominantly use visual information to identify emotion.
"Source controlled variable bit-rate speech coder based on waveform interpolation," F. Plante, B. Cheetham, D. Marston, P. A. Barrett. doi:10.21437/ICSLP.1998-395

This paper describes a source-controlled variable bit-rate (SC-VBR) speech coder based on the concept of prototype waveform interpolation. The coder uses a four-mode classification: silence, voiced, unvoiced, and transition. These modes are detected after the speech has been decomposed into slowly evolving (SEW) and rapidly evolving (REW) waveforms. Voicing activity detection (VAD), the relative level of the SEW and REW, and the cross-correlation coefficient between characteristic waveform segments are used to make the classification. The encoding of the SEW components is improved using gender adaptation. In tests using conversational speech, the SC-VBR coder allows a compression factor of around 3. The VBR coder was evaluated against a fixed-rate 4.6 kbit/s PWI coder for clean and noisy speech and was found to perform better for male speech and for noisy speech.