We present new waveform-interpolation coding procedures which allow perfect reconstruction of the speech signal from the unquantized parameter set. Instead of using adaptive parameter extraction methods, we combine a time warping of the original signal with nonadaptive parameter extraction methods. The new coding structure has good performance at low bit rates and provides convergence to the original waveform with increasing rate.
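The time-warping step that replaces adaptive parameter extraction can be illustrated with a minimal resampling sketch (an assumption-laden toy, not the authors' implementation): the signal is evaluated at warped, monotonically increasing time instants via linear interpolation, so that pitch cycles are aligned before the nonadaptive, fixed-rate parameter extraction is applied.

```python
def time_warp(signal, warp):
    """Resample `signal` at the (monotonic) warped time instants `warp`
    using linear interpolation between neighbouring samples."""
    out = []
    n = len(signal)
    for t in warp:
        i = min(int(t), n - 2)   # left neighbour index, clamped
        frac = t - i             # fractional position between samples
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out
```

Because linear interpolation is invertible for a monotonic warp, the original timing can be restored from the warp function, which is what makes perfect reconstruction from the unquantized parameters possible in principle.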
W. Kleijn, Huimin Yang, E. Deprettere, "Waveform interpolation coding with pitch-spaced subbands," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-382
In this paper we analyse and compare a low dimensional linguistic representation of vowels with high dimensional prototypical vowel templates derived from a native Australian English speaker. We further perform the same analysis on Lebanese and Vietnamese accented English to investigate how differences due to accents impact on such a representation. In a low dimensional linguistic representation a vowel is characterised by articulatory tract parameters. To simplify the problem, the study is restricted to vowels that, notionally at least, involve a steady state articulation, i.e. a stable target configuration of tongue, lips and jaw between preceding and following articulatory transitions. Vowels are represented by the horizontal and vertical position of the part of the tongue involved in the key articulation of a particular vowel, e.g. high or low and front or back. To this is added lip posture, spread or rounded. Prototypical vowel templates are derived as follows. The sound pressure signal is parametrised by 12 mel-frequency cepstrum coefficients. At the centre of each phonetically labelled segment, 180-dimensional phone templates are extracted. For the group of short (/I/, /E/, /A/, /O/, /V/, /U/, /@/) and long vowels (/i:/, /e:/, /a:/, /o:/, /u:/, /@:/) we obtain vowel clusters by averaging over all templates of each vowel class and accent. The speech material is taken from the Australian National Database Of Spoken Language (ANDOSL). To compare the high dimensional vowel clusters derived from speech samples with the low dimensional prototypical vowels in the articulatory tract representation, we reduce the dimension with a multidimensional scaling transformation into a two dimensional space. Here, a linear transformation maps a high dimensional space onto a lower dimensional subspace by optimising the relative distances between data vectors. As important results we find:
i) /@/ and /@:/ are surrounded by the remaining vowels; ii) the overall structure and the relative distances between the prototypical vowels are very similar. Variations in the structure can be explained by the influence of native Australian English, Lebanese Arabic and South Vietnamese accents.
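The distance-preserving dimension reduction described above can be sketched with classical (linear) multidimensional scaling; a minimal numpy version, not necessarily the exact transformation used in the paper:

```python
import numpy as np

def classical_mds(X, k=2):
    """Project the row vectors of X to k dimensions, preserving pairwise
    Euclidean distances as well as a linear map allows (classical MDS)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                  # centring matrix
    B = -0.5 * J @ D2 @ J                                # Gram matrix
    w, V = np.linalg.eigh(B)                             # ascending eigenvalues
    top = np.argsort(w)[::-1][:k]                        # k largest
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
```

Applied to the 180-dimensional averaged vowel templates, the two MDS coordinates can then be compared directly against the two articulatory axes (tongue height and frontness).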
D. Dersch, Chris Cléirigh, Julie Vonwiller, "The influence of accents in Australian English vowels and their relation to articulatory tract parameters," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-208
This paper reports our experience in evaluating the ACCeSS system using the EAGLES evaluation metrics, both at the input/output level (black-box evaluation) and at the component level (glass-box evaluation). We provide an example of a complete evaluation of a continuous-speech, mixed-initiative system using these standards, and we discuss some useful extensions to them.
G. Hanrieder, Paul Heisterkamp, T. Brey, "Fly with the EAGLES: evaluation of the 'ACCeSS' spoken language dialogue system," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-75
L. So, Zhou Jing, "The acquisition of Putonghua phonology," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-768
In this paper, a new robust speech recognition algorithm based on multiple models and integrated decision (MMID) is proposed, and a parallel MMID (PMMID) algorithm is developed. With this algorithm, the advantages of different models can be integrated into one system. The algorithm uses different acoustic models at the same time, all based on DDBHMM (duration distribution based hidden Markov model) [2]. These models include a channel-mismatch-correction (CMC) model, a model with additional alternative pronunciations, tone and non-tone models for Mandarin Chinese speech, a voice activity detection (VAD) model, and a state-skip model. The recognition accuracy of the multi-model system is better than that of a single-model system in adverse environments. Experimental results show that the error rate of the recognition system is 2.9%, a reduction of 81% compared with the single-model baseline system.
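The integrated-decision idea can be sketched as follows (a toy combination rule with hypothetical weights and scores, not the paper's DDBHMM machinery): each model scores every hypothesis, and the system selects the hypothesis with the best weighted combined score.

```python
def integrated_decision(model_scores, weights=None):
    """model_scores: one dict per model, mapping hypothesis -> log score.
    Returns the hypothesis with the highest weighted sum of per-model
    scores, i.e. a simple integrated decision across all models."""
    if weights is None:
        weights = [1.0] * len(model_scores)
    combined = {}
    for w, scores in zip(weights, model_scores):
        for hyp, s in scores.items():
            combined[hyp] = combined.get(hyp, 0.0) + w * s
    return max(combined, key=combined.get)
```

A hypothesis that only one model favours can thus be outvoted by the others, which is the intuition behind the accuracy gain of the multi-model system in adverse environments.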
Shengxi Pan, Jia Liu, Jintao Jiang, Zuoying Wang, Dajin Lu, "A novel robust speech recognition algorithm based on multi-models and integrated decision method," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-334
In second language input studies, speaking speed is regarded as one of the most influential factors in comprehension. However, research in this area has mainly been conducted on written texts read aloud. The present study investigated temporal variables, such as articulation rate and the ratio and frequency of fillers and silent pauses, in three university lectures given in Japanese. It was found that the total duration ratio of fillers was as great as that of silent pauses. It also became clear that, for individual speakers, articulation rate and the frequency of fillers are relatively constant, while the frequency of silent pauses varies depending on the discourse section. Of total pause ratio, pause frequency and articulation rate, articulation rate correlated best with listener ratings of speech speed. The findings suggest that spontaneous speech requires methods of speech speed measurement different from those for read speech.
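The temporal variables in question can be computed from a labelled segmentation of the recording. A sketch, under the assumptions that each segment is a (start, end, kind) triple with kind one of 'speech', 'filler', or 'pause', and that articulation rate is syllables per second of speech time excluding pauses and fillers:

```python
def temporal_variables(segments, n_syllables):
    """segments: iterable of (start_sec, end_sec, kind), kind in
    {'speech', 'filler', 'pause'}.  Returns articulation rate
    (syllables per second of speech) and the total-duration ratios
    of fillers and silent pauses."""
    dur = {"speech": 0.0, "filler": 0.0, "pause": 0.0}
    for start, end, kind in segments:
        dur[kind] += end - start
    total = sum(dur.values())
    return {
        "articulation_rate": n_syllables / dur["speech"],
        "filler_ratio": dur["filler"] / total,
        "pause_ratio": dur["pause"] / total,
    }
```

Comparing `filler_ratio` against `pause_ratio` per lecture is the kind of measurement behind the finding that fillers occupy as much total time as silent pauses.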
Michiko Watanabe, "Temporal variables in lectures in the Japanese language," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-842
This paper presents a time-scale and pitch-scale modification technique for concatenative speech synthesis. The method is based on a frequency domain source-filter model, where the source is modeled as a mixed excitation. The model is tightly coupled with a compression scheme that results in compact acoustic inventories. When compared with the approach in the Whistler system, which uses no mixed excitation, the new method shows improvement on voiced fricatives and over-stretched voiced sounds. In addition, it allows for spectral manipulation such as smoothing of discontinuities at unit boundaries, voice transformation, or loudness equalization.
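The mixed-excitation source can be illustrated as a blend of a periodic pulse train and noise. A toy time-domain sketch (the paper's model mixes in the frequency domain with per-band voicing, which is omitted here):

```python
import random

def mixed_excitation(n_samples, pitch_period, voicing, seed=0):
    """Blend a periodic pulse train with Gaussian noise.
    voicing in [0, 1]: 1 -> pure pulse train, 0 -> pure noise."""
    rng = random.Random(seed)
    out = []
    for i in range(n_samples):
        pulse = 1.0 if i % pitch_period == 0 else 0.0
        noise = rng.gauss(0.0, 0.3)
        out.append(voicing * pulse + (1.0 - voicing) * noise)
    return out
```

Intermediate voicing values are what let such a source model voiced fricatives, which contain both periodic and noisy energy, better than a purely pulsed excitation.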
A. Acero, "A mixed-excitation frequency domain model for time-scale pitch-scale modification of speech," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-16
This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Turkish. Based on a global phoneme set we built different multilingual speech recognition systems for five of the 15 languages. Context dependent phoneme models are created data-driven by introducing questions about language and language groups to our polyphone clustering procedure. We apply the resulting multilingual models to unseen languages and present several recognition results in language independent and language adaptive setups.
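At each node, data-driven clustering of this kind chooses the question (here including questions about language or language group) that maximises the training-data likelihood gain. A one-dimensional, single-Gaussian sketch of that criterion, a deliberate simplification of actual polyphone clustering over high-dimensional acoustic statistics:

```python
import math

def gaussian_ll(values):
    """ML log-likelihood of 1-D data under a single Gaussian."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def split_gain(values, answers):
    """Likelihood gain from splitting the data by a yes/no question
    (e.g. 'is the language in this group?')."""
    yes = [v for v, a in zip(values, answers) if a]
    no = [v for v, a in zip(values, answers) if not a]
    return gaussian_ll(yes) + gaussian_ll(no) - gaussian_ll(values)
```

A language question wins a split only when it separates the data better than the competing phonetic-context questions, which is how the procedure decides where models can be shared across languages.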
Tanja Schultz, A. Waibel, "Language independent and language adaptive large vocabulary speech recognition," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-751
Cross-linguistically, focus is often cued by suprasegmental features and changes in phrasing. In this paper, phonetic and phonological markers of contrastive focus in Korean are investigated. We find that, as a phonological marker, focus initiates an accentual phrase (AP), and tends to, but does not always, include the following words in the same AP. But regardless of whether the post-focus sequence is dephrased or not, there is a significant expansion of the focused peak compared to the peak on the following words, thus achieving the perceptual goal of focus: prominence of the focused word relative to the following items. As a phonetic marker, a focused AP has extra-strengthening on its left edge, and the sequence before and after focus tends to be shorter than that in a neutral sentence.
Sun-Ah Jun, Hyuck-Joon Lee, "Phonetic and phonological markers of contrastive focus in Korean," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-151
In this paper, we propose to use an utterance length (duration) dependent threshold for rejecting an unknown input utterance with a general speech (garbage) model. A general speech model, compared with more sophisticated anti-subword models, is a more viable solution to the utterance rejection problem for low-cost applications with stringent storage and computational constraints. However, the rejection performance of such a general model with a fixed, universal rejection threshold is in general worse than that of anti-models with higher discrimination. Without adding complexity to the rejection algorithm, we propose to vary the rejection threshold according to the utterance length. Experimental results on a command phrase recognition task show that the proposed length dependent rejection threshold yields significant improvement in rejection performance over a fixed threshold: the equal error rate, a good figure of merit for calibrating the performance of utterance verification algorithms, is reduced by almost 23%.
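The proposal can be sketched as a rejection rule whose threshold is a function of utterance length. The linear interpolation shape and all endpoint values below are hypothetical illustrations, not the paper's tuned settings:

```python
def rejection_threshold(n_frames, short_len=50, long_len=300,
                        short_thr=0.8, long_thr=0.4):
    """Interpolate linearly between a stricter threshold for short
    utterances and a looser one for long utterances."""
    if n_frames <= short_len:
        return short_thr
    if n_frames >= long_len:
        return long_thr
    frac = (n_frames - short_len) / (long_len - short_len)
    return short_thr + frac * (long_thr - short_thr)

def accept(confidence, n_frames):
    """Accept the utterance only if its confidence score against the
    general speech (garbage) model beats the length-dependent threshold."""
    return confidence >= rejection_threshold(n_frames)
```

The same confidence score is thus judged more strictly for short utterances, where a general garbage model gives the least reliable evidence, without any change to the scoring algorithm itself.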
Sunil K. Gupta, F. Soong, "Improved utterance rejection using length dependent thresholds," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998). doi:10.21437/ICSLP.1998-425