Speech-to-text input method for web system using JavaScript
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777877
R. Nisimura, Jumpei Miyake, Hideki Kawahara, T. Irino
We have developed a speech-to-text input method for web systems. The system is provided as a JavaScript library comprising an Ajax-like mechanism based on a Java applet, CGI programs, and dynamic HTML documents. It allows users to access voice-enabled web pages without requiring special browsers. Web developers can embed it in their web pages by inserting a single line into the header field of an HTML document. This study also aims at observing natural spoken interactions in personal environments. We collected 4,003 inputs over a period of seven months via our public Japanese ASR server. To cover out-of-vocabulary words such as proper nouns, a web page for registering new words into the language model was developed. As a result, we obtained an improvement of 0.8% in recognition accuracy. With regard to acoustic conditions, an SNR of 25.3 dB was observed.
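As a rough illustration of the server side of such an Ajax-style pipeline, the sketch below is a minimal Python CGI program that accepts posted audio and returns the recognized text; the `recognize` stub and the JSON response format are our assumptions, not the authors' implementation. On the page side, the integration reduces to the single script include in the document header that the abstract mentions.

```python
#!/usr/bin/env python3
# Minimal sketch of a CGI endpoint for an Ajax-style speech input service.
# recognize() stands in for the paper's Japanese ASR back end (hypothetical;
# the abstract does not describe the server at this level of detail).
import sys
import json

def recognize(audio_bytes: bytes) -> str:
    """Placeholder for a real ASR engine call."""
    return "recognized text would go here"

def main() -> None:
    # The browser-side library (Java applet + JavaScript) would POST raw audio.
    audio = sys.stdin.buffer.read()
    result = {"text": recognize(audio)}
    # CGI response: headers, blank line, then the body.
    sys.stdout.write("Content-Type: application/json\r\n\r\n")
    sys.stdout.write(json.dumps(result, ensure_ascii=False))

if __name__ == "__main__":
    main()
```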
{"title":"Speech-to-text input method for web system using JavaScript","authors":"R. Nisimura, Jumpei Miyake, Hideki Kawahara, T. Irino","doi":"10.1109/SLT.2008.4777877","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777877","url":null,"abstract":"We have developed a speech-to-text input method for web systems. The system is provided as a JavaScript library including an Ajax-like mechanism based on a Java applet, CGI programs, and dynamic HTML documents. It allows users to access voice-enabled web pages without requiring special browsers. Web developers can embed it on their web page by inserting only one line in the header field of an HTML document. This study also aims at observing natural spoken interactions in personal environments. We have succeeded in collecting 4,003 inputs during a period of seven months via our public Japanese ASR server. In order to cover out-of-vocabulary words to cope with some proper nouns, a web page to register new words into the language model are developed. As a result, we could obtain an improvement of 0.8% in the recognition accuracy. With regard to the acoustical conditions, an SNR of 25.3 dB was observed.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131683310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accented Indian English ASR: Some early results
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777881
Kaustubh Kulkarni, Sailik Sengupta, V. Ramasubramanian, Josef G. Bauer, G. Stemmer
The effect of accent on the performance of automatic speech recognition (ASR) systems is a well-known problem. In this paper, we study the effect of accent variability on an Indian English ASR task. We evaluate the test vocabularies on HMMs trained on (a) accent-specific training data, (b) accent-pooled training data combining all of the accent-specific training data, and (c) accent-pooled training data reduced to match the size of the accent-specific training data. We demonstrate that the accent-pooled training set performs best on a phonetically rich isolated-word recognition task. However, the accent-specific HMMs perform better than the reduced accent-pooled HMMs, indicating a possible approach of using first-stage accent identification to choose the appropriate accent-trained HMMs for subsequent recognition.
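The two-stage idea suggested at the end — identify the accent first, then decode with the matching accent-specific models — might look like the following sketch. The diagonal-Gaussian accent scorer, the accent names, and the synthetic features are illustrative assumptions, not the systems evaluated in the paper.

```python
import numpy as np

# Illustrative first-stage accent identification: score utterance features
# under a per-accent diagonal Gaussian and route the utterance to the
# matching accent-specific recognizer (all models here are toy stand-ins).

def gaussian_loglik(frames: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return float(ll.sum())

def identify_accent(frames, accent_models):
    return max(accent_models, key=lambda a: gaussian_loglik(frames, *accent_models[a]))

rng = np.random.default_rng(0)
dim = 13  # e.g. an MFCC-like feature dimension
accent_models = {  # hypothetical accents; means/vars would be trained
    "accent_A": (np.zeros(dim), np.ones(dim)),
    "accent_B": (0.5 * np.ones(dim), np.ones(dim)),
}
utterance = rng.normal(0.5, 1.0, size=(200, dim))  # synthetic test frames

accent = identify_accent(utterance, accent_models)
print(f"route utterance to the {accent}-trained HMM set")
```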
{"title":"Accented Indian english ASR: Some early results","authors":"Kaustubh Kulkarni, Sailik Sengupta, V. Ramasubramanian, Josef G. Bauer, G. Stemmer","doi":"10.1109/SLT.2008.4777881","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777881","url":null,"abstract":"The problem of the effect of accent on the performance of Automatic Speech Recognition (ASR) systems is well known. In this paper, we study the effect of accent variability on the performance of the Indian English ASR task. We evaluate the test vocabularies on HMMs trained on (a) Accent specific training data (b) Accent pooled training data which combines all the accent specific training data (c) Accent pooled training data of reduced size matching the size of the accent specific training data. We demonstrate that the accent pooled training set performs the best on phonetically rich isolated word recognition task. But the accent specific HMMs perform better than the reduced accent pooled HMMs, indicating a possible approach of using a first stage accent identification to choose the correct accent trained HMMs for further recognition.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130046411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of a spoken dialogue system for controlling a Hifi audio system
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777859
F. Fernández-Martínez, Juan Blázquez, J. Ferreiros, R. Barra-Chicote, Javier Macias-Guarasa, J. M. Lucas
In this paper, a Bayesian network (BN) approach to dialogue modelling is evaluated in terms of a battery of both subjective and objective metrics. Significant effort has been put into improving the contextual information handling capabilities of the system. Consequently, besides typical usability measures such as task and dialogue completion rates and dialogue time, we have included a new figure measuring the contextuality of the dialogue as the number of turns in which contextual information is helpful for dialogue resolution. The evaluation is carried out over a set of predefined scenarios with different initiative styles, focusing on the impact of the user's level of experience.
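A concrete reading of that contextuality figure, under the assumption (ours, not the paper's) that each logged turn is annotated with whether contextual information helped resolve it, could be as simple as:

```python
# Sketch: contextuality as the count (or fraction) of turns where
# contextual information helped resolve the dialogue. The per-turn
# annotations are assumed to come from the dialogue logs.
turns = [
    {"user": "play something by Mozart", "context_helped": False},
    {"user": "a bit louder", "context_helped": True},   # needs prior context
    {"user": "skip this one", "context_helped": True},
]

contextual_turns = sum(t["context_helped"] for t in turns)
print(f"contextuality: {contextual_turns} of {len(turns)} turns "
      f"({contextual_turns / len(turns):.0%})")
```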
{"title":"Evaluation of a spoken dialogue system for controlling a Hifi audio system","authors":"F. Fernández-Martínez, Juan Blázquez, J. Ferreiros, R. Barra-Chicote, Javier Macias-Guarasa, J. M. Lucas","doi":"10.1109/SLT.2008.4777859","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777859","url":null,"abstract":"In this paper, a Bayesian networks, BNs, approach to dialogue modelling is evaluated in terms of a battery of both subjective and objective metrics. A significant effort in improving the contextual information handling capabilities of the system has been done. Consequently, besides typical dialogue measurement rates for usability like task or dialogue completion rates, dialogue time, etc. we have included a new figure measuring the contextuality of the dialogue as the number of turns where contextual information is helpful for dialogue resolution. The evaluation is developed through a set of predefined scenarios according to different initiative styles and focusing on the impact of the user's level of experience.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132025327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A syntactic language model based on incremental CCG parsing
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777876
Hany Hassan, K. Sima'an, Andy Way
Syntactically enriched language models (parsers) constitute a promising component in applications such as machine translation and speech recognition. To maintain a useful level of accuracy, existing parsers are non-incremental and must span a combinatorially growing space of possible structures as every input word is processed. This prohibits their incorporation into standard linear-time decoders. In this paper, we present an incremental, linear-time dependency parser based on Combinatory Categorial Grammar (CCG) and classification techniques. We devise a deterministic transform of CCGbank canonical derivations into incremental ones, and train our parser on this data. We find that a cascaded, incremental version provides an appealing balance between efficiency and accuracy.
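The incremental, classification-driven regime can be pictured as a single left-to-right pass in which a classifier decides, at each word, how to attach it, so the cost stays linear in sentence length. The sketch below uses a trivial stand-in "classifier" and plain word strings, so it illustrates the control flow rather than the paper's CCG-based model.

```python
# Schematic incremental dependency parsing: one attachment decision per
# input word, hence a linear-time pass. decide_attachment() is a
# placeholder for the trained classifier over CCG-derived features.

def decide_attachment(stack, word):
    """Toy stand-in for the classifier: attach to the most recent word."""
    return len(stack) - 1 if stack else None

def parse_incrementally(words):
    stack, arcs = [], []
    for i, word in enumerate(words):
        head = decide_attachment(stack, word)
        if head is not None:
            arcs.append((stack[head][0], i))  # (head index, dependent index)
        stack.append((i, word))
    return arcs

print(parse_incrementally("we present an incremental parser".split()))
```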
{"title":"A syntactic language model based on incremental CCG parsing","authors":"Hany Hassan, K. Sima'an, Andy Way","doi":"10.1109/SLT.2008.4777876","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777876","url":null,"abstract":"Syntactically-enriched language models (parsers) constitute a promising component in applications such as machine translation and speech-recognition. To maintain a useful level of accuracy, existing parsers are non-incremental and must span a combinatorially growing space of possible structures as every input word is processed. This prohibits their incorporation into standard linear-time decoders. In this paper, we present an incremental, linear-time dependency parser based on Combinatory Categorial Grammar (CCG) and classification techniques. We devise a deterministic transform of CCG-bank canonical derivations into incremental ones, and train our parser on this data. We discover that a cascaded, incremental version provides an appealing balance between efficiency and accuracy.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114517475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Class-based named entity translation in a speech to speech translation system
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777888
S. Maskey, Martin Cmejrek, Bowen Zhou, Yuqing Gao
Named entity (NE) translation is a challenging problem in machine translation (MT). Most of the bi-text training corpora for MT lack enough samples of NEs to cover the wide variety of contexts in which NEs can appear. In this paper, we present a technique to translate NEs based on their NE types, in addition to a phrase-based translation model. Our NE translation model is based on a syntax-based system similar to the work of Chiang (2005), but we produce syntax-based rules with NE types as non-terminals instead of general non-terminals. Such class-based rules allow us to better generalize the contexts in which NEs appear. We show that the proposed method obtains an improvement of 0.66 BLEU points absolute, as well as 0.26% in F1-measure, over the phrase-based baseline on the NE test set.
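To make the class-based rule idea concrete, the sketch below applies a rule whose non-terminal is an NE type to a source sentence in which the NE span has already been tagged. The rule, tags, and language pair are invented for illustration; the point is that one rule generalizes over every entity of that type, which a purely lexical phrase table cannot do.

```python
# Sketch: a translation rule with an NE-type non-terminal. A rule like
# "<PERSON> said" -> "<PERSON> a dit" covers any PERSON span. All entries
# here are toy data, not the paper's extracted rules.
rules = {
    ("$PERSON", "said"): ("$PERSON", "a", "dit"),
}
ne_translations = {  # NE-type-specific translation (often a pass-through)
    "PERSON": lambda span: span,
}

def translate(tokens, ne_spans):
    # ne_spans maps token index -> NE type; abstract NEs to type symbols.
    abstracted = tuple("$" + ne_spans[i] if i in ne_spans else t
                       for i, t in enumerate(tokens))
    out = []
    if abstracted in rules:
        for sym in rules[abstracted]:
            if sym.startswith("$"):
                ne_type = sym[1:]
                src = [tokens[i] for i in ne_spans if ne_spans[i] == ne_type]
                out.extend(ne_translations[ne_type](src))
            else:
                out.append(sym)
    return " ".join(out)

print(translate(["Chiang", "said"], {0: "PERSON"}))  # -> "Chiang a dit"
```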
{"title":"Class-based named entity translation in a speech to speech translation system","authors":"S. Maskey, Martin Cmejrek, Bowen Zhou, Yuqing Gao","doi":"10.1109/SLT.2008.4777888","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777888","url":null,"abstract":"Named entity (NE) translation is a challenging problem in machine translation (MT). Most of the training bi-text corpora for MT lack enough samples of NEs to cover the wide variety of contexts NEs can appear in. In this paper, we present a technique to translate NEs based on their NE types in addition to a phrase-based translation model. Our NE translation model is based on a syntax-based system similar to the work of Chiang (2005); but we produce syntax-based rules with non-terminals as NE types instead of general non-terminals. Such class-based rules allow us to better generalize the context NEs. We show that our proposed method obtains an improvement of 0.66 BLEU score absolute as well as 0.26% in F1-measure over the baseline of phrase-based model in NE test set.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123634333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using prior knowledge to assess relevance in speech summarization
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777867
Ricardo Ribeiro, David Martins de Matos
We explore the use of automatically acquired, topic-based prior knowledge in speech summarization, assessing its influence across several term weighting schemes. All information is combined using latent semantic analysis as the core procedure to compute the relevance of the sentence-like units of the given input source. Evaluation is performed using the self-information measure, which tries to capture the informativeness of the summary in relation to the summarized input source. The similarity of the output summaries of the several approaches is also analyzed.
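As a reminder of how LSA-based relevance scoring typically works (a standard construction, not necessarily the authors' exact pipeline): build a term-by-sentence matrix, take its SVD, and score each sentence-like unit by its weight in the leading latent topics. The toy corpus and raw-count weighting below are assumptions; the paper combines several term weighting schemes at exactly this step.

```python
import numpy as np

# Sketch of LSA relevance scoring: SVD of a term-by-sentence matrix;
# each sentence (column) is scored by its loading on the top latent
# topic(s), following the usual Steinberger-Jezek style formulation.
sentences = [
    "the committee approved the budget",
    "the budget covers road repairs",
    "weather was pleasant yesterday",
]
vocab = sorted({w for s in sentences for w in s.split()})
A = np.array([[s.split().count(w) for s in sentences] for w in vocab], float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 1  # number of latent topics to keep
relevance = np.sqrt((S[:k, None] ** 2 * Vt[:k] ** 2).sum(axis=0))

for score, s in sorted(zip(relevance, sentences), reverse=True):
    print(f"{score:.3f}  {s}")
```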
{"title":"Using prior knowledge to assess relevance in speech summarization","authors":"Ricardo Ribeiro, David Martins de Matos","doi":"10.1109/SLT.2008.4777867","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777867","url":null,"abstract":"We explore the use of topic-based automatically acquired prior knowledge in speech summarization, assessing its influence throughout several term weighting schemes. All information is combined using latent semantic analysis as a core procedure to compute the relevance of the sentence-like units of the given input source. Evaluation is performed using the self-information measure, which tries to capture the informativeness of the summary in relation to the summarized input source. The similarity of the output summaries of the several approaches is also analyzed.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124939076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods for improving the quality of syllable based speech synthesis
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777832
Y. R. Venugopalakrishna, M. V. Vinodh, H. Murthy, C. S. Ramalingam
Our earlier work [1] on speech synthesis has shown that syllables can produce reasonably natural-sounding speech. Nevertheless, audible artifacts are present due to discontinuities in pitch, energy, and formant trajectories at the points where units are joined. In this paper, we present some minimal signal modification techniques for reducing these artifacts.
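One minimal modification of the kind the abstract alludes to is a short cross-fade across the join, which removes an energy discontinuity while leaving most of each unit untouched. The sketch below is a generic overlap-add, offered as context rather than the paper's specific technique (which also addresses pitch and formant trajectories).

```python
import numpy as np

# Sketch: smooth a unit join by cross-fading the last few milliseconds of
# the left unit with the first few of the right unit (generic overlap-add).
def crossfade_join(left: np.ndarray, right: np.ndarray, overlap: int) -> np.ndarray:
    fade = np.linspace(1.0, 0.0, overlap)
    blended = left[-overlap:] * fade + right[:overlap] * (1.0 - fade)
    return np.concatenate([left[:-overlap], blended, right[overlap:]])

sr = 16000
t = np.arange(sr // 10) / sr
unit_a = 0.8 * np.sin(2 * np.pi * 220 * t)   # toy "syllable" waveforms with
unit_b = 0.3 * np.sin(2 * np.pi * 220 * t)   # an amplitude mismatch at the join
joined = crossfade_join(unit_a, unit_b, overlap=sr // 100)  # 10 ms fade
print(joined.shape)
```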
{"title":"Methods for improving the quality of syllable based speech synthesis","authors":"Y. R. Venugopalakrishna, M. V. Vinodh, H. Murthy, C. S. Ramalingam","doi":"10.1109/SLT.2008.4777832","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777832","url":null,"abstract":"Our earlier work [1] on speech synthesis has shown that syllables can produce reasonably natural quality speech. Nevertheless, audible artifacts are present due to discontinuities in pitch, energy, and formant trajectories at the joining point of the units. In this paper, we present some minimal signal modification techniques for reducing these artifacts.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127221713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive filtering for high quality HMM based speech synthesis
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777835
L. Coelho, D. Braga
In this work, an adaptive filtering scheme based on dual Discrete Kalman Filtering (DKF) is proposed for enhancing the quality of Hidden Markov Model (HMM) based speech synthesis. The objective is to improve signal smoothness across HMMs and their related states and to reduce artifacts due to the acoustic model's limitations. Both speech and artifacts are modelled by an autoregressive structure, which provides an underlying time-frame dependency and improves time-frequency resolution. The model parameters are arranged to obtain a combined state-space model and are also used to compute instantaneous power spectral density estimates. The quality enhancement is performed by a dual discrete Kalman filter that simultaneously estimates the models and the signals. The system's performance has been evaluated using mean opinion score tests, and the proposed technique has led to improved results.
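For readers who want the state-space picture, here is a minimal single-channel Kalman filter over an AR(1) signal observed in noise. It is a sketch under our own assumptions: the paper's dual formulation additionally estimates the AR parameters online, whereas this example fixes them in advance.

```python
import numpy as np

# Minimal Kalman filter for an AR(1) signal observed in noise:
#   x[t] = a * x[t-1] + w[t],   y[t] = x[t] + v[t]
# The paper's dual DKF also tracks the AR coefficients; here `a` is fixed.
rng = np.random.default_rng(1)
a, q, r = 0.95, 0.1, 0.5          # AR coefficient, process var, obs var
x = np.zeros(300)
for t in range(1, 300):
    x[t] = a * x[t - 1] + rng.normal(0, np.sqrt(q))
y = x + rng.normal(0, np.sqrt(r), size=x.shape)  # noisy observations

x_est, p = 0.0, 1.0
smoothed = []
for obs in y:
    # Predict one step ahead under the AR(1) model.
    x_pred, p_pred = a * x_est, a * a * p + q
    # Update with the new observation.
    k = p_pred / (p_pred + r)
    x_est = x_pred + k * (obs - x_pred)
    p = (1 - k) * p_pred
    smoothed.append(x_est)

print(f"raw MSE {np.mean((y - x) ** 2):.3f} -> "
      f"filtered MSE {np.mean((np.array(smoothed) - x) ** 2):.3f}")
```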
{"title":"Adaptive filtering for high quality hmm based speech synthesis","authors":"L. Coelho, D. Braga","doi":"10.1109/SLT.2008.4777835","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777835","url":null,"abstract":"In this work an adaptive filtering scheme based on a dual Discrete Kalman Filtering (DKF) is proposed for Hidden Markov Model (HMM) based speech synthesis quality enhancement. The objective is to improve signal smoothness across HMMs and their related states and to reduce artifacts due to acoustic model's limitations. Both speech and artifacts are modelled by an autoregressive structure which provides an underlying time frame dependency and improves time-frequency resolution. Themodel parameters are arranged to obtain a combined state-space model and are also used to calculate instantaneous power spectral density estimates. The quality enhancement is performed by a dual discrete Kalman filter that simultaneously gives estimates for the models and the signals. The system's performance has been evaluated using mean opinion score tests and the proposed technique has led to improved results.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127482537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving word segmentation for Thai speech translation
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777885
Paisarn Charoenpornsawat, Tanja Schultz
A vocabulary list and a language model are primary components of a speech translation system. Generating both from plain text is a straightforward task for English. However, it is quite challenging for Chinese, Japanese, or Thai, which provide no word segmentation, i.e., the text has no word boundary delimiters. For Thai word segmentation, maximal matching, a lexicon-based approach, is one of the most popular methods. Nevertheless, this method relies heavily on the coverage of the lexicon. When the text contains an unknown word, it usually produces a wrong boundary, and when words are then extracted from the segmented text, some will not be retrieved because of the wrong segmentation. In this paper, we propose statistical techniques to tackle this problem. Based on different word segmentation methods, we develop various speech translation systems and show that the proposed method can significantly improve translation accuracy, by about 6.42 BLEU points, compared to the baseline system.
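For reference, the lexicon-based baseline can be sketched as below, here in its greedy longest-match form (full maximal matching explores all segmentations and prefers the one with the fewest words, but the lexicon-coverage weakness is the same). The toy Latin-script lexicon is our assumption; real Thai text has no spaces, which is exactly the situation the scan handles.

```python
# Greedy longest-match segmentation against a lexicon. Note how the unknown
# prefix "slt" falls back to single characters -- the failure mode that the
# paper's statistical techniques are designed to address.
LEXICON = {"speech", "translation", "for", "thai"}

def maximal_matching(text: str, lexicon: set[str]) -> list[str]:
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest span first
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:                                  # no lexicon entry matched
            words.append(text[i])              # emit one character, move on
            i += 1
    return words

print(maximal_matching("speechtranslationforthai", LEXICON))
print(maximal_matching("sltspeech", LEXICON))  # unknown prefix -> s, l, t
```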
{"title":"Improving word segmentation for Thai speech translation","authors":"Paisarn Charoenpornsawat, Tanja Schultz","doi":"10.1109/SLT.2008.4777885","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777885","url":null,"abstract":"A vocabulary list and language model are primary components in a speech translation system. Generating both from plain text is a straightforward task for English. However, it is quite challenging for Chinese, Japanese, or Thai which provide no word segmentation, i.e. the text has no word boundary delimiter. For Thai word segmentation, maximal matching, a lexicon-based approach, is one of the popular methods. Nevertheless this method heavily relies on the coverage of the lexicon. When text contains an unknown word, this method usually produces a wrong boundary. When extracting words from this segmented text, some words will not be retrieved because of wrong segmentation. In this paper, we propose statistical techniques to tackle this problem. Based on different word segmentation methods we develop various speech translation systems and show that the proposed method can significantly improve the translation accuracy by about 6.42% BLEU points compared to the baseline system.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114588774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Corpus-based synthesis of Mandarin speech with F0 contours generated by superposing tone components on rule-generated phrase components
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777833
K. Hirose, Qinghua Sun, N. Minematsu
Mandarin speech synthesis was conducted by generating prosodic features with the proposed method and segmental features with an HMM-based method. The proposed method generates sentence fundamental frequency (F0) contours by representing them as a superposition of tone components on phrase components. The tone components are realized by concatenating their fragments at tone nuclei predicted by a corpus-based method, while the phrase components are generated by rules under the generation process model (F0 model) framework. As a first step, the method predicts phoneme/pause durations with a statistical method. A listening test on the quality of the synthetic speech showed that the method yields better quality than the full HMM-based method, and also better quality than generating F0 contours without the superpositional scheme.
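The superposition itself is simple to state: in the log-F0 domain, the output contour is the sum of a slowly varying, rule-generated phrase component and locally concatenated tone components. The sketch below shows that composition with made-up components; in the paper, the phrase component follows the generation process (F0) model and the tone fragments are predicted from a corpus.

```python
import numpy as np

# Sketch: F0 contour as phrase component + tone components in log-F0.
# The decaying phrase curve and per-syllable tone shapes are invented
# stand-ins for the rule-generated and corpus-predicted components.
n = 400                                   # frames in the sentence
t = np.linspace(0.0, 1.0, n)
log_f0_base = np.log(120.0)               # speaker baseline, Hz

phrase = 0.4 * np.exp(-3.0 * t)           # slow declination over the phrase
tone = np.zeros(n)
for start, shape in [(0, 0.15), (100, -0.1), (200, 0.2), (300, -0.05)]:
    tone[start:start + 100] += np.linspace(0.0, shape, 100)  # toy fragment

f0 = np.exp(log_f0_base + phrase + tone)  # superpose, convert back to Hz
print(f"F0 range: {f0.min():.1f}-{f0.max():.1f} Hz")
```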
{"title":"Corpus-based synthesis of Mandarin speech with F0 contours generated by superposing tone components on rule-generated phrase components","authors":"K. Hirose, Qinghua Sun, N. Minematsu","doi":"10.1109/SLT.2008.4777833","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777833","url":null,"abstract":"Mandarin speech synthesis was conducted by generating prosodic features by the proposed method and segmental features by HMM-based method. The proposed method generates sentence fundamental frequency (F0) contours by representing them as a superposition of tone components on phrase components. The tone components are realized by concatenating their fragments at tone nuclei predicted by a corpus-based method, while the phrase components are generated by rules under the generation process model (F0 model) framework. The method includes prediction of phoneme/pause durations in a statistical method as the first step. Through a listening test on the quality of synthetic speech, it was shown that a better quality was obtainable by the method as compared to that by the full HMM-based method. It was also shown that a better quality is obtainable as compared to the case of generating F0 contours without super-positional scheme.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}