Using automatic speech recognition and its possible effects on the voice (doi:10.21437/ICSLP.1998-778)
C. D. Bruijn, S. Whiteside, P. Cudd, D. Syder, K. Rosen, L. Nord
Literature and individual reports suggest that the use of speech-recognition-based human-computer interfaces could potentially lead to vocal fatigue, or even to symptoms associated with dysphonia. As more and more people opt for a speech-driven computer interface as an alternative input method to the keyboard, and these speech recognition systems become more and more widely used in both home and office environments, it has become necessary to quantify any potential risks of voice damage. This study reports on ongoing research investigating acoustic changes in the voice after use of a discrete speech recognition system. Acoustic analyses were carried out on two Swedish users of such a system. So far, for one of the users, two of the acoustic parameters under investigation that could be indicators of vocal fatigue show a significant difference directly before and after use of a speech recognition system.
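The abstract does not name the two acoustic parameters. Perturbation measures such as jitter are commonly used indicators of vocal fatigue, so a minimal, purely illustrative sketch of such a measure might look as follows; all names and values below are hypothetical, not the authors' method.

```python
# Hypothetical sketch: local jitter, a cycle-to-cycle pitch-period
# perturbation measure often used as a vocal-fatigue indicator.
# This is an illustrative assumption; the paper does not name its parameters.

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    normalised by the mean period (dimensionless)."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_abs_diff / mean_period

# Example: pitch periods (in seconds) extracted from a sustained vowel
# recorded before and after a dictation session (values invented).
before = [0.0080, 0.0081, 0.0080, 0.0079, 0.0080]
after = [0.0080, 0.0084, 0.0077, 0.0083, 0.0078]
print(f"jitter before: {local_jitter(before):.4f}")
print(f"jitter after:  {local_jitter(after):.4f}")
```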
{"title":"Using automatic speech recognition and its possible effects on the voice","authors":"C. D. Bruijn, S. Whiteside, P. Cudd, D. Syder, K. Rosen, L. Nord","doi":"10.21437/ICSLP.1998-778","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-778","url":null,"abstract":"Literature and individual reports contain indications that the use of speech recognition based human computer interfaces could potentially lead to vocal fatigue, or even to symptoms associated with dysphonia. While more and more people opt for a speech driven computer interface as an alternative input method to the keyboard, and these speech recognition systems become more and widely used, both in the home and office environment, it has become necessary to qualify any potential risks of voice damage. This study reports about ongoing research that investigates acoustic changes in the voice, after use of a discrete speech recognition system. Acoustic analyses were carried out on two Swedish users of such a system. So far, for one of the users, two of the acoustic parameters under investigation that could be an indicator of vocal fatigue, show a significant difference directly before and after use of a speech recognition system.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129469558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The intellimedia workbench - a generic environment for multimodal systems (doi:10.21437/ICSLP.1998-266)
T. Brøndsted, Lars Bo Larsen, Michael Manthey, P. Kevitt, T. Moeslund, Kristian G. Olesen
This paper presents a generic environment for intelligent multimedia applications, denoted “The Intellimedia Workbench”. The aim of the workbench is to facilitate development and research within the field of multimodal user interaction. Physically, it is a table with various devices mounted above and around it: a camera and a laser projector mounted above the workbench, a microphone array mounted on the walls of the room, a speech recogniser, and a speech synthesiser. The camera is attached to a vision system capable of locating various objects placed on the workbench. The paper presents two applications utilising the workbench. One is a campus information system, allowing the user to ask for directions within a part of the university campus. The second is a pool trainer, intended to provide guidance to novice players.
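The paper publishes no API, but the module wiring described above can be pictured with a small mediator sketch: the vision system reports object positions on the table, and a central component routes a recognised query to the laser projector and the synthesiser. All class and method names below are hypothetical.

```python
# Illustrative sketch of how the workbench devices might be mediated.
# Names are hypothetical; the paper does not define this interface.

class VisionSystem:
    def locate_objects(self):
        # would return {object_id: (x, y)} table coordinates from the camera
        return {"building_a4": (120, 340)}

class LaserProjector:
    def point_at(self, xy):
        print(f"laser -> {xy}")

class Synthesiser:
    def say(self, text):
        print(f"TTS: {text}")

class Workbench:
    """Mediator that routes a recognised user query to the output devices."""
    def __init__(self):
        self.vision = VisionSystem()
        self.laser = LaserProjector()
        self.tts = Synthesiser()

    def handle_query(self, object_id):
        positions = self.vision.locate_objects()
        if object_id in positions:
            self.laser.point_at(positions[object_id])
            self.tts.say(f"{object_id} is here.")
        else:
            self.tts.say(f"I cannot see {object_id} on the table.")

Workbench().handle_query("building_a4")
```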
{"title":"The intellimedia workbench - a generic environment for multimodal systems","authors":"T. Brøndsted, Lars Bo Larsen, Michael Manthey, P. Kevitt, T. Moeslund, Kristian G. Olesen","doi":"10.21437/ICSLP.1998-266","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-266","url":null,"abstract":"The present paper presents a generic environment for intelligent multi media applications, denoted “The Intellimedia Work-Bench”. The aim of the workbench is to facilitate development and research within the field of multi modal user interaction. Physically it is a table with various devices mounted above and around. These include: A camera and a laser projector mounted above the workbench, a microphone array mounted on the walls of the room, a speech recogniser and a speech synthesiser. The camera is attached to a vision system capable of locating various objects placed on the workbench. The paper presents two applications utilising the workbench. One is a campus information system, allowing the user to ask for directions within a part of the university campus. The second application is a pool trainer, intended to provide guidance to novice players.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129652726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamical spectrogram, an aid for the deaf (doi:10.21437/ICSLP.1998-788)
A. Soltani-Farani, E. Chilton, R. Shirley
Visual perception of speech through spectrogram reading has long been a subject of research as an aid for the deaf or hearing impaired. Attributing the lack of success of this type of visual aid mainly to the static form of information presented by spectrograms, this paper proposes a system of dynamic visualisation for speech sounds. The system samples a high-resolution, auditory-based spectrogram with a window of 20 milliseconds duration and, by exploiting the periodicity of the input sound, produces a phase-locked sequence of images. This sequence is then animated at a rate of 50 images per second to produce a movie-like image displaying both the time-varying and time-independent information of the underlying sound. Results of several preliminary experiments evaluating the potential usefulness of the system for the deaf, undertaken by normal-hearing subjects, support the quick learning and persistence of the gestures for small sets of single words and motivate further investigation.
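A minimal sketch of the frame-extraction stage described above (20 ms analysis windows, animated at 50 images per second), assuming NumPy. The paper phase-locks each window to the pitch period of the input; that step is omitted here for brevity, so this is an approximation rather than the authors' exact algorithm.

```python
# Approximate sketch: 20 ms windows, one new spectral image per 1/50 s.
# Phase-locking to the pitch period (as in the paper) is omitted.
import numpy as np

def spectrogram_frames(signal, fs, win_ms=20.0, fps=50):
    win = int(fs * win_ms / 1000)       # 20 ms window length in samples
    hop = int(fs / fps)                 # step so playback at 50 fps is real-time
    window = np.hanning(win)
    frames = []
    for start in range(0, len(signal) - win, hop):
        seg = signal[start:start + win] * window
        mag = np.abs(np.fft.rfft(seg))  # magnitude spectrum of this frame
        frames.append(20 * np.log10(mag + 1e-10))  # dB scale
    return np.array(frames)             # shape: (n_images, n_bins)

fs = 16000
t = np.arange(fs) / fs
frames = spectrogram_frames(np.sin(2 * np.pi * 200 * t), fs)
print(frames.shape)  # roughly 50 images for one second of audio
```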
{"title":"Dynamical spectrogram, an aid for the deaf","authors":"A. Soltani-Farani, E. Chilton, R. Shirley","doi":"10.21437/ICSLP.1998-788","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-788","url":null,"abstract":"Visual perception of speech through spectrogram reading has long been a subject of research, as an aid for the deaf or hearing im-paired. Attributing the lack of success in this type of visual aids mainly to the static form of information presented by the spectrograms, this paper proposes a system of dynamic visualisation for speech sounds. This system samples a high resolved, auditory-based spectrogram, with a window of 20 milliseconds duration, so that exploiting the periodicity of the input sound, it produces a phase-locked sequence of images. This sequence is then animated at a rate of 50 images per second to produce a movie-like image displaying both the time-varying and time-independent information of the underlying sound. Results of several preliminary experiments for evaluation of the potential usefulness of the system for the deaf, undertaken by normal-hearing subjects, support the quick learning and persistence of the gestures for small sets of single words and motivate further investigations.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129669510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The acoustic and perceptual features of tone in the tibeto-burman language ao naga (doi:10.21437/ICSLP.1998-102)
A. Coupe
The tonemes of the Waromung Mongsen dialect of Ao Naga, a Tibeto-Burman language of northeast India, are described with respect to their auditory and acoustic features. Even though rather small F0 differences are found to separate the contrasting tonemes, the results of a perception test nevertheless demonstrate that these small differences are perceptually salient to a native speaker and are readily identifiable. The dialects demonstrate varying degrees of phonological, morphological and lexical divergence. A preliminary survey suggests that every village speaks its own variety; native speakers report that the unique village-specific characteristics of each variety serve as shibboleths identifying their speakers' origin.
{"title":"The acoustic and perceptual features of tone in the tibeto-burman language ao naga","authors":"A. Coupe","doi":"10.21437/ICSLP.1998-102","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-102","url":null,"abstract":"The tonemes of the Waromung Mongsen dialect of Ao Naga, a Tibeto-Burman of northeast India, are described with respect to their auditory and acoustic features. Even though rather small FO differences are found to separate each contrasting toneme, the results of a perception test nevertheless demonstrate that these small differences are perceptually salient to a native speaker and are readily identifiable. Each two demonstrate varying degrees of phonological, morphological and lexical divergence. A preliminary survey suggests that every village speaks its own variety; native speakers report that the unique village-specific characteristics of each variety serve as shibboleths to identify their speakers’ origin. Tonal across","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130421694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the effects of speech rate upon parameters of the command-response model for the fundamental frequency contours of speech (doi:10.21437/ICSLP.1998-131)
S. Ohno, H. Fujisaki, Yoshikazu Hara
A command-response model for the process of F0 contour generation has been presented by Fujisaki and his coworkers. The present paper describes the results of a study on the variability and speech-rate dependency of the model's parameters in utterances of a speaker of Japanese. It was found that the parameters α and β can be considered practically constant at a given speech rate, while Fb may vary slightly from utterance to utterance. Among these three parameters, only α was found to have a small but systematic tendency to increase with speech rate.
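For reference, the model in question is the well-known superpositional formulation by Fujisaki and coworkers: the log F0 contour is the sum of a baseline value Fb, phrase components with magnitudes Ap and onset times T0, and accent components with amplitudes Aa active between T1 and T2. The parameters α and β studied here are the time constants of the phrase and accent control mechanisms.

```latex
% Command-response (Fujisaki) model: log F0 as baseline plus
% superposed phrase and accent components.
\[
  \ln F_0(t) = \ln F_b
    + \sum_{i=1}^{I} A_{p_i}\, G_p(t - T_{0i})
    + \sum_{j=1}^{J} A_{a_j}\, \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr]
\]
% Phrase control: impulse response of a critically damped second-order
% system with time constant alpha.
\[
  G_p(t) =
    \begin{cases}
      \alpha^2 t\, e^{-\alpha t}, & t \ge 0\\
      0, & t < 0
    \end{cases}
\]
% Accent control: step response with time constant beta, ceiling gamma.
\[
  G_a(t) =
    \begin{cases}
      \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0\\
      0, & t < 0
    \end{cases}
\]
```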
{"title":"On the effects of speech rate upon parameters of the command-response model for the fundamental frequency contours of speech","authors":"S. Ohno, H. Fujisaki, Yoshikazu Hara","doi":"10.21437/ICSLP.1998-131","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-131","url":null,"abstract":"A command-response model for the process of F0 contour generation has been presented by Fujisaki and his coworkers. The present paper describes the results of a study on the variabilty and speech rate dependency of the model’s parameters in utterances of a speaker of Japanese. It was found that parameters α and β can be considered to be practically constant at a given speech rate, while Fb may vary slightly from utterance to utterance. Among these three parameters, only α was found to have a small but systematic tendency to increase with the speech rate.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130501222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lexical activation by assimilated and reduced tokens (doi:10.21437/ICSLP.1998-440)
M. L. Kelly, E. Bard, Catherine Sotillo
Running speech contains abundant assimilated and phonologically reduced tokens, but there is considerable debate about how such varied pronunciations disrupt access to the corresponding words in listeners' mental lexicons. While previous studies have examined the effects of carefully produced or electronically edited reductions, we present two experiments which compare cross-modal repetition priming for lexical decision by more reduced spontaneous forms and less reduced read forms of the same words uttered by the same speakers in the same phrases. Though less priming is found for the more reduced spontaneous tokens, both versions of words produce significant priming effects, whether the majority of stimuli are taken from spontaneous speech (Experiment 1) or from read speech (Experiment 2). Priming is more robust if the tokens themselves contain the context licensing the reduction.
{"title":"Lexical activation by assimilated and reduced tokens","authors":"M. L. Kelly, E. Bard, Catherine Sotillo","doi":"10.21437/ICSLP.1998-440","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-440","url":null,"abstract":"Running speech contains abundant assimilated and phonologically reduced tokens, but there is considerable debate about how such varied pronunciations disrupt access to the corresponding words in listeners’ mental lexicons. While previous studies have examined the effects of carefully produced or electronically edited reductions, we present two experiments which compare cross-modal repetition priming for lexical decision by more reduced spontaneous forms and less reduced read forms of the same words uttered by the same speakers in the same phrases. Though less priming is found for the more reduced spontaneous tokens, both versions of words produce significant priming effects, whether the majority of stimuli are taken from spontaneous speech (Experiment 1) or from read speech (Experiment 2). Priming is more robust if the tokens themselves contain the context licensing the reduction.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127013762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Chinese text-to-speech system with higher naturalness (doi:10.21437/ICSLP.1998-47)
Ren-Hua Wang, Qingfeng Liu, Yongsheng Teng, Deyu Xia
This paper presents our research efforts on Chinese text-to-speech towards higher naturalness. The main results can be summarized as follows: 1. In the proposed TTS system, syllable-sized units were cut out from real recorded speech, and the synthetic speech was generated by concatenating these units back together. 2. The integration of units synthesized by rules with natural units was tested: an LMA-filter-based synthesizer was successfully developed to generate those units that were difficult to collect from the speech corpus. 3. A new, efficient Chinese character coding scheme, the "Yin Xu Code" (YX Code), has been developed to assist the GB Code. Based on the above results, a Chinese text-to-speech system named "KD-863" has been developed. In the national assessment of Chinese TTS systems held at the end of March 1998 in Beijing, the system achieved first place in the naturalness MOS (Mean Opinion Score).
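As an illustration of the fallback scheme in points 1 and 2, a sketch of syllable-unit selection in which rule-generated (e.g. LMA-synthesised) units fill gaps in the recorded corpus might look like this. The data structures and names are hypothetical, not the KD-863 implementation.

```python
# Illustrative sketch: prefer syllable units cut from recorded speech,
# fall back to rule-generated units for syllables missing from the corpus.
# Placeholder byte strings stand in for real waveform data.

natural_units = {"ni3": b"...", "hao3": b"..."}   # excised from recordings
synthetic_units = {"zhuang4": b"..."}             # generated by rule (LMA filter)

def select_unit(syllable):
    if syllable in natural_units:
        return natural_units[syllable]
    return synthetic_units[syllable]   # KeyError if neither source has it

def synthesise(pinyin_syllables):
    """Concatenate the per-syllable waveforms back together."""
    return b"".join(select_unit(s) for s in pinyin_syllables)

print(len(synthesise(["ni3", "hao3", "zhuang4"])))
```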
{"title":"Towards a Chinese text-to-speech system with higher naturalness","authors":"Ren-Hua Wang, Qingfeng Liu, Yongsheng Teng, Deyu Xia","doi":"10.21437/ICSLP.1998-47","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-47","url":null,"abstract":"This paper presents our research efforts on Chinese text-to-speech towards higher naturalness, the main results can be summarized as follows: 1. In the proposed TTS system the syllable-sized units were cut out from the real recorded speech, the synthetic speech was generated by concatenating these units back together. 2. The integration of units synthesized by rules with natural units was tested. A LMA filter based synthesizer was developed successfully to test and generate those units, which were difficult to be collected from the speech corpus. 3. A new efficient Chinese character coding scheme - \"Yin Xu Code\"(YX Code) has been developed to assist the GB Code. Based on above results, a Chinese text-to-speech system named as \"KD-863\" has been developed. In the national assessment of Chinese TTS systems held at the end of March 1998 in Beijing, the system achieved a first of the naturalness MOS (Mean Opinion Score).","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"2010 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129126930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A linguistic and prosodic database for data-driven Japanese TTS synthesis (doi:10.21437/ICSLP.1998-57)
A. Sakurai, Takashi Natsume, K. Hirose
We propose a method to generate a database that contains a parametric representation of F0 contours associated with linguistic and acoustic information, to be used by data-driven Japanese text-to-speech (TTS) systems. The configuration of the database includes recorded speech, F0 contours and their parametric labels, phonetic transcription with durations, and other linguistic information such as orthographic transcription, part-of-speech (POS) tags, and accent types. All information that is not available by dictionary lookup is obtained automatically. In this paper, we propose a method to automatically obtain parametric labels that describe F0 contours based on a superpositional model. Preliminary tests on a small data set show that the method can find the parametric representation of F0 contours with acceptable accuracy, and that accuracy can be improved by introducing additional linguistic information.
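One way to picture the record layout implied by the database description above; the field names and shapes are assumptions for illustration, not the paper's schema.

```python
# Hypothetical sketch of one utterance record in such a database.
# Field names and value shapes are assumptions, not the published schema.
record = {
    "orthography": "...",                      # orthographic transcription
    "phones": [("k", 0.062), ("o", 0.118)],    # phonetic labels with durations (s)
    "pos_tags": ["noun", "particle"],          # part-of-speech tags
    "accent_types": [0, 1],                    # accent type per accentual phrase
    "f0_labels": {                             # superpositional-model parameters
        "phrase_commands": [(0.10, 0.45)],        # (onset time, magnitude)
        "accent_commands": [(0.25, 0.60, 0.30)],  # (onset, offset, amplitude)
    },
}
```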
{"title":"A linguistic and prosodic database for data-driven Japanese TTS synthesis","authors":"A. Sakurai, Takashi Natsume, K. Hirose","doi":"10.21437/ICSLP.1998-57","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-57","url":null,"abstract":"We propose a method to generate a database that contains a parametric representation of F0 contours associated with linguistic and acoustic information, to be used by data-driven Japanese text-to-speech (TTS) systems. The configuration of the database includes recorded speech, F0 contours and their parametric labels, phonetic transcription with durations, and other linguistic information such as orthographic transcription, part-of-speech (POS) tags, and accent types. All information that is not available by dictionary lookup is obtained automatically. In this paper, we propose a method to automatically obtain parametric labels that describe F0 contours based on a superpositional model. Preliminary tests on a small data set show that the method can find the parametric representation of F0 contours with acceptable accuracy, and that accuracy can be improved by introducing additional linguistic information.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123838535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Categorical perception: important phenomenon or lasting myth? (doi:10.21437/ICSLP.1998-463)
D. Massaro
Categorical perception, or the perceived equality of instances within a phoneme category, has been a central concept in the experimental and theoretical investigation of speech perception. It can be found as fact in most introductory textbooks in perception, cognition, linguistics and cognitive science. This paper analyzes the reasons for the persistent endurance of this concept. A variety of empirical and theoretical research findings are described in order to inform and hopefully to provide a more critical look at this pervasive concept. Given the demise of categorical perception, it is necessary to shift our theoretical focus to how multiple sources of continuous information are processed to support the perception of spoken language.
{"title":"Categorical perception: important phenomenon or lasting myth?","authors":"D. Massaro","doi":"10.21437/ICSLP.1998-463","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-463","url":null,"abstract":"Categorical perception, or the perceived equality of instances within a phoneme category, has been a central concept in the experimental and theoretical investigation of speech perception. It can be found as fact in most introductory textbooks in perception, cognition, linguistics and cognitive science. This paper analyzes the reasons for the persistent endurance of this concept. A variety of empirical and theoretical research findings are described in order to inform and hopefully to provide a more critical look at this pervasive concept. Given the demise of categorical perception, it is necessary to shift our theoretical focus to how multiple sources of continuous information are processed to support the perception of spoken language.","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123846183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transform coding of LSF parameters using wavelets","authors":"D. Petrinović","doi":"10.21437/ICSLP.1998-394","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-394","url":null,"abstract":"","PeriodicalId":117113,"journal":{"name":"5th International Conference on Spoken Language Processing (ICSLP 1998)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114366360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}