2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA): Latest Publications
A New Corpus of Elderly Japanese Speech for Acoustic Modeling, and a Preliminary Investigation of Dialect-Dependent Speech Recognition
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041216
Meiko Fukuda, Ryota Nishimura, H. Nishizaki, Y. Iribe, N. Kitaoka
We have constructed a new speech corpus consisting of utterances from 221 elderly Japanese speakers (average age: 79.2) with the aim of improving the accuracy of automatic speech recognition (ASR) for the elderly. ASR is a beneficial modality for people with impaired vision or limited hand movement, including the elderly. However, speech recognition systems built on standard recognition models, especially standard acoustic models, have been unable to achieve satisfactory performance for elderly speakers, so more accurate acoustic models of elderly speech are essential. Using our new corpus, which includes the speech of elderly people living in three regions of Japan, we conducted speech recognition experiments with a variety of DNN-HMM acoustic models. As training data, we examined whether a standard adult Japanese speech corpus (JNAS), an elderly speech corpus (S-JNAS), or a spontaneous speech corpus (CSJ) was most suitable, and whether adaptation to the dialect of each region improved recognition results. We adapted each of the three acoustic models to all of our speech data, and then re-adapted them using the speech from each region. Without adaptation, the S-JNAS-trained acoustic models gave the best results (entire corpus: 21.85% word error rate, WER). However, after adaptation to our entire corpus, the CSJ-trained models achieved the lowest WERs (entire corpus: 17.42%), and after re-adaptation to each regional dialect, the CSJ-trained models showed a tendency toward further improved recognition rates. We plan to collect more utterances from all over Japan, so that our corpus can serve as a key resource for elderly Japanese speech recognition, and we hope to further improve recognition performance for elderly speech.
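For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference and a hypothesis transcript, normalized by reference length. Below is a minimal, self-contained sketch of the standard computation (not the authors' evaluation code; for Japanese, scoring is typically done on morpheme units produced by a tokenizer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion over six reference words -> WER = 2/6 ≈ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```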
{"title":"A New Corpus of Elderly Japanese Speech for Acoustic Modeling, and a Preliminary Investigation of Dialect-Dependent Speech Recognition","authors":"Meiko Fukuda, Ryota Nishimura, H. Nishizaki, Y. Iribe, N. Kitaoka","doi":"10.1109/O-COCOSDA46868.2019.9041216","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041216","url":null,"abstract":"We have constructed a new speech data corpus consisting of the utterances of 221 elderly Japanese people (average age: 79.2) with the aim of improving the accuracy of automatic speech recognition (ASR) for the elderly. ASR is a beneficial modality for people with impaired vision or limited hand movement, including the elderly. However, speech recognition systems using standard recognition models, especially acoustic models, have been unable to achieve satisfactory performance for the elderly. Thus, creating more accurate acoustic models of the speech of elderly users is essential for improving speech recognition for the elderly. Using our new corpus, which includes the speech of elderly people living in three regions of Japan, we conducted speech recognition experiments using a variety of DNN-HNN acoustic models. As training data for our acoustic models, we examined whether a standard adult Japanese speech corpus (JNAS), an elderly speech corpus (S-JNAS) or a spontaneous speech corpus (CSJ) was most suitable, and whether or not adaptation to the dialect of each region improved recognition results. We adapted each of our three acoustic models to all of our speech data, and then re-adapt them using speech from each region. Without adaptation, the best recognition results were obtained when using the S-JNAS trained acoustic models (total corpus: 21.85% Word Error Rate). However, after adaptation of our acoustic models to our entire corpus, the CSJ trained models achieved the lowest WERs (entire corpus: 17.42%). Moreover, after readaptation to each regional dialect, the CSJ trained acoustic model with adaptation to regional speech data showed tendencies of improved recognition rates. We plan to collect more utterances from all over Japan, so that our corpus can be used as a key resource for elderly speech recognition in Japanese. We also hope to achieve further improvement in recognition performance for elderly speech.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124303078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Study of Prosody-Pragmatics Interface with Focus Functioning as Pragmatic Markers: The Case of Question and Statement
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041157
Siyi Cao, Yizhong Xu, Xiaoli Ji
Building on the perspective of [22] that Pragmatic Markers (PMs) are realized mainly through prosody, this paper investigates whether, when focus functions as a pragmatic marker, pragmatic factors restrict non-native speakers' prosodic realization of PMs and lead to misunderstanding in intercultural communication, in the case of declarative questions and statements. The pitch contours of sentences produced by 17 Chinese EFL (English as a foreign language) learners (non-native speakers) were compared with those of six native speakers, using four sentences from the AESOP corpus. The results demonstrate that both native and non-native speakers do realize pragmatic markers (focused words) through prosodic cues (pitch range), but differ in how they realize them, leading to pragmatic misunderstanding. These findings support the view of [22] and demonstrate that pragmatic factors such as L1 transfer, L2 teaching, and the proficiency of non-native speakers constrain the prosodic realization of pragmatic markers, indicating conventionality in cross-cultural conversation.
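The prosodic cue examined here, pitch range, can be approximated from an F0 track. The sketch below is an illustration, not the AESOP analysis pipeline, and the file names are hypothetical; it measures an utterance's pitch range in semitones using librosa's pYIN tracker, with percentiles to reduce the effect of tracking glitches:

```python
import librosa
import numpy as np

def pitch_range_semitones(wav_path: str) -> float:
    """Estimate the pitch range (in semitones) of one utterance as the
    distance between a robust floor and ceiling of the F0 contour."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, _voiced, _prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[np.isfinite(f0)]  # pyin marks unvoiced frames as NaN
    if f0.size == 0:
        return 0.0
    lo, hi = np.percentile(f0, [5, 95])  # robust to octave errors
    return 12.0 * np.log2(hi / lo)

# Hypothetical file names: compare a learner's and a native speaker's
# rendition of the same declarative question.
print(pitch_range_semitones("learner_question.wav"),
      pitch_range_semitones("native_question.wav"))
```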
{"title":"The Study of Prosody-Pragmatics Interface with Focus Functioning as Pragmatic Markers: The Case of Question and Statement","authors":"Siyi Cao, Yizhong Xu, Xiaoli Ji","doi":"10.1109/O-COCOSDA46868.2019.9041157","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041157","url":null,"abstract":"This paper investigated that based on [22] ‘s perspective that Pragmatic Markers (PMs) are realized mainly through prosody between native speakers and non-native speakers, when focus functions as pragmatic markers, whether pragmatic factors from non-native speakers restrict the realization of Pragmatic Markers through prosody leading to misunderstanding in intercultural communication, in the case of declarative questions and statements. Pitch contours of 17 Chinese EFL (English as a foreign language) learners (non-native speakers)’ sentences were compared with that of six native speakers using four sentences from AESOP. The results demonstrated that native speakers and non-native speakers indeed realized pragmatic markers (focused words) through prosodic cues (pitch range), but differed in the way of realization for pragmatic markers, leading to pragmatic misunderstanding. This paper proves [22] ‘s opinion and demonstrates that pragmatic elements from transfer, L2 teaching, proficiency of non-native speakers constraint prosodic ways for realizing pragmatic markers, which indicates conventionality in cross-culture conversation.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132616409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison between read and spontaneous speech assessment of L2 Korean
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060846
S. Yang, Minhwa Chung
This paper describes two experiments exploring the relationship between linguistic factors and perceived proficiency in read and spontaneous speech. In Experiment 1, 5,000 read-speech utterances from 50 non-native speakers of Korean, and in Experiment 2, 6,000 spontaneous-speech utterances, were scored for proficiency by native raters and analyzed with respect to factors known to be related to perceived proficiency. The results show that the investigated factors can be used to predict proficiency ratings, with fluency and pitch-and-accent accuracy being strong predictors for both read and spontaneous speech. We also observe that while proficiency ratings of read speech are mainly related to segmental accuracy, those of spontaneous speech appear more related to pitch and accent accuracy. Moreover, proficiency in read speech does not always equate to proficiency in spontaneous speech, and vice versa: the per-speaker Pearson correlation between the two is 0.535.
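As a concrete illustration of that final statistic, a per-speaker correlation can be obtained by averaging each speaker's utterance ratings within each speaking style and correlating the two sets of means across speakers. A minimal sketch with made-up ratings (not the study's data):

```python
import numpy as np
from scipy.stats import pearsonr

def per_speaker_correlation(read_scores: dict, spon_scores: dict) -> float:
    """Correlate each speaker's mean read-speech rating with their mean
    spontaneous-speech rating. Both arguments map a speaker id to a list
    of utterance-level proficiency ratings."""
    speakers = sorted(set(read_scores) & set(spon_scores))
    read_means = [np.mean(read_scores[s]) for s in speakers]
    spon_means = [np.mean(spon_scores[s]) for s in speakers]
    r, _p = pearsonr(read_means, spon_means)
    return r

# Toy example with made-up ratings for three speakers
read = {"spk01": [3.2, 3.5], "spk02": [4.1, 4.4], "spk03": [2.0, 2.3]}
spon = {"spk01": [2.9, 3.1], "spk02": [4.0, 3.8], "spk03": [2.6, 2.2]}
print(per_speaker_correlation(read, spon))
```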
{"title":"Comparison between read and spontaneous speech assessment of L2 Korean","authors":"S. Yang, Minhwa Chung","doi":"10.1109/O-COCOSDA46868.2019.9060846","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9060846","url":null,"abstract":"This paper describes two experiments aimed at exploring the relationship between linguistic aspects and perceived proficiency in read and spontaneous speech. 5,000 utterances of read speech by 50 non-native speakers of Korean in Experiment 1, and of 6,000 spontaneous speech utterances in Experiment 2 were scored for proficiency by native human raters and were analyzed by factors known to be related to perceived proficiency. The results show that the factors investigated in this study can be employed to predict proficiency ratings, and the predictive power of fluency and pitch and accent accuracy is strong for both read and spontaneous speech. We also observe that while proficiency ratings of read speech are mainly related to segmental accuracy, those of spontaneous speech appear to be more related to pitch and accent accuracy. Moreover, proficiency in read speech does not always equate to the proficiency in spontaneous speech, and vice versa, with Pearson’s per-speaker correlation score of 0.535.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122246679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Large Collection of Sentences Read Aloud by Vietnamese Learners of Japanese and Native Speaker's Reverse Shadowings
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041215
Shintaro Ando, Z. Lin, Tasavat Trisitichoke, Y. Inoue, Fuki Yoshizawa, D. Saito, N. Minematsu
The main objective of language learning is to acquire good communication skills in the target language. From this viewpoint, the primary goal of pronunciation training is to speak with sufficiently intelligible and comprehensible pronunciation, not necessarily a native-sounding one. However, achieving such pronunciation is still difficult for many learners, mainly because they lack opportunities to use the language they are learning and to receive feedback on intelligibility or comprehensibility from native listeners. To address this problem, the authors previously proposed a novel method of native speakers' reverse shadowing and showed that the degree of inarticulation observed in native speakers' shadowings of learners' utterances can be used to estimate the comprehensibility of the learners' speech. One major limitation of that study, however, was its small scale: only six learners participated. In this study, we therefore carried out a larger collection of Japanese utterances read aloud by 60 Vietnamese learners, together with Japanese native speakers' shadowings of those utterances. An analysis of the native speakers' subjective ratings suggests that several modifications made since our previous experiment help make the reverse-shadowing framework more pedagogically effective. Further, a preliminary analysis of the recorded shadowings shows good correlation with listeners' perceived shadowability.
{"title":"A Large Collection of Sentences Read Aloud by Vietnamese Learners of Japanese and Native Speaker's Reverse Shadowings","authors":"Shintaro Ando, Z. Lin, Tasavat Trisitichoke, Y. Inoue, Fuki Yoshizawa, D. Saito, N. Minematsu","doi":"10.1109/O-COCOSDA46868.2019.9041215","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041215","url":null,"abstract":"The main objective of language learning is to acquire good communication skills in the target language. From this viewpoint, the primary goal of pronunciation training is to become able to speak in an intelligible-enough or comprehensible-enough pronunciation, not a native-sounding one. However, achieving such pronunciation is still not easy for many learners mainly because of their lack of opportunity to use the language they learn and to receive some feedbacks on intelligibility or comprehensibility from native listeners. In order to solve this problem, the authors previously proposed a novel method of native speakers' reverse shadowing and showed that the degree of inarticulation observed in native speakers' shadowings of learners' utterances can be used to estimate the comprehensibility of learners' speech. One major problem in our previous research however, was that the experiment was done on a relatively small scale; the number of learners was only six. For this reason, in this study, we carried out a larger collection of Japanese utterances read aloud by 60 Vietnamese learners and Japanese native speakers' shadowings of those utterances. An analysis of the subjective ratings done by the native speakers implies that some modifications we made from our previous experiment contribute to making the framework of native speakers' reverse shadowing more pedagogically effective. Further, a preliminary analysis of the recorded shadowings shows good correlations to listeners' perceived shadowability.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"33 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132596172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Challenges Posed by Voice Interface to Child-Agent Collaborative Storytelling
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041233
Ethel Ong, Junlyn Bryan Alburo, Christine Rachel De Jesus, Luisa Katherine Gilig, Dionne Tiffany Ong
Child-agent collaborative storytelling can be facilitated through text and voice interfaces. Voice interfaces are more intuitive and more closely resemble the way people usually relate to one another. This may be attributed to the colloquial character of everyday conversation, which does away with the rigid linguistic conventions, such as correct grammar and spelling, typically expected in text interfaces. However, the limitations of the voice interfaces currently available in virtual assistants can lead to communication breakdown: users become frustrated and confused when the agent fails to provide the needed support, often because it has misinterpreted their input. In such situations, text-based interfaces from messaging applications may serve as an alternative communication channel. In this paper, we provide a comparative analysis of how our collaborative storytelling agent processes user input, based on conversation logs from a voice interface built with Google Assistant and a text interface built on Google Firebase. We give a brief overview of the dialogue strategies employed by our agent and how they are manifested through each interface, identify the obstacles that incorrect input processing poses to the collaborative task, and offer suggestions on how these challenges can be addressed.
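One simple way to quantify such a channel comparison from conversation logs is to measure how often each interface fails to map a user turn to an intent. The sketch below assumes a hypothetical line-delimited JSON log schema (the paper does not specify its log format):

```python
import json
from collections import Counter

def misunderstanding_rate(log_path: str) -> dict:
    """Per-channel fraction of user turns the agent failed to interpret.
    Assumes a hypothetical log format: one JSON object per line with a
    'channel' field ('voice' or 'text') and a 'matched_intent' field
    that is null when intent matching failed."""
    turns, failures = Counter(), Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            turn = json.loads(line)
            turns[turn["channel"]] += 1
            if turn["matched_intent"] is None:
                failures[turn["channel"]] += 1
    return {ch: failures[ch] / turns[ch] for ch in turns}
```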
{"title":"Challenges Posed by Voice Interface to Child- Agent Collaborative Storytelling","authors":"Ethel Ong, Junlyn Bryan Alburo, Christine Rachel De Jesus, Luisa Katherine Gilig, Dionne Tiffany Ong","doi":"10.1109/O-COCOSDA46868.2019.9041233","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041233","url":null,"abstract":"Child-agent collaborative storytelling can be facilitated through text and voice interfaces. Voice interfaces are more intuitive and closely resemble the way people usually relate to one another. This may be attributed to the colloquial characteristics of everyday conversations that do away with rigid linguistic structures typically present in text interfaces, such as observing the use of correct grammar and spelling. However, the capabilities of voice-based interfaces currently available in virtual assistants can lead to failure in communication due to user frustration and confusion when the agent is not providing the needed support, possibly caused by the latter's misinterpretation of the user's input. In such situations, text-based interfaces from messaging applications may be used as an alternative communication channel. In this paper, we provide a comparative analysis of the performance of our collaborative storytelling agent in processing user input by analyzing conversation logs from voice-based interface using Google Assistant, and text-based interface using Google Firebase. To do this, we give a brief overview of the different dialogue strategies employed by our agent, and how these are manifested through the interfaces. We also identify the obstacles posed by incorrect input processing to the collaborative tasks, and offer suggestions on how these challenges can be addressed.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127344906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging
Pub Date: 2019-08-07 | DOI: 10.1109/O-COCOSDA46868.2019.9041202
B. Nguyen, V. H. Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, Luong Chi Mai
In recent years, studies on automatic speech recognition (ASR) have shown outstanding results, reaching human parity on short speech segments. However, difficulties remain in standardizing ASR output, such as restoring capitalization and punctuation in long-speech transcription. These problems make it harder for readers to understand the ASR output and also hinder downstream natural language processing models such as NER, POS tagging, and semantic parsing. In this paper, we propose a method to restore punctuation and capitalization in long-speech ASR transcripts. The method is based on Transformer models and chunk merging, which allows us to (1) build a single model that performs punctuation and capitalization restoration in one pass, and (2) decode in parallel while improving prediction accuracy. Experiments on the British National Corpus show that the proposed approach outperforms existing methods in both accuracy and decoding speed.
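The abstract does not spell out the merging procedure, but the general overlapped-chunk idea can be sketched as follows: split the word stream into overlapping chunks, tag each chunk independently (hence parallel decoding) with a single model that jointly predicts a punctuation-and-case label per word, then resolve the overlaps by keeping each word's label from the chunk in which it sits farther from a boundary. The tagger interface and overlap handling below are assumptions, not the authors' implementation:

```python
from typing import Callable, List

def restore_long_transcript(words: List[str],
                            tag_chunk: Callable[[List[str]], List[str]],
                            chunk_size: int = 100,
                            overlap: int = 20) -> List[str]:
    """Tag a long transcript in overlapping chunks and merge the results.
    tag_chunk stands in for a Transformer that returns one joint
    punctuation+capitalization label per input word. Chunks are
    independent, so the tag_chunk calls can run in parallel."""
    step, half = chunk_size - overlap, overlap // 2
    starts = list(range(0, max(len(words) - overlap, 1), step))
    merged: List[str] = []
    for k, start in enumerate(starts):
        chunk = words[start:start + chunk_size]
        labels = tag_chunk(chunk)
        # In each overlap, keep the half closer to the centre of its chunk:
        # drop the first `half` labels except in the first chunk, and the
        # last `overlap - half` labels except in the last chunk.
        lo = 0 if k == 0 else half
        hi = len(chunk) if k == len(starts) - 1 else chunk_size - (overlap - half)
        merged.extend(labels[lo:hi])
    return merged

# Toy stand-in tagger: capitalize the first word of each chunk, no punctuation.
toy = lambda chunk: [("CAP" if i == 0 else "LOW") + "|NONE"
                     for i in range(len(chunk))]
print(len(restore_long_transcript([f"w{i}" for i in range(250)], toy)))  # 250
```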
{"title":"Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging","authors":"B. Nguyen, V. H. Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, Luong Chi Mai","doi":"10.1109/O-COCOSDA46868.2019.9041202","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041202","url":null,"abstract":"In recent years, studies on automatic speech recognition (ASR) have shown outstanding results that reach human parity on short speech segments. However, there are still difficulties in standardizing the output of ASR such as capitalization and punctuation restoration for long-speech transcription. The problems obstruct readers to understand the ASR output semantically and also cause difficulties for natural language processing models such as NER, POS and semantic parsing. In this paper, we propose a method to restore the punctuation and capitalization for long-speech ASR transcription. The method is based on Transformer models and chunk merging that allows us to (1), build a single model that performs punctuation and capitalization in one go, and (2), perform decoding in parallel while improving the prediction accuracy. Experiments on British National Corpus showed that the proposed approach outperforms existing methods in both accuracy and decoding speed.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128874968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}