2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA): Latest Publications
The Architecture of Speech-to-Speech Translator for Mobile Conversation
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041196
Agung Santosa, Andi Djalal Latief, Hammam Riza, Asril Jarin, Lyla Ruslana Aini, Gunarso, Gita Citra Puspita, M. T. Uliniansyah, Elvira Nurfadhilah, Harnum A. Prafitia, Made Gunawan
Building on the natural language processing competencies and engineering results it has accumulated since 1987, BPPT has developed an English-Bahasa Indonesia speech-to-speech translation (S2ST) system. In this paper, we propose an architecture for a speech-to-speech translation system for Android-based mobile conversation that uses a separate mobile device for each language. The architecture applies three leading technologies: WebSocket, REST, and JSON. The system employs a two-way communication protocol between the two users and a simple voice activity detector that detects the boundaries of a user's utterance.
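As a rough illustration of what such a two-device exchange might look like on the wire, here is a minimal Python sketch of a client pushing one VAD-segmented utterance as JSON over a WebSocket. The endpoint, message fields, and server behavior are all our own assumptions for illustration, not details from the paper:

```python
# Hypothetical client side of a two-way S2ST protocol: each device sends
# its user's utterances as JSON over a WebSocket and receives the
# translation produced for the other language. The wire format here is
# an assumption; the paper does not specify its message schema.
import asyncio
import base64
import json

import websockets  # pip install websockets


async def send_utterance(uri: str, session_id: str, lang: str, pcm: bytes) -> dict:
    """Send one VAD-segmented utterance and wait for the translated reply."""
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({
            "type": "utterance",
            "session": session_id,      # pairs the two devices
            "source_lang": lang,        # e.g. "en" or "id"
            "audio": base64.b64encode(pcm).decode("ascii"),
        }))
        # The server is assumed to run ASR -> MT -> TTS and reply with
        # the translation for the device on the other end of the session.
        return json.loads(await ws.recv())


# Example usage (assumed endpoint): ship a detected English utterance,
# then print the Indonesian translation text.
# reply = asyncio.run(send_utterance("ws://example.org/s2st", "room-1", "en", pcm))
# print(reply["translated_text"])
```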
{"title":"The Architecture of Speech-to-Speech Translator for Mobile Conversation","authors":"Agung Santosa, Andi Djalal Latief, Hammam Riza, Asril Jarin, Lyla Ruslana Aini, Gunarso, Gita Citra Puspita, M. T. Uliniansyah, Elvira Nurfadhilah, Harnum A. Prafitia, Made Gunawan","doi":"10.1109/O-COCOSDA46868.2019.9041196","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041196","url":null,"abstract":"With competencies and the results of the engineering of natural language processing technology owned by BPPT since 1987, BPPT develops an English-Bahasa Indonesia speech-to-speech translation system (S2ST). In this paper, we propose an architecture of speech-to-speech translation system for Android-based mobile conversation using separate mobile devices for each language. This architecture applies three leading technologies, namely: WebSocket, REST, and JSON. The system utilizes a two-way communication protocol between two users and a simple voice activation detector that can detect a boundary of user's utterance.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116075403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Study of Prosody-Pragmatics Interface with Focus Functioning as Pragmatic Markers: The Case of Question and Statement
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041157
Siyi Cao, Yizhong Xu, Xiaoli Ji
Building on the perspective of [22] that pragmatic markers (PMs) are realized mainly through prosody, this paper investigated whether, when focus functions as a pragmatic marker, pragmatic factors restrict non-native speakers' prosodic realization of PMs and thereby cause misunderstanding in intercultural communication, taking declarative questions and statements as the test case. Pitch contours of sentences produced by 17 Chinese EFL (English as a foreign language) learners (non-native speakers) were compared with those of six native speakers, using four sentences from AESOP. The results demonstrate that both native and non-native speakers do realize pragmatic markers (focused words) through prosodic cues (pitch range), but differ in how they realize them, leading to pragmatic misunderstanding. This paper supports the view of [22] and demonstrates that pragmatic factors such as L1 transfer, L2 teaching, and non-native speakers' proficiency constrain the prosodic realization of pragmatic markers, which indicates conventionality in cross-cultural conversation.
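A pitch-range comparison of this kind can be sketched as follows. The use of librosa's pYIN tracker and the semitone conversion are our illustrative choices, since the paper's analysis tool is not named here:

```python
# Sketch of the pitch-range measurement such a comparison relies on:
# estimate F0 with pYIN, then take the range over voiced frames.
import librosa
import numpy as np


def pitch_range_semitones(wav_path: str) -> float:
    """F0 range (in semitones) over the voiced frames of one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0 = f0[voiced & ~np.isnan(f0)]
    if f0.size == 0:
        return 0.0  # no voiced frames detected
    # Semitones make ranges comparable across speakers with different
    # baseline pitch (e.g. male vs. female voices).
    return float(12 * np.log2(f0.max() / f0.min()))
```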
{"title":"The Study of Prosody-Pragmatics Interface with Focus Functioning as Pragmatic Markers: The Case of Question and Statement","authors":"Siyi Cao, Yizhong Xu, Xiaoli Ji","doi":"10.1109/O-COCOSDA46868.2019.9041157","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041157","url":null,"abstract":"This paper investigated that based on [22] ‘s perspective that Pragmatic Markers (PMs) are realized mainly through prosody between native speakers and non-native speakers, when focus functions as pragmatic markers, whether pragmatic factors from non-native speakers restrict the realization of Pragmatic Markers through prosody leading to misunderstanding in intercultural communication, in the case of declarative questions and statements. Pitch contours of 17 Chinese EFL (English as a foreign language) learners (non-native speakers)’ sentences were compared with that of six native speakers using four sentences from AESOP. The results demonstrated that native speakers and non-native speakers indeed realized pragmatic markers (focused words) through prosodic cues (pitch range), but differed in the way of realization for pragmatic markers, leading to pragmatic misunderstanding. This paper proves [22] ‘s opinion and demonstrates that pragmatic elements from transfer, L2 teaching, proficiency of non-native speakers constraint prosodic ways for realizing pragmatic markers, which indicates conventionality in cross-culture conversation.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132616409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison between read and spontaneous speech assessment of L2 Korean
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060846
S. Yang, Minhwa Chung
This paper describes two experiments that explore the relationship between linguistic factors and perceived proficiency in read and spontaneous speech. In Experiment 1, 5,000 read-speech utterances by 50 non-native speakers of Korean, and in Experiment 2, 6,000 spontaneous-speech utterances, were scored for proficiency by native human raters and analyzed with respect to factors known to be related to perceived proficiency. The results show that the investigated factors can be used to predict proficiency ratings, and that fluency and pitch-and-accent accuracy have strong predictive power for both read and spontaneous speech. We also observe that while proficiency ratings of read speech are related mainly to segmental accuracy, those of spontaneous speech appear to be more related to pitch and accent accuracy. Moreover, proficiency in read speech does not always equate to proficiency in spontaneous speech, and vice versa, with a per-speaker Pearson correlation of 0.535.
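The reported per-speaker correlation can be illustrated with a short computation: average each speaker's ratings within each style, then correlate the two style means across speakers. The data layout below is an assumption, and a speaker must appear in both styles for this to work:

```python
# Minimal sketch of a per-speaker Pearson correlation between read and
# spontaneous proficiency ratings. Input layout is hypothetical.
from collections import defaultdict

from scipy.stats import pearsonr


def per_speaker_correlation(scores: list[tuple[str, str, float]]) -> float:
    """scores: (speaker_id, style in {'read', 'spont'}, proficiency rating)."""
    by_spk = defaultdict(lambda: {"read": [], "spont": []})
    for spk, style, rating in scores:
        by_spk[spk][style].append(rating)
    # One mean rating per speaker per style, correlated across speakers.
    read = [sum(v["read"]) / len(v["read"]) for v in by_spk.values()]
    spont = [sum(v["spont"]) / len(v["spont"]) for v in by_spk.values()]
    r, _ = pearsonr(read, spont)
    return r
```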
{"title":"Comparison between read and spontaneous speech assessment of L2 Korean","authors":"S. Yang, Minhwa Chung","doi":"10.1109/O-COCOSDA46868.2019.9060846","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9060846","url":null,"abstract":"This paper describes two experiments aimed at exploring the relationship between linguistic aspects and perceived proficiency in read and spontaneous speech. 5,000 utterances of read speech by 50 non-native speakers of Korean in Experiment 1, and of 6,000 spontaneous speech utterances in Experiment 2 were scored for proficiency by native human raters and were analyzed by factors known to be related to perceived proficiency. The results show that the factors investigated in this study can be employed to predict proficiency ratings, and the predictive power of fluency and pitch and accent accuracy is strong for both read and spontaneous speech. We also observe that while proficiency ratings of read speech are mainly related to segmental accuracy, those of spontaneous speech appear to be more related to pitch and accent accuracy. Moreover, proficiency in read speech does not always equate to the proficiency in spontaneous speech, and vice versa, with Pearson’s per-speaker correlation score of 0.535.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122246679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Large Collection of Sentences Read Aloud by Vietnamese Learners of Japanese and Native Speaker's Reverse Shadowings
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041215
Shintaro Ando, Z. Lin, Tasavat Trisitichoke, Y. Inoue, Fuki Yoshizawa, D. Saito, N. Minematsu
The main objective of language learning is to acquire good communication skills in the target language. From this viewpoint, the primary goal of pronunciation training is to be able to speak with intelligible-enough or comprehensible-enough pronunciation, not a native-sounding one. However, achieving such pronunciation is still not easy for many learners, mainly because they lack opportunities to use the language they are learning and to receive feedback on intelligibility or comprehensibility from native listeners. To address this problem, the authors previously proposed a novel method of native speakers' reverse shadowing and showed that the degree of inarticulation observed in native speakers' shadowings of learners' utterances can be used to estimate the comprehensibility of the learners' speech. One major problem with our previous research, however, was its small scale: only six learners participated. In this study, we therefore carried out a larger collection of Japanese utterances read aloud by 60 Vietnamese learners, together with Japanese native speakers' shadowings of those utterances. An analysis of the native speakers' subjective ratings implies that several modifications made since our previous experiment help make the reverse-shadowing framework more pedagogically effective. Further, a preliminary analysis of the recorded shadowings shows good correlation with listeners' perceived shadowability.
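The quantitative core of the framework is the link between shadowing disfluency and perceived shadowability. Below is a minimal sketch of that correlation step, assuming wholly hypothetical record structures and a Spearman rank correlation; the paper's actual features and rating scales are not specified here:

```python
# Sketch: correlate a per-utterance inarticulation score derived from
# native speakers' shadowings with the shadowers' subjective ratings.
# Field names and the choice of Spearman correlation are assumptions.
from statistics import mean

from scipy.stats import spearmanr


def shadowability_correlation(records: list[dict]) -> float:
    """records: [{'utt_id': str, 'inarticulation': [...], 'rating': [...]}, ...]
    with one inarticulation score and one rating per shadower."""
    per_utt_score = [mean(r["inarticulation"]) for r in records]
    per_utt_rating = [mean(r["rating"]) for r in records]
    # Higher inarticulation in the shadowings should track lower perceived
    # shadowability, so a strong negative correlation is the expected sign.
    rho, _ = spearmanr(per_utt_score, per_utt_rating)
    return rho
```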
{"title":"A Large Collection of Sentences Read Aloud by Vietnamese Learners of Japanese and Native Speaker's Reverse Shadowings","authors":"Shintaro Ando, Z. Lin, Tasavat Trisitichoke, Y. Inoue, Fuki Yoshizawa, D. Saito, N. Minematsu","doi":"10.1109/O-COCOSDA46868.2019.9041215","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041215","url":null,"abstract":"The main objective of language learning is to acquire good communication skills in the target language. From this viewpoint, the primary goal of pronunciation training is to become able to speak in an intelligible-enough or comprehensible-enough pronunciation, not a native-sounding one. However, achieving such pronunciation is still not easy for many learners mainly because of their lack of opportunity to use the language they learn and to receive some feedbacks on intelligibility or comprehensibility from native listeners. In order to solve this problem, the authors previously proposed a novel method of native speakers' reverse shadowing and showed that the degree of inarticulation observed in native speakers' shadowings of learners' utterances can be used to estimate the comprehensibility of learners' speech. One major problem in our previous research however, was that the experiment was done on a relatively small scale; the number of learners was only six. For this reason, in this study, we carried out a larger collection of Japanese utterances read aloud by 60 Vietnamese learners and Japanese native speakers' shadowings of those utterances. An analysis of the subjective ratings done by the native speakers implies that some modifications we made from our previous experiment contribute to making the framework of native speakers' reverse shadowing more pedagogically effective. Further, a preliminary analysis of the recorded shadowings shows good correlations to listeners' perceived shadowability.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"33 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132596172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Challenges Posed by Voice Interface to Child-Agent Collaborative Storytelling
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041233
Ethel Ong, Junlyn Bryan Alburo, Christine Rachel De Jesus, Luisa Katherine Gilig, Dionne Tiffany Ong
Child-agent collaborative storytelling can be facilitated through text and voice interfaces. Voice interfaces are more intuitive and more closely resemble the way people usually relate to one another. This may be attributed to the colloquial character of everyday conversation, which does away with the rigid linguistic structures typically expected in text interfaces, such as correct grammar and spelling. However, the voice interfaces currently available in virtual assistants can lead to communication failure: users become frustrated and confused when the agent does not provide the needed support, possibly because it has misinterpreted their input. In such situations, text-based interfaces from messaging applications may be used as an alternative communication channel. In this paper, we provide a comparative analysis of how our collaborative storytelling agent processes user input, based on conversation logs from a voice interface built with Google Assistant and a text interface built on Google Firebase. To do this, we give a brief overview of the dialogue strategies employed by our agent and how they are manifested through each interface. We also identify the obstacles that incorrect input processing poses to the collaborative task and offer suggestions on how these challenges can be addressed.
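One way to quantify such a log comparison is to measure how often the agent falls back to a default "I didn't understand" response in each channel. The sketch below assumes a hypothetical JSON-lines log with `interface` and `intent` fields; real Google Assistant and Firebase log schemas differ:

```python
# Sketch of a per-interface fallback-rate count over conversation logs.
# The log format and the "fallback" intent marker are hypothetical.
import json
from collections import Counter


def fallback_rate(log_path: str) -> dict:
    """Return per-interface fallback rates from a JSON-lines log."""
    totals, fallbacks = Counter(), Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            turn = json.loads(line)           # {'interface': 'voice'|'text',
            iface = turn["interface"]         #  'intent': str, ...}
            totals[iface] += 1
            if turn["intent"] == "fallback":  # agent failed to parse input
                fallbacks[iface] += 1
    return {iface: fallbacks[iface] / totals[iface] for iface in totals}
```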
{"title":"Challenges Posed by Voice Interface to Child- Agent Collaborative Storytelling","authors":"Ethel Ong, Junlyn Bryan Alburo, Christine Rachel De Jesus, Luisa Katherine Gilig, Dionne Tiffany Ong","doi":"10.1109/O-COCOSDA46868.2019.9041233","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041233","url":null,"abstract":"Child-agent collaborative storytelling can be facilitated through text and voice interfaces. Voice interfaces are more intuitive and closely resemble the way people usually relate to one another. This may be attributed to the colloquial characteristics of everyday conversations that do away with rigid linguistic structures typically present in text interfaces, such as observing the use of correct grammar and spelling. However, the capabilities of voice-based interfaces currently available in virtual assistants can lead to failure in communication due to user frustration and confusion when the agent is not providing the needed support, possibly caused by the latter's misinterpretation of the user's input. In such situations, text-based interfaces from messaging applications may be used as an alternative communication channel. In this paper, we provide a comparative analysis of the performance of our collaborative storytelling agent in processing user input by analyzing conversation logs from voice-based interface using Google Assistant, and text-based interface using Google Firebase. To do this, we give a brief overview of the different dialogue strategies employed by our agent, and how these are manifested through the interfaces. We also identify the obstacles posed by incorrect input processing to the collaborative tasks, and offer suggestions on how these challenges can be addressed.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127344906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging
Pub Date: 2019-08-07 | DOI: 10.1109/O-COCOSDA46868.2019.9041202
B. Nguyen, V. H. Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, Luong Chi Mai
In recent years, studies on automatic speech recognition (ASR) have shown outstanding results, reaching human parity on short speech segments. However, standardizing ASR output remains difficult, particularly capitalization and punctuation restoration for long-speech transcription. These problems prevent readers from understanding ASR output semantically and also hinder downstream natural language processing models such as NER, POS tagging, and semantic parsing. In this paper, we propose a method to restore punctuation and capitalization in long-speech ASR transcripts. The method is based on Transformer models and chunk merging, which allows us to (1) build a single model that performs punctuation and capitalization restoration in one pass, and (2) decode in parallel while improving prediction accuracy. Experiments on the British National Corpus show that the proposed approach outperforms existing methods in both accuracy and decoding speed.
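The chunk-merging idea can be sketched independently of the Transformer itself: split the long token stream into overlapping chunks, label the chunks (in parallel, since they are independent), and resolve the overlaps in favor of the chunk with more surrounding context. The window size, overlap, and merge rule below are illustrative assumptions, not necessarily the paper's exact scheme:

```python
# Illustrative sketch of overlapped chunk decoding and merging for
# punctuation/capitalization restoration. `label_chunk` stands in for a
# Transformer tagger applied to a fixed-size token window.
from typing import Callable


def restore_long(tokens: list[str],
                 label_chunk: Callable[[list[str]], list[str]],
                 size: int = 100, overlap: int = 20) -> list[str]:
    """Tag a long token sequence with a fixed-window model, one label per token."""
    step = size - overlap
    merged: list[str] = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        labels = label_chunk(tokens[start:start + size])
        # In the overlap with the previous chunk, keep the previous chunk's
        # labels for the first half (it had more right context there) and
        # the new chunk's labels for the second half (more left context).
        keep_from = 0 if start == 0 else overlap // 2
        merged = merged[:start + keep_from] + labels[keep_from:]
    return merged
```

In practice the chunks can be labeled concurrently and only the merge is sequential, which is what makes parallel decoding of long transcripts possible; a production merge rule could also weigh model confidence in the overlap region.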
{"title":"Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging","authors":"B. Nguyen, V. H. Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, Luong Chi Mai","doi":"10.1109/O-COCOSDA46868.2019.9041202","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041202","url":null,"abstract":"In recent years, studies on automatic speech recognition (ASR) have shown outstanding results that reach human parity on short speech segments. However, there are still difficulties in standardizing the output of ASR such as capitalization and punctuation restoration for long-speech transcription. The problems obstruct readers to understand the ASR output semantically and also cause difficulties for natural language processing models such as NER, POS and semantic parsing. In this paper, we propose a method to restore the punctuation and capitalization for long-speech ASR transcription. The method is based on Transformer models and chunk merging that allows us to (1), build a single model that performs punctuation and capitalization in one go, and (2), perform decoding in parallel while improving the prediction accuracy. Experiments on British National Corpus showed that the proposed approach outperforms existing methods in both accuracy and decoding speed.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128874968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}