2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA): Latest Publications
Indian Languages Corpus for Speech Recognition
Joyanta Basu, Soma Khan, Rajib Roy, Babita Saxena, Dipankar Ganguly, Sunita Arora, K. Arora, S. Bansal, S. Agrawal
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041171
Robust speech recognition systems for various languages have moved beyond research labs into commercial products, owing largely to major developments in machine learning, especially deep learning. However, advanced speech recognition systems can be developed only when specially curated speech data is available, and systems of usable quality are yet to be built for most Indian languages. This paper describes the design and development of a standard speech corpus that can be used for developing and benchmarking general-purpose ASR systems. The database has been developed for three Indian languages: Hindi, Bengali, and Indian English. The corpus design incorporates important parameters such as phonetic coverage and distribution. The data was recorded by 1500 speakers per language, male and female, across different age groups and in varying environments. Recordings were collected on a server through an online recording system and transcribed using semi-automatic tools. The paper describes the corpus design methodology, the challenges faced, and the approaches adopted to overcome them. The whole process of designing the speech database is generic enough to be applied to other languages as well.
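The phonetic-coverage criterion mentioned in this abstract can be illustrated with a small sketch. The greedy sentence selection below is an assumption about how such coverage might be computed when designing recording prompts, not the authors' actual pipeline; the toy word-to-phoneme `lexicon` is hypothetical.

```python
def sentence_phonemes(sentence, lexicon):
    """Phonemes of a sentence, via a word-to-phonemes lexicon."""
    return [p for word in sentence.split() for p in lexicon.get(word, [])]

def greedy_select(sentences, lexicon, target):
    """Greedily pick sentences that add the most not-yet-covered phonemes."""
    selected, covered, pool = [], set(), list(sentences)
    while pool and not target <= covered:
        best = max(pool, key=lambda s: len(set(sentence_phonemes(s, lexicon)) - covered))
        gain = set(sentence_phonemes(best, lexicon)) - covered
        if not gain:  # no remaining sentence improves coverage
            break
        selected.append(best)
        covered |= gain
        pool.remove(best)
    return selected, covered

# Toy example: two sentences suffice to cover all five phonemes.
lexicon = {"ka": ["k", "a"], "ta": ["t", "a"], "pi": ["p", "i"]}
sel, cov = greedy_select(["ka", "ka ta", "pi"], lexicon, {"k", "a", "t", "p", "i"})
```

Real corpus designs usually optimize over phoneme bigrams or triphones rather than single phonemes, but the greedy set-cover idea is the same.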
Acquisition of English retroflex vowel [3] by EFL learners from Chinese dialectal regions: A case study of Beijing and Changsha
Bin Li, Yuan Jia
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060843
This paper investigates, through an extensive acoustic analysis, the acquisition of the English retroflex vowel [3] by learners of English as a Foreign Language (EFL) from Beijing (BJ) and Changsha (CS), representative dialectal regions of northern and southern China, respectively. Formant values and duration were selected as analysis parameters. The results demonstrate that all the EFL learners in the study produced the onset and offset targets of [3] with a more backward tendency. In formant patterns, both native speakers and EFL learners show a similar tendency, namely a fall in F3 and a rise in F2. For CS speakers, F3 falls more slowly due to the influence of their mother tongue. Moreover, from the spectral perspective, the F3 change rate of male CS learners is significantly smaller than that of native speakers, whereas BJ learners, especially female learners, show more pronounced changes in F3 than native speakers. We further speculate that language background and gender can affect the acquisition of retroflex vowels.
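The "F3 change rate" compared in this study is, in essence, the slope of the F3 trajectory over time. A minimal sketch follows (a plain least-squares slope over a sampled formant track; the paper's exact measurement procedure may differ):

```python
def formant_slope(times_s, formant_hz):
    """Least-squares slope of a formant track, in Hz per second."""
    n = len(times_s)
    mt = sum(times_s) / n
    mf = sum(formant_hz) / n
    num = sum((t - mt) * (f - mf) for t, f in zip(times_s, formant_hz))
    den = sum((t - mt) ** 2 for t in times_s)
    return num / den

# A falling F3 track sampled every 50 ms: slope is -2000 Hz/s.
rate = formant_slope([0.0, 0.05, 0.10], [2700, 2600, 2500])
```

A smaller absolute slope, as reported for the CS male learners, means F3 descends more gradually toward the retroflex target.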
Phoneme-level speaking rate variation on waveform generation using GAN-TTS
Mayuko Okamato, S. Sakti, Satoshi Nakamura
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060845
The development of text-to-speech synthesis (TTS) systems continues to advance, and the naturalness of their generated speech has improved significantly. However, most current TTS systems learn from data within a deep learning framework and generate output at a monotonous speaking rate. In contrast, humans vary their speaking rate and tend to slow down to emphasize words, distinguishing the elements of focus in an utterance.
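The contrast drawn in this abstract, uniform machine pacing versus selective human slow-down, can be sketched at the level of phoneme durations. The representation below (phoneme/duration pairs, a set of emphasized phonemes, a stretch factor) is purely illustrative and is not the paper's GAN-TTS mechanism:

```python
def stretch_emphasis(phoneme_durations, emphasized, factor=1.5):
    """Lengthen the durations (seconds) of phonemes inside emphasized words."""
    return [(p, d * factor) if p in emphasized else (p, d)
            for p, d in phoneme_durations]

# Slow down the vowel of an emphasized word by 50%.
out = stretch_emphasis([("k", 0.06), ("a", 0.10)], emphasized={"a"})
```

In a neural TTS pipeline such per-phoneme duration targets would condition the duration model or vocoder rather than being applied after the fact.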
A Great Reduction of WER by Syllable Toneme Prediction for Thai Grapheme to Phoneme Conversion
S. Saychum, A. Rugchatjaroen, C. Wutiwiwatchai
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041212
Thai toneme prediction has been one of the greatest difficulties in Thai grapheme-to-phoneme conversion (G2P). This paper presents an improvement in the prediction of linguistic features in terms of tone rules. There will always be exceptions to these rules, for example the tones of loan words and transliterated words, which are usually adopted from the original language. This paper does not address the transliteration problem, but aims to show the success of a method that uses an automatic toneme predictor, based on the tone rules of Thai pronunciation, in a machine learning model; the predictor is attached to the final stage of grapheme-to-phoneme conversion. Furthermore, this work also explores end-to-end prediction using Long Short-Term Memory (LSTM) networks that take their input sequence from the National Electronic and Computer Technology Center's pseudo-syllable segmentation and alignment tool. An evaluation was conducted to show the success of the proposed system and to compare the results with our traditional end-to-end sequence-to-sequence G2P. The comparison shows that sequence-to-sequence modeling obtains the lowest word error rate, at 1.6%, and that the proposed system runs well on a small 2018-era device (Raspberry Pi).
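The tone rules such a predictor encodes map a syllable's initial-consonant class, live/dead status, and tone mark to a toneme. The table below is a simplified illustration covering only a subset of the standard Thai tone rules, from general reference knowledge rather than the paper itself:

```python
# (initial consonant class, syllable type, tone mark) -> toneme
# Partial table of standard Thai tone rules, for illustration only.
TONE_RULES = {
    ("mid",  "live",       None):      "mid",
    ("mid",  "dead",       None):      "low",
    ("high", "live",       None):      "rising",
    ("high", "dead",       None):      "low",
    ("low",  "live",       None):      "mid",
    ("low",  "dead-short", None):      "high",
    ("low",  "dead-long",  None):      "falling",
    ("mid",  "live",       "mai ek"):  "low",
    ("mid",  "live",       "mai tho"): "falling",
}

def predict_toneme(consonant_class, syllable_type, tone_mark=None):
    """Look up the toneme; None signals a case outside this toy table."""
    return TONE_RULES.get((consonant_class, syllable_type, tone_mark))
```

Loan words and transliterations are exactly the cases where such table lookups fail, which is why the paper combines rules with a learned model.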
Sequence-to-Sequence Models for Grapheme to Phoneme Conversion on Large Myanmar Pronunciation Dictionary
Aye Mya Hlaing, Win Pa Pa
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041225
Grapheme-to-phoneme (G2P) conversion produces the pronunciation of a given word. Neural sequence-to-sequence models have recently been applied to G2P conversion. This paper analyzes the effectiveness of neural sequence-to-sequence models for G2P conversion in the Myanmar language. The first large Myanmar pronunciation dictionary is introduced and applied to building sequence-to-sequence models. The performance of four G2P conversion models, a joint-sequence model, a Transformer, a simple encoder-decoder, and an attention-enabled encoder-decoder, is evaluated in terms of phoneme error rate (PER) and word error rate (WER). Analyses of three word classes and six phoneme error types are presented and discussed in detail. According to the evaluations, the Transformer achieves results comparable to the traditional joint-sequence model.
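PER and WER, the metrics used throughout these G2P comparisons, are both normalized edit distances: the minimum number of token substitutions, insertions, and deletions between hypothesis and reference, divided by the reference length. A sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """PER when tokens are phonemes, WER when tokens are words."""
    return edit_distance(ref, hyp) / len(ref)

# One deleted phoneme out of three reference phonemes: PER = 1/3.
per = error_rate(["k", "a", "t"], ["k", "t"])
```

For word-level WER on a G2P task, a word counts as wrong if any phoneme in its predicted pronunciation differs from the reference.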
index
Pub Date: 2019-10-01 | DOI: 10.1109/o-cocosda46868.2019.9041241
Annotation and preliminary analysis of utterance decontextualization in a multiactivity
Haruka Amatani, Yayoi Tanaka
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041203
How are conversations decontextualized from the here-and-now situation during a daily joint activity? More specifically, how are such (de)contextualized utterances associated with movements within the activity? Applying Cloran's [1] Rhetorical Units, we identified the degree of decontextualization of utterances with regard to their temporal and spatial distance from the ongoing situation. For the annotation of hand and body movements, we employed Kendon's [2] gesture phases. The association between speech and movement was then examined using the degrees of decontextualization and the movement phases. The results of this preliminary analysis suggest that participants tended to produce utterances with higher degrees of decontextualization while pausing their movements than while moving.
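The reported association, higher decontextualization while movements pause, amounts to comparing the distribution of decontextualization degrees across gesture phases. A toy aggregation over annotated pairs (the numeric degree scale and the phase labels here are assumptions, not the paper's coding scheme):

```python
from collections import defaultdict

def mean_degree_by_phase(annotations):
    """annotations: (movement_phase, decontextualization_degree) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for phase, degree in annotations:
        totals[phase][0] += degree
        totals[phase][1] += 1
    return {phase: s / n for phase, (s, n) in totals.items()}

# Utterances during holds score higher than those during strokes.
means = mean_degree_by_phase([("hold", 3), ("hold", 5), ("stroke", 1), ("stroke", 1)])
```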
Three-year-old children's production of native mandarin Chinese lexical tones
Ao Chen, Hintat Cheung, Yuchen Li, Liqun Gao
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060851
The current study investigated native Mandarin Chinese children's production of native lexical tones, in particular the low-rising tone (T2) and the low-dipping tone (T3), which are acoustically the most similar of the Mandarin lexical tones. Using a picture-naming task, ten 3-year-old children produced fourteen monosyllabic and disyllabic familiar words. Ten female adults performed the same task as a control group. Acoustic measurements of pitch values and pitch alignment were conducted to analyze whether the children used acoustic cues to distinguish T2 and T3 in an adult-like way, and whether the presence of tonal context in the disyllabic words influenced the acoustic implementation of T2 and T3. The results showed that, overall, the children exhibited adult-like pitch contours for T2 and T3; yet unlike the adults, who maintained the low feature of T3 for both pitch minimum and pitch maximum, the children tended to raise the pitch maximum, and consequently widen the pitch range, to allow for implementation of the complex pitch contour of T3. This increase is more evident for the disyllabic than for the monosyllabic words. These findings suggest that the presence of tonal context and tonal carry-over effects make it more demanding for children to realize the complex pitch contour of T3, and that they widen the pitch range to achieve this goal.
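A widened pitch range, as described for the children's T3, is conventionally measured in semitones between the pitch minimum and maximum of the contour. A sketch (the semitone formula is standard; the contour values below are invented):

```python
import math

def pitch_range_semitones(f0_track_hz):
    """Pitch range of an F0 contour, in semitones: 12 * log2(max / min)."""
    voiced = [f for f in f0_track_hz if f and f > 0]  # drop unvoiced frames (F0 = 0)
    return 12 * math.log2(max(voiced) / min(voiced))

# A contour dipping from 300 Hz to 150 Hz spans one octave: 12 semitones.
rng = pitch_range_semitones([300, 150, 0, 200])
```

Measuring range in semitones rather than Hz makes child and adult voices comparable despite their different absolute pitch levels.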
Characteristics of everyday conversation derived from the analysis of dialog act annotation
Yuriko Iseki, Keisuke Kadota, Yasuharu Den
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041235
This paper describes an attempt to identify the characteristics of everyday conversation through dialog act (DA) information. Although several earlier studies have discussed how to annotate DA information, few have used the resulting annotation as a clue for deriving the characteristics of conversation. We report on work annotating dialog act information on utterances in Japanese everyday conversation, and on the possibility of extracting interactional characteristics from that annotation. The analysis found that the annotation reflects differences in behaviour depending on the type of conversation and the participants' age. Moreover, even in conversations with similar settings, differences were found in the distribution of tags concerning interactional management. This suggests that the annotation may also reflect information that is difficult to capture objectively, such as the conversational atmosphere.
Recent Progress of Mandrain Spontaneous Speech Recognition on Mandrain Conversation Dialogue Corpus
Yu-Chih Deng, Yih-Ru Wang, Sin-Horng Chen, Chen-Yu Chiang
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041223
This paper presents a progress report on a relatively difficult ASR task on a spontaneous speech corpus, the Mandarin Conversational Dialogue Corpus (MCDC). A DNN-based acoustic model is constructed using the CLDNN structure and a large dataset comprising two spontaneous-speech corpora and one read-speech corpus. The study uses a large text dataset formed from seven corpora to train an efficient general language model (LM); two LMs adapted specifically for spontaneous speech recognition are also constructed. Experimental results show that the best performance on MCDC reached a character error rate (CER) of 26.3% and a word error rate (WER) of 32.5%, representing relative CER and WER reductions of 27.9% and 22.2% compared with the previous best HMM-based method. This confirms that the proposed method is promising for tackling Mandarin spontaneous speech recognition.
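The relative reductions quoted here (27.9% CER, 22.2% WER) relate the new error rates to the HMM baseline as (old - new) / old; working backwards, the implied HMM baseline is roughly 26.3 / (1 - 0.279) ≈ 36.5% CER. That baseline figure is inferred arithmetic, not a number stated in the abstract. A check:

```python
def relative_reduction(old_rate, new_rate):
    """Relative error reduction: the fraction of the old error removed."""
    return (old_rate - new_rate) / old_rate

def implied_baseline(new_rate, reduction):
    """Baseline implied by a new rate and its quoted relative reduction."""
    return new_rate / (1 - reduction)

# Recover the implied HMM CER, then confirm it reproduces the quoted 27.9%.
base = implied_baseline(26.3, 0.279)
```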