2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA) — Latest Publications
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384418
Kwang Myung Jeon, Nam Kyun Kim, Moon Ju Jo, H. Kim
An indoor noise database is essential for the development and assessment of distant speech recognition systems operating in indoor environments. This paper proposes a multi-channel indoor noise database. Each noise signal in the database was recorded using a four-channel linear microphone array located in one corner of the living room of a condominium. Noise sources were generated either by physical actions or by loudspeakers at various positions inside the condominium, covering five different TV contents and 28 indoor noise sources categorized as repeated, stationary, or moving. The database was then verified by measuring the direction of arrival of each recorded noise source, which showed that it is suitable for developing and evaluating multi-channel speech processing algorithms in noisy indoor environments.
Title: Design of multi-channel indoor noise database for speech processing in noise
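The paper verifies the recordings by estimating each source's direction of arrival. The abstract does not say which DOA method the authors used; a common approach for a microphone pair in a linear array is GCC-PHAT: estimate the inter-channel time delay from the phase-weighted cross-spectrum, then map the delay to an arrival angle under a far-field assumption. The sketch below illustrates that generic technique only (function names and the mic spacing are illustrative, not from the paper):

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay (seconds) of sig relative to ref via GCC-PHAT."""
    n = len(sig) + len(ref)                      # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift    # lag of the correlation peak
    return shift / fs

def doa_degrees(tau, mic_distance, c=343.0):
    """Map a pairwise delay to a far-field arrival angle for one mic pair."""
    arg = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(arg))
```

For a four-channel linear array, one would typically average or triangulate the angles obtained from several mic pairs.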
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384456
Quoc Bao Nguyen, Van Hai Do, Ba Quyen Dam, Minh Hung Le
In this paper, we first present our effort to collect an 85.8-hour corpus of Vietnamese conversational telephone speech from our Viettel call center. Various techniques, such as a time-delay deep neural network (TDNN) with sequence training and data augmentation, are then applied to build the speech recognition system. Our final system achieves a word error rate of 17.44% on this challenging corpus. To the best of our knowledge, this is the first attempt to build a Vietnamese corpus and speech recognition system for the customer service domain.
Title: Development of a Vietnamese speech recognition system for Viettel call center
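The 17.44% figure is a word error rate: the word-level Levenshtein distance (substitutions + deletions + insertions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal implementation of the standard metric (not the authors' scoring tool):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                              # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                              # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, one substitution plus one deletion against a four-word reference yields a WER of 0.5.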
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384473
Sunhee Kim
This paper presents a method for developing a corpus consisting of various categories of Non-Standard Words (NSWs) together with a representative test set for evaluating text normalization modules for Standard Mandarin and Taiwanese Mandarin. A total of 191,431 sentences with NSWs are extracted for Standard Mandarin and 731,524 sentences with NSWs for Taiwanese Mandarin. To build a representative test set, 1,000 sentences each for Standard Mandarin and Taiwanese Mandarin are randomly chosen from these sentences, maintaining the same proportions as the source corpus as well as a similar proportion of each NSW category.
Title: Corpus-based evaluation of Chinese text normalization
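Sampling a test set while preserving per-category proportions is stratified sampling: allocate each NSW category a quota proportional to its share of the source corpus, then sample uniformly within each category. The paper does not give its sampling code; a generic sketch of the idea (with rounding, the total can drift from k by a sentence or two):

```python
import random
from collections import defaultdict

def stratified_sample(sentences, categories, k, seed=0):
    """Draw roughly k sentences, with per-category quotas proportional
    to each category's share of the full collection."""
    by_cat = defaultdict(list)
    for sent, cat in zip(sentences, categories):
        by_cat[cat].append(sent)
    total = len(sentences)
    rng = random.Random(seed)
    sample = []
    for cat, items in sorted(by_cat.items()):
        quota = round(k * len(items) / total)     # proportional allocation
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample
```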
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384420
Weonhee Yun
The Seoul Corpus is a spontaneous speech corpus of Seoul Korean, fully segmented with several levels of annotation in the Praat TextGrid format. A total of 40 speakers, balanced for age and sex, participated in the recordings. Each was interviewed on various topics for an hour, and the recordings were labeled first by forced alignment using HTK and then fine-tuned by human labelers. About 220,000 phrasal words are included, and 1,135,263 phoneme tokens were labeled. The corpus has already been distributed to the research community free of charge.
Title: The Seoul Corpus, spontaneous speech in Seoul Korean
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384462
JeeSok Lee, S. Rhee
The aim of this study is to compare aspects of the stop consonants [b], [d], and [g], particularly the occurrence of vocal fold vibration during the stop closure, as produced by native speakers of English and by Korean EFL speakers. We examine whether stop voicing in onset and coda positions is influenced by the place of articulation. Based on K-SEC (Korean-Spoken English Corpus), the analysis uses i) Korean speakers' productions of isolated words that have the voiced stops [b], [d], and [g] as onsets, followed by six different vowels [i], [e], [s], [a], [o], and [u], and ii) the same voiced stops as codas, preceded by the aforementioned vowels. Aspects of initial and final stop voicing produced by native speakers are also analyzed and then compared with those of the Korean learners of English.
Title: The aspects of stop voicing in L1 and Korean-spoken L2 Englishes in regards to the place of articulation
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384454
Yoshiko Kawabata, Toshihiko Matsuka, Yasuharu Den
The present study examined how four well-known particles of Japanese conditional clauses, namely TARA, TO, BA, and NARA, are actually used, by analyzing the Japanese Map Task Dialogue Corpus. We found clear differences in how they are used; in particular, different particles are used to refer to different contents of the main clauses. We argue that these differences arise from differences in the knowledge that speakers try to share with hearers, and we propose discourse functions of the particles on that basis.
Title: On the usages of conditional clauses in Japanese maptask dialogue
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384450
Y. Liao, Y. Chang, Sing-Yue Wang, Jhih-wei Chen, Sheng-Ming Wang, Jenq-Haur Wang
The Taiwan Mandarin Radio Speech Corpus contains 300 (and growing) hours of high-quality recordings selected from the archive of Taiwan's National Education Radio (NER). The corpus features speech of various speaking styles, produced by hundreds of speakers, with corresponding transcriptions (automatically transcribed and manually corrected) and annotations, making it suitable for speech and language research. In this paper, we report the progress of the corpus development and, in particular, present experimental results on audio event detection/segmentation and semi-supervised acoustic model training on this corpus.
Title: A progress report of the Taiwan Mandarin radio speech corpus project
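The abstract mentions audio event detection/segmentation but not the method. A common baseline for segmenting broadcast audio is frame-level log-energy thresholding with runs of active frames merged into segments; the sketch below shows that baseline only (frame sizes and the threshold are illustrative assumptions, not values from the paper):

```python
import numpy as np

def energy_segments(signal, fs, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Mark frames whose log energy exceeds a threshold and merge
    consecutive active frames into (start_sec, end_sec) segments."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    segs, start = [], None
    for i in range(0, len(signal) - frame + 1, hop):
        energy = np.mean(signal[i:i + frame] ** 2)
        db = 10 * np.log10(energy + 1e-12)       # avoid log(0) on silence
        if db > threshold_db and start is None:
            start = i                            # segment opens
        elif db <= threshold_db and start is not None:
            segs.append((start / fs, i / fs))    # segment closes
            start = None
    if start is not None:
        segs.append((start / fs, len(signal) / fs))
    return segs
```

Real broadcast segmenters typically add smoothing, minimum-duration constraints, and a classifier to label each segment (speech, music, jingle, etc.).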
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384467
Pipin Kurniawati, D. Lestari, M. L. Khodra
This paper describes our work extending previous work on emotion recognition for spoken Indonesian. We construct an Indonesian emotional corpus (IDEC), targeting natural emotional occurrences in television talk shows. IDEC is used to build an emotion recognizer based on two main feature types, acoustic and lexical. Support Vector Machine (SVM), Random Forest (RF), and Multinomial Naive Bayes (MNB) algorithms are employed to model the emotions. Experimental results show that SVM outperforms RF and MNB, achieving an average F-measure of 0.713 over 6 emotion classes when combining acoustic and lexical features.
Title: Speech emotion recognition from Indonesian spoken language using acoustic and lexical features
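Combining acoustic and lexical features usually means early fusion: concatenating per-utterance acoustic statistics with a lexical vector (e.g. bag of words) into one feature vector before classification. The sketch below shows that fusion idea with a deliberately simple nearest-centroid classifier standing in for the paper's SVM/RF/MNB models; all names and the toy vocabulary are illustrative:

```python
import numpy as np

def fuse_features(acoustic, text, vocab):
    """Concatenate acoustic statistics with a bag-of-words lexical vector."""
    lex = np.zeros(len(vocab))
    for word in text.split():
        if word in vocab:
            lex[vocab[word]] += 1.0
    return np.concatenate([np.asarray(acoustic, dtype=float), lex])

def nearest_centroid(train_x, train_y, x):
    """Classify x by Euclidean distance to per-class mean feature vectors."""
    labels = sorted(set(train_y))
    centroids = {c: np.mean([v for v, y in zip(train_x, train_y) if y == c],
                            axis=0)
                 for c in labels}
    return min(labels, key=lambda c: np.linalg.norm(x - centroids[c]))
```

In practice, one would replace the classifier with an SVM and normalize the two feature groups so neither dominates the fused vector.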
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384458
Minghui Zhang, Fang Hu
This paper gives an acoustic-phonetic description of the diphthongized vowels in the Xiuning Hui Chinese dialect in terms of temporal structure, spectral properties, and dynamics. The results suggest that diphthongized vowels in Xiuning function as an intermediate vowel category between monophthongs and diphthongs. Comparisons among the Xiuning, Yi County Hui, and Qimen Hui cases reveal that the process of diphthongization is gradient in Hui dialects.
Title: Diphthongized vowels in the Xiuning Hui Chinese Dialect
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384463
Jiahong Yuan, Hongwei Ding, Sishi Liao, Yuqing Zhan, M. Liberman
This paper describes an effort to build a TIMIT-like corpus of Standard Chinese as part of our "Global TIMIT" project. Three steps are detailed in the paper: selection of sentences; speaker recruitment and recording; and phonetic segmentation. The corpus consists of 6,000 sentences read by 50 speakers (25 female and 25 male). Phonetic segmentation obtained by forced alignment is provided; on 50 randomly selected sentences, 93.2% of its phone boundaries agree with manual segmentation within 20 ms. Statistics on the number of tokens and the mean duration of phones and tones in the corpus are also reported. Males have shorter phones/tones but more and longer utterance-internal silences than females, indicating that males in this dataset speak faster but pause more frequently and longer.
Title: Chinese TIMIT: A TIMIT-like corpus of standard Chinese
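The 93.2%-within-20-ms figure is a standard boundary-agreement metric: the fraction of manually placed phone boundaries for which the forced aligner placed a boundary within the tolerance. A minimal version of that metric (the authors' exact matching procedure is not specified in the abstract):

```python
def boundary_agreement(auto_bounds, manual_bounds, tol=0.020):
    """Fraction of manual boundaries (in seconds) that have an
    automatically aligned boundary within tol seconds."""
    hits = sum(
        any(abs(a - m) <= tol for a in auto_bounds)
        for m in manual_bounds
    )
    return hits / len(manual_bounds)
```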