2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA): Latest Publications
Development of a multilingual isolated digits speech corpus
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384452
Emmanuel Malaay, Michael Simora, R. J. Cabatic, Nathaniel Oco, R. Roxas
We present a multilingual speech corpus for isolated digits. As a case study, we focused on languages spoken in the Philippines: English, Filipino, Ilocano, Cebuano, and Spanish. The corpus has a duration of almost nine hours, collected from 262 speakers. The data were annotated at the word level and will be used to train acoustic models with ASR toolkits. The corpus is intended for developing an automatic speech recognition (ASR) system, so the database must be sufficient to support that development.
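The abstract does not give the annotation file format; purely as a sketch, assume each word-level label is a line of the form `utterance_id language digit start_sec end_sec` (this layout and the filename below are hypothetical), which makes sanity checks such as per-language duration a few lines of Python:

```python
from collections import defaultdict

def summarize_annotations(path):
    """Sum annotated digit durations per language from a word-level label file.

    Assumed (hypothetical) line format:
        utterance_id language digit start_sec end_sec
    """
    totals = defaultdict(float)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            utt_id, lang, digit, start, end = line.split()
            totals[lang] += float(end) - float(start)
    return dict(totals)

# Example: print hours of annotated digits per language.
for lang, seconds in summarize_annotations("digit_labels.txt").items():
    print(f"{lang}: {seconds / 3600:.2f} h")
```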
{"title":"Development of a multilingual isolated digits speech corpus","authors":"Emmanuel Malaay, Michael Simora, R. J. Cabatic, Nathaniel Oco, R. Roxas","doi":"10.1109/ICSDA.2017.8384452","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384452","url":null,"abstract":"We present a multilingual speech corpus for isolated digits. As case study, we focused on languages in the Philippines: English, Filipino, Ilocano, Cebuano, and Spanish. Our isolated digits speech corpus has a duration of almost nine hours, collection from 262 speakers. These data were word- level annotated and will be used to train the acoustic models using the ASR toolkits. The corpus will be used for an automatic speech recognition (ASR) system and therefore the database must be sufficient to develop an ASR system.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121291966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Creation of a multi-paraphrase corpus based on various elementary operations
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384465
Johanes Effendi, S. Sakti, Satoshi Nakamura
Paraphrases resemble monolingual translations: a source sentence is rewritten as other sentences that must preserve the original meaning. Building an automatic paraphrasing system requires a collection of paraphrased expressions, but collecting paraphrases manually is expensive and time-consuming. Most existing paraphrase corpora cover only one-to-one parallel sentences and neglect the fact that many paraphrase variants can be generated from a single source sentence. The manipulations applied to the original sentences are also difficult to track. Furthermore, a single corpus is usually dedicated to one application and is not reusable in others. In this research, we construct a paraphrase corpus based on elementary operations (reordering, substitution, deletion, insertion) using a crowdsourcing platform to generate multiple paraphrases from each source sentence. These elementary paraphrase operations can serve various applications (e.g., deletion for summarization and reordering for machine translation). Our evaluations show the richness and effectiveness of the created corpus.
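The abstract names four elementary operations without showing them; here is a minimal sketch of each applied to a tokenized sentence (the function names and example sentence are ours, not the paper's):

```python
def reorder(tokens, i, j):
    """Swap the tokens at positions i and j."""
    out = tokens[:]
    out[i], out[j] = out[j], out[i]
    return out

def substitute(tokens, i, new_token):
    """Replace the token at position i."""
    return tokens[:i] + [new_token] + tokens[i + 1:]

def delete(tokens, i):
    """Remove the token at position i."""
    return tokens[:i] + tokens[i + 1:]

def insert(tokens, i, new_token):
    """Insert a token before position i."""
    return tokens[:i] + [new_token] + tokens[i:]

source = "the quick brown fox jumps".split()
# Each operation yields a candidate paraphrase that a crowd worker
# would then validate for meaning preservation.
print(reorder(source, 1, 2))
print(substitute(source, 1, "fast"))
print(delete(source, 2))
print(insert(source, 4, "suddenly"))
```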
{"title":"Creation of a multi-paraphrase corpus based on various elementary operations","authors":"Johanes Effendi, S. Sakti, Satoshi Nakamura","doi":"10.1109/ICSDA.2017.8384465","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384465","url":null,"abstract":"Paraphrases resemble monolingual translations from a source sentence into other sentences that must preserve the original meaning. To build automatic paraphrasing, a collection of paraphrased expressions is required. However, manually collecting paraphrases is expensive and time-consuming. Most existing paraphrases corpora cover only one-to-one parallel sentences and neglect the fact that possible variants of paraphrases can be generated from a single source sentence. The manipulation applied to the original sentences is also difficult to track. Furthermore, a single corpus is mostly dedicated to a single application that is not reusable in other applications. In this research, we construct a paraphrase corpus based on various elementary operations (reordering, substitution, deletion, insertion) in a crowdsourcing platform to generate multi- paraphrase sentences from a source sentence. These elementary paraphrase operations can be utilized for various applications (i.e., deletion for summarization and reordering for machine translation). Our evaluations show the richness and effectiveness of our created corpus.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124202879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A standardization program of speech corpus collection
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384471
Zhigang Yin, Ai-jun Li
Speech corpora are the basis of linguistic research and natural language processing. To make speech corpora easier to collect, use, and share, a standardization scheme for speech corpus projects is needed. This paper proposes a standardization program that covers all aspects of data collection, annotation, and distribution, and it also introduces the specifications for constructing a speech corpus. Finally, a telephone speech corpus, TSC973, is used as an example to illustrate the standardization program.
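The paper's actual specification is not reproduced in the abstract; as a hedged illustration only, a standardized corpus might ship a machine-readable record grouping metadata under the three aspects the program names (collection, annotation, distribution). Every field name and value below is an assumption:

```python
import json

# Hypothetical metadata record for one corpus release, grouped by the
# three aspects the standardization program covers.
corpus_record = {
    "collection": {
        "corpus_name": "TSC973",
        "channel": "telephone",          # per the abstract; other fields assumed
        "sampling_rate_hz": 8000,        # typical telephone rate, assumed here
        "num_speakers": 0,               # fill in from the recording logs
    },
    "annotation": {
        "levels": ["utterance", "word"],
        "transcription_encoding": "UTF-8",
    },
    "distribution": {
        "license": "research-only",
        "format": "wav + text",
    },
}

print(json.dumps(corpus_record, indent=2, ensure_ascii=False))
```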
{"title":"A standardization program of speech corpus collection","authors":"Zhigang Yin, Ai-jun Li","doi":"10.1109/ICSDA.2017.8384471","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384471","url":null,"abstract":"The speech corpus is the basis of linguistic research and natural language processing. In order to make the speech corpus be collected more efficiently and be used or shared easier, it is necessary to develop the standardization scheme for speech corpus project. This paper tries to provide a standardization program that covers all aspects of data collection, annotation, and distribution. The specifications of constructing a speech corpus are also introduced in the paper. Finally, a telephone speech corpus, TSC973, be exemplified to illuminate the standardization program.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122930730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiresolution CNN for reverberant speech recognition
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384470
Sunchan Park, Yongwon Jeong, H. S. Kim
The performance of automatic speech recognition (ASR) has been greatly improved by deep neural network (DNN) acoustic models. However, DNN-based systems still perform poorly in reverberant environments. Convolutional neural network (CNN) acoustic models have shown a lower word error rate (WER) in distant speech recognition than fully connected DNN acoustic models. To improve reverberant speech recognition with CNN acoustic models, we propose a multiresolution CNN with two separate streams: one processes wideband features with a wide context window, and the other processes narrowband features with a narrow context window. Experiments on the ASR task of the REVERB Challenge 2014 showed that the proposed multiresolution CNN reduced the WER by 8.79% and 8.83% on the simulated and real-condition test data, respectively, compared with a conventional CNN-based method.
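The abstract does not give layer sizes, so the following PyTorch sketch shows only the two-stream idea: a wide-context stream and a narrow-context stream whose pooled outputs are concatenated before classification. All dimensions, context widths, and the choice of library are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiResolutionCNN(nn.Module):
    """Two-stream CNN sketch: one stream sees a wide context window,
    the other a narrow one; their outputs are merged before the output
    layer. All sizes are illustrative, not the paper's."""

    def __init__(self, n_mels=40, n_states=2000):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
                nn.Flatten(),
            )
        self.wide, self.narrow = stream(), stream()
        self.classifier = nn.Sequential(
            nn.Linear(2 * 64 * 4 * 4, 1024), nn.ReLU(),
            nn.Linear(1024, n_states),
        )

    def forward(self, wide_feats, narrow_feats):
        # wide_feats:   (batch, 1, wide_context_frames, n_mels)
        # narrow_feats: (batch, 1, narrow_context_frames, n_mels)
        merged = torch.cat(
            [self.wide(wide_feats), self.narrow(narrow_feats)], dim=1)
        return self.classifier(merged)

model = MultiResolutionCNN()
scores = model(torch.randn(8, 1, 21, 40), torch.randn(8, 1, 9, 40))
print(scores.shape)  # torch.Size([8, 2000])
```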
{"title":"Multiresolution CNN for reverberant speech recognition","authors":"Sunchan Park, Yongwon Jeong, H. S. Kim","doi":"10.1109/ICSDA.2017.8384470","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384470","url":null,"abstract":"The performance of automatic speech recognition (ASR) has been greatly improved by deep neural network (DNN) acoustic models. However, DNN-based systems still perform poorly in reverberant environments. Convolutional neural network (CNN) acoustic models showed lower word error rate (WER) in distant speech recognition than fully-connected DNN acoustic models. To improve the performance of reverberant speech recognition using CNN acoustic models, we propose the multiresolution CNN that has two separate streams: one is the wideband feature with wide-context window and the other is the narrowband feature with narrow-context window. The experiments on the ASR task of the REVERB challenge 2014 showed that the proposed multiresolution CNN based approach reduced the WER by 8.79% and 8.83% for the simulated test data and the real-condition test data, respectively, compared with the conventional CNN based method.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131555455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling of linguistic and acoustic information from speech signal for multilingual spoken language identification system (SLID)
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384468
S. Bansal, S. Agrawal
Spoken language identification (SLID) is the task of identifying the language of a given speech signal. Efforts to develop language identification systems for Indian languages have been limited by speaker availability and language legibility, while the demand for SLID in civil and defense applications grows daily. This paper reports a study on developing a multilingual identification system for two Indian languages, Hindi and Manipuri, using the PPRLM approach, which requires a phoneme-level labeled speech corpus for each language. For each language, a data set of 300 phonetically rich sentences spoken by 25 native speakers (15,000 utterances) was recorded, analyzed, and annotated phonemically to build a trigram-based phonotactic model. Features were extracted from the speech signal using MFCCs, and a GMM was used as the classifier. Results show that accuracy increases with the number of Gaussians and with the amount of training data.
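The abstract describes the acoustic side as MFCC features scored by per-language GMMs; the sketch below implements just that half (the PPRLM phonotactic stream is omitted), with librosa and scikit-learn as our own library choices and all parameters illustrative:

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=13):
    """Return (frames, n_mfcc) MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_language_gmms(train_files, n_components=64):
    """Fit one GMM per language on pooled MFCC frames.

    train_files: e.g. {"hindi": [paths...], "manipuri": [paths...]}
    """
    models = {}
    for lang, paths in train_files.items():
        frames = np.vstack([mfcc_frames(p) for p in paths])
        models[lang] = GaussianMixture(
            n_components=n_components, covariance_type="diag"
        ).fit(frames)
    return models

def identify(models, wav_path):
    """Pick the language whose GMM gives the highest average log-likelihood."""
    frames = mfcc_frames(wav_path)
    return max(models, key=lambda lang: models[lang].score(frames))
```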
{"title":"Modeling of linguistic and acoustic information from speech signal for multilingual spoken language identification system (SLID)","authors":"S. Bansal, S. Agrawal","doi":"10.1109/ICSDA.2017.8384468","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384468","url":null,"abstract":"Spoken language identification is the task of identifying a language from the given speech signal. Efforts to develop language identification systems for Indian languages have been very limited due to the problem of speaker availability and language legibility but the requirement of SLID is increasing for civil and defense applications day by day. The present paper reports a study to develop a multilingual identification system for two Indian languages i.e. Hindi and Manipuri by using PPRLM approach that requires phoneme based labeled speech corpus for each language. For each language, data set of 300 phonetically rich sentences spoken by 25 native speakers (15000 utterances) were recorded, analyzed and annotated phonemically to make trigram based phonotactic model. The features of the speech signal have been extracted using MFCCs and GMM was used as a classifier. Results show that accuracy increases with the increase of Gaussians and also with the training samples.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127592311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
Pub Date: 2017-09-16 | DOI: 10.1109/ICSDA.2017.8384449
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, Hao Zheng
We release an open-source Mandarin speech corpus called AISHELL-1. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including the audio capture devices and recording environments, is presented in detail. The preparation of the related resources, including transcriptions and the lexicon, is also described. The corpus is released with a Kaldi recipe. Experimental results imply that the quality of the audio recordings and transcriptions is promising.
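As a usage sketch, AISHELL-1 transcripts ship as plain-text lines pairing an utterance ID with its Mandarin text; the transcript path below follows the public release naming and may differ by version:

```python
def load_transcripts(path):
    """Parse AISHELL-1-style transcripts: each line holds an utterance ID
    followed by whitespace-separated Mandarin words."""
    transcripts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            utt_id, *words = line.split()
            transcripts[utt_id] = "".join(words)
    return transcripts

# Path as named in the public release; adjust for your local copy.
trans = load_transcripts("transcript/aishell_transcript_v0.8.txt")
print(len(trans), "utterances loaded")
```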
{"title":"AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline","authors":"Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, Hao Zheng","doi":"10.1109/ICSDA.2017.8384449","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384449","url":null,"abstract":"An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus which is suitable for conducting the speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including audio capturing devices and environments are presented in details. The preparation of the related resources, including transcriptions and lexicon are described. The corpus is released with a Kaldi recipe. Experimental results implies that the quality of audio recordings and transcriptions are promising.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122518119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phone-aware neural language identification
Pub Date: 2017-05-09 | DOI: 10.1109/ICSDA.2017.8384445
Zhiyuan Tang, Dong Wang, Yixiang Chen, Ying Shi, Lantian Li
Purely acoustic neural models, particularly the LSTM-RNN, have shown great potential in language identification (LID). However, phonetic information has been largely overlooked by most existing neural LID models, although it has been used with great success in conventional phonetic LID systems. We present a phone-aware neural LID architecture: a deep LSTM-RNN LID system that additionally accepts output from an RNN-based ASR system. By utilizing this phonetic knowledge, LID performance can be significantly improved. Interestingly, even when the test language is not included in the ASR training, the phonetic knowledge still makes a large contribution. Our experiments on four languages from the Babel corpus demonstrate that the phone-aware approach is highly effective.
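The abstract says the LID network accepts output from an RNN-based ASR system; one plausible reading, sketched below in PyTorch, concatenates per-frame acoustic features with ASR phone posteriors before a deep LSTM. The exact coupling in the paper may differ, and all dimensions here are assumptions:

```python
import torch
import torch.nn as nn

class PhoneAwareLID(nn.Module):
    """Sketch: a deep LSTM LID model whose per-frame input is the acoustic
    feature vector concatenated with phone posteriors from an ASR front end.
    All dimensions are illustrative."""

    def __init__(self, acoustic_dim=40, phone_dim=100, hidden=256,
                 n_layers=2, n_languages=4):
        super().__init__()
        self.lstm = nn.LSTM(acoustic_dim + phone_dim, hidden,
                            num_layers=n_layers, batch_first=True)
        self.output = nn.Linear(hidden, n_languages)

    def forward(self, acoustic, phone_post):
        # acoustic:   (batch, frames, acoustic_dim)
        # phone_post: (batch, frames, phone_dim), e.g. ASR softmax outputs
        x = torch.cat([acoustic, phone_post], dim=-1)
        out, _ = self.lstm(x)
        # Score the utterance from the final frame's hidden state.
        return self.output(out[:, -1, :])

model = PhoneAwareLID()
scores = model(torch.randn(2, 300, 40), torch.randn(2, 300, 100))
print(scores.shape)  # torch.Size([2, 4])
```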
{"title":"Phone-aware neural language identification","authors":"Zhiyuan Tang, Dong Wang, Yixiang Chen, Ying Shi, Lantian Li","doi":"10.1109/ICSDA.2017.8384445","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384445","url":null,"abstract":"Pure acoustic neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID). However, the phonetic information has been largely overlooked by most of existing neural LID models, although this information has been used in the conventional phonetic LID systems with a great success. We present a phone- aware neural LID architecture, which is a deep LSTM-RNN LID system but accepts output from an RNN-based ASR system. By utilizing the phonetic knowledge, the LID performance can be significantly improved. Interestingly, even if the test language is not involved in the ASR training, the phonetic knowledge still presents a large contribution. Our experiments conducted on four languages within the Babel corpus demonstrated that the phone-aware approach is highly effective.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"252 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124163995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}