2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA): Latest Publications
ConPro: Heteronym pronunciation corpus with context information for text-to-phoneme evaluation in Thai
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384421
C. Hansakunbuntheung, Sumonmas Thatphithakkul
Heteronyms, texts with multiple pronunciations depending on their context, are a crucial problem in text-to-phoneme conversion. Conventional pronunciation corpora that collect only grapheme-phoneme pairs are not sufficient to evaluate the heteronym issue. Furthermore, in languages without word breaks, e.g. Thai, orthographic groups with multiple possible word segmentations are another major cause of ambiguous pronunciations. This paper therefore proposes the "ConPro" corpus, a context-dependent pronunciation corpus of Thai heteronyms with systematic collection and context information for evaluating the accuracy of text-to-phoneme conversion. The key points of the corpus design are 1) the multiple-word orthographic group as the basic unit, 2) pragmatic and compact contextual texts as evaluation texts, 3) Categorial Matrix tags that represent the orthographic types and usage domains of orthographic groups and support investigation of problem categories in text-to-phoneme conversion, and 4) pronunciation-and-meaning-prioritized heteronym collection to extend the coverage of heteronyms and contexts.
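The abstract does not specify how corpus entries are stored, but as a rough illustration of the design described above, the sketch below shows one hypothetical way a context-dependent heteronym entry (orthographic group, context text, pronunciation, meaning, and Categorial Matrix tags) could be represented. All field names, the Thai example, and the romanized pronunciations are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical sketch of a ConPro-style corpus entry; field names and example
# values are illustrative assumptions, not the authors' actual data format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CategorialMatrixTag:
    orthographic_type: str   # e.g. "common word", "abbreviation", "numeral"
    usage_domain: str        # e.g. "general", "news"

@dataclass
class HeteronymEntry:
    orthographic_group: str                  # multiple-word orthographic group (basic unit)
    context_text: str                        # compact contextual sentence used for evaluation
    pronunciation: str                       # romanized here; illustrative only
    meaning: str                             # gloss that disambiguates this reading
    tags: List[CategorialMatrixTag] = field(default_factory=list)

# One orthographic group maps to several entries, one per context-dependent reading.
# "ตากลม" is a well-known Thai segmentation ambiguity: ตา|กลม vs ตาก|ลม.
entries = [
    HeteronymEntry("ตากลม", "<sentence where it reads ตา|กลม>", "taa klom",
                   "round eyes", [CategorialMatrixTag("common word", "general")]),
    HeteronymEntry("ตากลม", "<sentence where it reads ตาก|ลม>", "taak lom",
                   "to take fresh air", [CategorialMatrixTag("common word", "general")]),
]
```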
{"title":"ConPro: Heteronym pronunciation corpus with context information for text-to-phoneme evaluation in Thai","authors":"C. Hansakunbuntheung, Sumonmas Thatphithakkul","doi":"10.1109/ICSDA.2017.8384421","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384421","url":null,"abstract":"Heteronyms, which are texts with multiple pronunciations depending on their contexts, is a crucial problem in text-to- phoneme conversion. Conventional pronunciation corpora that collect only grapheme-phoneme pairs are not enough to evaluate the heteronym issue. Furthermore, in no-word- break languages e.g. Thai, the issue of orthographic groups with multiple possible word segmentation is another major cause of ambiguous pronunciations. Thus, this paper proposes \"ConPro\" corpus, a context-dependent pronunciation corpus of Thai heteronyms with systematic collection and context information for evaluating the accuracy of text-to-phoneme conversions. The keys of the corpus design include 1) multiple-word orthographic group as the basic unit, 2) pragmatic and compact contextual texts as evaluating texts, 3) Categorial Matrix tags for representing orthographic types and usage domains of orthographic groups, and, investigating problem categories in text-to-phoneme conversions, and, 4) pronunciation-and- meaning-prioritized heteronym collecting for extending the coverage of heteronyms and contexts.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"373-375 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117215279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M2ASR: Ambitions and first year progress
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384469
Dong Wang, T. Zheng, Zhiyuan Tang, Ying Shi, Lantian Li, Shiyue Zhang, Hongzhi Yu, Guanyu Li, Shipeng Xu, A. Hamdulla, Mijit Ablimit, Gulnigar Mahmut
In spite of the rapid development of speech technologies, most of the present achievements are for a few major languages, e.g., English and Chinese. Unfortunately, most of the world's languages are 'minority languages', in the sense that they are spoken by small populations and have limited accumulated resources. Since present speech technologies are mostly based on big data, partly due to the profound impact of deep learning, they are not directly applicable to minority languages. However, minority languages are so numerous and important that if we want to break the language barrier, they must be taken seriously into account. Recently, the Chinese government approved a fundamental research project on minority languages in China: Multilingual Minorlingual Automatic Speech Recognition (M2ASR). Although the initial goal was speech recognition, the ambition of this project goes further: it intends to publish all its achievements and make them free for the research community, including speech and text corpora, phone sets, lexicons, tools, recipes and prototype systems. In this paper, we describe this project, report its first-year progress, and present the future plan.
{"title":"M2ASR: Ambitions and first year progress","authors":"Dong Wang, T. Zheng, Zhiyuan Tang, Ying Shi, Lantian Li, Shiyue Zhang, Hongzhi Yu, Guanyu Li, Shipeng Xu, A. Hamdulla, Mijit Ablimit, Gulnigar Mahmut","doi":"10.1109/ICSDA.2017.8384469","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384469","url":null,"abstract":"In spite of the rapid development of speech techniques, most of the present achievements are for a few major languages, e.g., English and Chinese. Unfortunately, most of the languages in the world are 'minority languages', in the sense that they are spoken by a small population and with limited resource accumulation. Since the present speech technologies are mostly based on big data, partly due to the profound impact of deep learning, they are not directly applicable to minority languages. However, minority languages are so numerous and important that if we want to break the language barrier, they must be seriously taken into account. Recently, the Chinese government approved a fundamental research for minority languages in China: Multilingual Minorlingual Automatic Speech Recognition (M2ASR). Although the initial goal was speech recognition, the ambition of this project is more than that: it intends to publish all the achievements and make them free for the research community, including speech and text corpora, phone sets, lexicons, tools, recipes and prototype systems. In this paper, we will describe this project, report the first-year progress, and present the future plan.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131085703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Developing a speech corpus from web news for Myanmar (Burmese) language
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384451
Aye Nyein Mon, Win Pa Pa, Ye Kyaw Thu, Y. Sagisaka
A speech corpus is important for statistical-model-based automatic speech recognition, and it strongly influences the performance of a speech recognizer. Although speech corpora for resource-rich languages such as English are widely available and easy to use, there is no freely available Myanmar speech corpus for automatic speech recognition (ASR) research, since Myanmar is a low-resource language. This paper presents the design and development of a Myanmar speech corpus for the news domain, intended for convolutional neural network (CNN)-based Myanmar continuous speech recognition research. The corpus consists of 20 hours of read speech collected from online web news and covers 178 speakers (126 female and 52 male). It is evaluated on two test sets: TestSet1 (web data) and TestSet2 (news recordings by 10 native speakers). Using a CNN-based model, the word error rate (WER) is 24.73% on TestSet1 and 22.95% on TestSet2.
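For reference, the word error rates quoted above follow the standard definition WER = (substitutions + deletions + insertions) / number of reference words. The minimal sketch below computes it with a word-level edit distance; it is a generic illustration of the metric, not code from the paper.

```python
# Standard word error rate (WER) via dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the news was read today", "the news is read today"))  # 0.2 -> 20% WER
```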
{"title":"Developing a speech corpus from web news for Myanmar (Burmese) language","authors":"Aye Nyein Mon, Win Pa Pa, Ye Kyaw Thu, Y. Sagisaka","doi":"10.1109/ICSDA.2017.8384451","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384451","url":null,"abstract":"Speech corpus is important for statistical model based automatic speech recognition and it reflects the performance of a speech recognizer. Although most of the speech corpora for resource-riched languages such as English are widely available and it can be used easily, there is no Myanmar speech corpus which is freely available for automatic speech recognition (ASR) research since Myanmar is a low resource language. This paper presents the design and development of Myanmar speech corpus for the news domain to be applied to convolutional neural network (CNN)-based Myanmar continuous speech recognition research. The speech corpus consists of 20 hours read speech data collected from online web news and there are 178 speakers (126 females and 52 males). Our speech corpus is evaluated on two test sets: TestSet1 (web data) and TestSet2 (news recording with 10 natives). Using CNN-based model, word error rate (WER) achieves 24.73% on TestSet1 and 22.95% on TestSet2.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124910393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How prosodic cues could lead to information center in speech - An alternative to ASR
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384443
Chao-yu Su, Chiu-yu Tseng
It has been reported in the ASR literature that prosody helps retrieve important textual information at the word level. We therefore believe that prosodic information in the speech signal could be used to facilitate speech processing more directly. The prosodic word, a perceptually identifiable unit that is usually slightly larger than the lexical word, is a possible alternative unit for locating important information in speech. We compare acoustic analyses of perceived prosodic highlights within prosodic words against labeled semantic foci at the word level. The results show that prosodic highlights occur before the targeted key information and function as advance prompts that outline upcoming semantic foci ahead of time. The semantic saliency of targeted words is thus enhanced beforehand, and correct anticipation is facilitated prior to detailed lexical processing. A further experiment on automatic identification of key content from prosodic features also shows that important information can be retrieved through prosodic words. We believe these results demonstrate that not all information in speech is equally important, that locating the information center is key to speech communication, and that the contribution of prosody is critical.
{"title":"How prosodic cues could lead to information center in speech - An alternative to ASR","authors":"Chao-yu Su, Chiu-yu Tseng","doi":"10.1109/ICSDA.2017.8384443","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384443","url":null,"abstract":"It has been reported in ASR literature that prosody helps retrieve important textual information by word. We therefore believe that prosodic information in the speech signal could be used to facilitate speech processing more directly. The prosodic word, a perceptually identifiable unit which is usually slightly larger in size than lexical word, can be a possible alternative to help locate important information in speech. Acoustic analysis across labels of perceived prosodic highlighted part in prosodic words and semantic foci in words are compared. The results demonstrate that prosodic highlights occur before targeted key information and function as advanced prompts to outline upcoming sematic foci ahead of time. Semantic saliency of targeted words are thus enhanced beforehand while correct anticipation can be facilitated prior to detailed lexical processing. Further automatic identification approach of key content by prosodic features also shows the possibility to retrieve important information through prosodic words. We believe the results demonstrate that not all information is equally important in speech, locating information center is the key to speech communication, and the contribution of prosody is critical.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116784361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic analysis of vowels in five low resource north East Indian languages of Nagaland
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384460
Joyanta Basu, T. Basu, Soma Khan, Madhab Pal, Rajib Roy, M. S. Bepari, Sushmita Nandi, T. Basu, Swanirbhar Majumder, S. Chatterjee
This paper describes an acoustic analysis of vowels in five low resource languages of Nagaland in North-Eastern India, namely Nagamese, Ao, Lotha, Sumi and Angami. Six major vowels (/u/, /o/, /a/, /a/, /e/, /i/) are studied to establish characteristic features of these languages from read speech. Vowel duration and the first, second and third formants, i.e. F1, F2 and F3, are investigated and analyzed for each language. Using this vowel knowledge, a small language identification module has been developed and tested on unseen samples of the above languages. Results show that, compared with using F1, F2 and vowel duration only, including F3 markedly improves identification performance for the Nagaland languages except Nagamese. This initial study reveals the importance of vowel characteristics, and the language identification results are also encouraging for these low resource languages.
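As a rough illustration of the finding that adding F3 improves identification (the abstract does not describe the classifier used), the sketch below compares a simple classifier trained on duration, F1 and F2 against one that also uses F3. The feature matrix, labels and model choice are placeholder assumptions, not the authors' system.

```python
# Illustrative comparison of per-vowel feature sets for language identification.
# X is assumed to be an (n_vowels x 4) array of [duration, F1, F2, F3]; here it is
# filled with random placeholder values so the script runs end to end.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                        # placeholder features
y = rng.choice(["Nagamese", "Ao", "Lotha", "Sumi", "Angami"], 500)    # placeholder labels

for name, cols in [("dur+F1+F2", [0, 1, 2]), ("dur+F1+F2+F3", [0, 1, 2, 3])]:
    acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, cols], y, cv=5).mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```

With real formant measurements in X, the difference between the two accuracy figures would indicate how much F3 contributes.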
{"title":"Acoustic analysis of vowels in five low resource north East Indian languages of Nagaland","authors":"Joyanta Basu, T. Basu, Soma Khan, Madhab Pal, Rajib Roy, M. S. Bepari, Sushmita Nandi, T. Basu, Swanirbhar Majumder, S. Chatterjee","doi":"10.1109/ICSDA.2017.8384460","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384460","url":null,"abstract":"This paper describes acoustic analysis of vowels in five different low resource languages of Nagaland namely Nagamese, Ao, Lotha, Sumi and Angami from North-Eastern India. Six major vowels (/u/, /o/, /a/, /a/, /e/, /i/) are studied for these languages to build up the characteristic features of these languages from readout speech. Vowel duration and 1st, 2nd and 3rd formant i.e. F1, F2 and F3 are investigated and analyzed for these languages. Using these vowels' knowledge, a small Language Identification module has been developed and tested with unseen samples of the above said languages. Result shows that instead of considering F1, F2 and vowel duration only, inclusion of F3 markedly improves the performance for identification of Nagaland languages except for Nagamese. This initial study unveils the importance of vowel characteristics. The result of language identification is also encouraging for these low resource languages.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115981145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods and challenges for creating an emotional audio-visual database
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384466
Meghna Pandharipande, Rupayan Chakraborty, Sunil Kumar Kopparapu
Emotion plays a very important role in human communication and can be expressed verbally through speech (e.g. pitch, intonation, prosody) or through facial expressions, gestures, etc. Most contemporary human-computer interaction systems are deficient in interpreting this information and hence suffer from a lack of emotional intelligence. In other words, these systems are unable to identify a human's emotional state and therefore cannot react properly. To overcome these limitations, machines need to be trained on annotated emotional data samples. Motivated by this, we have attempted to collect and create an audio-visual emotional corpus. Audio-visual signals of multiple subjects were recorded while they watched either a presentation (with background music) or emotional video clips. After recording, subjects were asked to express how they felt and to read out sentences that appeared on the screen. The recorded data was annotated both by the subjects themselves and by other annotators.
{"title":"Methods and challenges for creating an emotional audio-visual database","authors":"Meghna Pandharipande, Rupayan Chakraborty, Sunil Kumar Kopparapu","doi":"10.1109/ICSDA.2017.8384466","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384466","url":null,"abstract":"Emotion has a very important role in human communication and can be expressed either verbally through speech (e.g. pitch, intonation, prosody etc), or by facial expressions, gestures etc. Most of the contemporary human-computer interaction are deficient in interpreting these information and hence suffers from lack of emotional intelligence. In other words, these systems are unable to identify human's emotional state and hence is not able to react properly. To overcome these inabilities, machines are required to be trained using annotated emotional data samples. Motivated from this fact, here we have attempted to collect and create an audio-visual emotional corpus. Audio-visual signals of multiple subjects were recorded when they were asked to watch either presentation (having background music) or emotional video clips. Post recording subjects were asked to express how they felt, and to read out sentences that appeared on the screen. Self annotation from the subject itself, as well as annotation from others have also been carried out to annotate the recorded data.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125622705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
O-MARC: A multilingual online speech data acquisition for Indian languages
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384464
S. Sinha, S. Sharan, S. Agrawal
Continued effort on speech resource development facilitates advances in speech technology for spoken languages. Acquiring speech data is a demanding task because of its high cost and the limited availability of suitable speakers. Access to online digital tools greatly helps with speaker availability and easy collection of speech samples. This paper describes an online multilingual audio resource collection interface (O-MARC) for speech samples, used here for three Indian languages: Hindi, Punjabi, and Manipuri. The interface works in a distributed environment and enables fast and easy collection of speech samples for prompted text messages in a variety of recording environments. Metadata and the recorded samples are automatically saved to a centralized server and stored in base64 format. The application is accessible on smartphones, desktops/laptops, or PDAs running any operating system. To address internet connectivity issues, recorded samples are temporarily stored in local storage that is continuously synchronized with the centralized server. Participants' feedback on the tool is also included in the paper.
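A minimal sketch of the storage flow described above, assuming a hypothetical upload endpoint and local file layout: each sample is base64-encoded, posted to the central server when possible, and kept in local storage for later synchronization when the connection fails. This illustrates the idea, not the O-MARC implementation.

```python
# Hedged sketch: base64-encode a recorded sample, upload it, or queue it locally.
# SERVER_URL and the payload fields are assumptions made for illustration.
import base64, json, os, uuid
import requests

SERVER_URL = "https://example.org/omarc/upload"   # hypothetical endpoint
LOCAL_DIR = "pending_uploads"

def submit_sample(wav_path: str, metadata: dict) -> None:
    with open(wav_path, "rb") as f:
        payload = {"metadata": metadata,
                   "audio_b64": base64.b64encode(f.read()).decode("ascii")}
    try:
        requests.post(SERVER_URL, json=payload, timeout=10).raise_for_status()
    except requests.RequestException:
        os.makedirs(LOCAL_DIR, exist_ok=True)
        with open(os.path.join(LOCAL_DIR, f"{uuid.uuid4()}.json"), "w") as out:
            json.dump(payload, out)   # kept locally until the server is reachable

def sync_pending() -> None:
    # Re-send queued samples; stop early if still offline.
    if not os.path.isdir(LOCAL_DIR):
        return
    for name in os.listdir(LOCAL_DIR):
        path = os.path.join(LOCAL_DIR, name)
        with open(path) as f:
            payload = json.load(f)
        try:
            requests.post(SERVER_URL, json=payload, timeout=10).raise_for_status()
            os.remove(path)
        except requests.RequestException:
            break
```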
{"title":"O-MARC: A multilingual online speech data acquisition for Indian languages","authors":"S. Sinha, S. Sharan, S. Agrawal","doi":"10.1109/ICSDA.2017.8384464","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384464","url":null,"abstract":"More and more efforts on speech resource development will facilitate advancements in speech technology for spoken languages. Acquisition of speech data is a rigorous task due to high cost and non-availability of suitable speakers. Accessibility to online digital tools will greatly help in speaker availability and easy collection of speech samples. This paper describes an online multilingual audio resource collection interface (O-MARC) for speech samples and is used for three Indian languages i.e. Hindi, Punjabi, and Manipuri. The interface works in a distributed environment and provides a fast and easy collection of speech samples in a variety of recording environment for the prompted text messages. Metadata and the recorded samples are automatically saved to the centralized server and stored in base64 format. This application is accessible on smartphones, desktop/laptop or PDA running any operating system. To address the internet connectivity issue recorded samples are temporarily stored in the local storage that is continuously synchronized with the centralized server. Participant's feedback on the tool is also included in the paper.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130486242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear-scale filterbank for deep neural network-based voice activity detection
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384446
Youngmoon Jung, Younggwan Kim, Hyungjun Lim, Hoirin Kim
Voice activity detection (VAD) is an important preprocessing module in many speech applications. Choosing appropriate features and model structures is a significant challenge and an active area of current VAD research. Mel-scale features such as Mel-frequency cepstral coefficients (MFCCs) and log Mel-filterbank (LMFB) energies have been widely used in VAD as well as in speech recognition. Feature extraction on the Mel-frequency scale is one of the most popular methods because it mimics how human ears process sound. However, for certain types of sound whose important characteristics lie more in the high frequency range, a linear frequency scale may provide more information than the Mel scale. Therefore, in this paper, we propose a deep neural network (DNN)-based VAD system using linear-scale features. This study shows that linear-scale features, especially log linear-filterbank (LLFB) energies, can be used for DNN-based VAD and perform better than LMFB for certain types of noise. Moreover, a combination of LMFB and LLFB can integrate the advantages of both features.
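To make the contrast concrete, the sketch below builds triangular filterbanks whose corner frequencies are spaced linearly in Hz (LLFB-style) versus spaced on the Mel scale (LMFB-style). It illustrates the general idea rather than the authors' exact filterbank configuration; the sample rate, FFT size and filter count are assumed values.

```python
# Linear-scale vs Mel-scale triangular filterbanks (illustrative sketch).
import numpy as np

def triangular_filterbank(edges_hz, n_fft, sample_rate):
    """Build triangular filters whose corner frequencies are given in Hz."""
    fft_freqs = np.linspace(0, sample_rate / 2, n_fft // 2 + 1)
    banks = np.zeros((len(edges_hz) - 2, len(fft_freqs)))
    for i in range(1, len(edges_hz) - 1):
        lo, center, hi = edges_hz[i - 1], edges_hz[i], edges_hz[i + 1]
        banks[i - 1] = np.clip(np.minimum((fft_freqs - lo) / (center - lo),
                                          (hi - fft_freqs) / (hi - center)), 0, None)
    return banks

sr, n_fft, n_filters = 16000, 512, 40
# Linear scale: corners evenly spaced in Hz, so high frequencies get as many filters as low ones.
linear_edges = np.linspace(0, sr / 2, n_filters + 2)
# Mel scale: corners evenly spaced in mel, converted back to Hz (denser at low frequencies).
mel = lambda f: 2595 * np.log10(1 + f / 700)
inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
mel_edges = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))

lin_fb = triangular_filterbank(linear_edges, n_fft, sr)
mel_fb = triangular_filterbank(mel_edges, n_fft, sr)
# Given a power spectrum `power_spec` of length n_fft // 2 + 1, the log energies would be:
# llfb = np.log(lin_fb @ power_spec + 1e-10); lmfb = np.log(mel_fb @ power_spec + 1e-10)
```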
{"title":"Linear-scale filterbank for deep neural network-based voice activity detection","authors":"Youngmoon Jung, Younggwan Kim, Hyungjun Lim, Hoirin Kim","doi":"10.1109/ICSDA.2017.8384446","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384446","url":null,"abstract":"Voice activity detection (VAD) is an important preprocessing module in many speech applications. Choosing appropriate features and model structures is a significant challenge and an active area of current VAD research. Mel-scale features such as Mel-frequency cepstral coefficients (MFCCs) and log Mel-filterbank (LMFB) energies have been widely used in VAD as well as speech recognition. The reason for feature extraction in Mel- frequency scale to be one of the most popular methods is that it mimics how human ears process sound. However, for certain types of sound, in which important characteristics are reflected more in the high frequency range, a linear-scale in frequency may provide more information than the Mel- scale. Therefore, in this paper, we propose a deep neural network (DNN)-based VAD system using linear-scale feature. This study shows that the linear-scale feature, especially log linear-filterbank (LLFB) energy, can be used for the DNN-based VAD system and shows better performance than the LMFB for certain types of noise. Moreover, a combination of LMFB and LLFB can integrates both advantages of the two features.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131531262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rhythm and disfluency: Interactions in Chinese L2 English speech
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384459
Jue Yu, Lu Zhang, Shengyi Wu, Bei Zhang
This paper focuses on rhythm patterns in Chinese L2 English speech, in both read and spontaneous speech styles. The main purpose is to investigate the rhythmic differences between Chinese L2 and English L1 speakers, the possibility of rhythmic variation between spontaneous and read speech, and, last but not least, the effects of disfluency on Chinese L2 English rhythm. It is found that Chinese L2 learners can successfully acquire discourse rhythm patterns in a more natural speech style, but controlling vocalic duration variability is still a major challenge. Compared with English natives, Chinese L2 learners are considerably more disfluent in both time-related and performance-related aspects, and they apply planning strategies different from those of English natives. Temporal fluency has a strong impact on Chinese L2 speech rhythm.
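The abstract does not state which rhythm metrics were used, but one common measure of vocalic duration variability is the normalized Pairwise Variability Index (nPVI), sketched below purely as an example of the kind of measure involved.

```python
# nPVI = 100/(m-1) * sum over adjacent vowels of |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2)
def npvi(durations):
    """Normalized Pairwise Variability Index over a sequence of vocalic durations."""
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

# Vowel durations in seconds (made-up values); higher nPVI means more variable durations.
print(npvi([0.08, 0.15, 0.06, 0.18, 0.07]))
```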
{"title":"Rhythm and disfluency: Interactions in Chinese L2 English speech","authors":"Jue Yu, Lu Zhang, Shengyi Wu, Bei Zhang","doi":"10.1109/ICSDA.2017.8384459","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384459","url":null,"abstract":"This paper mainly focused on the rhythm patterns in Chinese L2 English speech, either in read or spontaneous speech style. The main purpose is to investigate the rhythmic differences between Chinese L2 and English L1 speakers as well as the possibility of rhythmic variation between spontaneous and read speech style; and last but not least, to figure out the effects of disfluency on Chinese L2 English rhythm. It is found that Chinese L2 learners can successfully acquire discourse rhythm patterns in a more natural speech style but how to manipulate vocalic duration variability is still a major challenge. Compared with English natives, Chinese L2 learners are considerably more disfluent, in terms of time-related and performance-related aspects; moreover, apply different planning strategies from English natives. Temporal fluency has a big impact on Chinese L2 speech rhythm.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128987502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature selection method for real-time speech emotion recognition
Pub Date: 2017-11-01 | DOI: 10.1109/ICSDA.2017.8384453
Reda Elbarougy, M. Akagi
Feature selection is a very important step in improving the accuracy of speech emotion recognition for many applications, such as speech-to-speech translation systems. Thousands of features can be extracted from the speech signal, but it is unclear which of them are most related to the speaker's emotional state; most of the features related to emotional states have not yet been identified. The purpose of this paper is to propose a feature selection method that can find the features most related to the emotional state, whether the relationship is linear or non-linear. Most previous studies used either the correlation between acoustic features and emotions for feature selection or principal component analysis (PCA) for feature reduction. These traditional methods do not reflect all types of relations between acoustic features and the emotional state; they can only find features that have a linear relationship. However, the relationship between any two variables can be linear, nonlinear, or fuzzy, and a feature selection method should account for all of these. Therefore, a feature selection method based on a fuzzy inference system (FIS) is proposed. The proposed method can find all features that have any of the above kinds of relationships. A second FIS is then used to estimate the emotion dimensions valence and activation, and a third FIS maps the estimated valence and activation values to an emotional category. The experimental results reveal that the proposed feature selection method outperforms the traditional methods.
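A toy numerical illustration of the motivation above (not the authors' FIS method): a feature can depend strongly but nonlinearly on an emotion dimension while its Pearson correlation is near zero, so a correlation-only criterion would discard it, whereas a nonlinear-aware score such as mutual information still detects the dependence. The data here is synthetic and purely for illustration.

```python
# Correlation-based selection misses nonlinear relationships; mutual information does not.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)                          # hypothetical acoustic feature
valence = x ** 2 + 0.05 * rng.normal(size=x.size)     # nonlinear dependence on the feature

r = np.corrcoef(x, valence)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), valence, random_state=0)[0]
print(f"Pearson r = {r:.3f} (looks unrelated), mutual information = {mi:.3f} (clearly related)")
```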
{"title":"Feature selection method for real-time speech emotion recognition","authors":"Reda Elbarougy, M. Akagi","doi":"10.1109/ICSDA.2017.8384453","DOIUrl":"https://doi.org/10.1109/ICSDA.2017.8384453","url":null,"abstract":"Feature selection is very important step to improve the accuracy of speech emotion recognition for many applications such as speech-to-speech translation system. Thousands of features can be extracted from speech signal however which features are the most related for speaker emotional state. Until now most of related features to emotional states are not yet found. The purpose of this paper is to propose a feature selection method which have the ability to find most related features with linear or non-linear relationship with the emotional state. Most of the previous studies used either correlation between acoustic features and emotions as for feature selection or principal component analysis (PCA) as a feature reduction method. These traditional methods does not reflect all types of relations between acoustic features and emotional state. They only can find the features which have a linear relationship. However, the relationship between any two variables can be linear, nonlinear or fuzzy. Therefore, the feature selection method should consider these kind of relationship between acoustic features and emotional state. Therefore, a feature selection method based on fuzzy inference system (FIS) was proposed. The proposed method can find all features which have any kind of above mentioned relationships. Then A FIS was used to estimate emotion dimensions valence and activations. Third FIS was used to map the values of estimated valence and activation to emotional category. The experimental results reveal that the proposed features selection method outperforms the traditional methods.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114557629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}