
2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA): Latest Publications

ConPro: Heteronym pronunciation corpus with context information for text-to-phoneme evaluation in Thai
C. Hansakunbuntheung, Sumonmas Thatphithakkul
Heteronyms, texts that have multiple pronunciations depending on their context, are a crucial problem in text-to-phoneme conversion. Conventional pronunciation corpora that collect only grapheme-phoneme pairs are not sufficient to evaluate the heteronym issue. Furthermore, in languages without word breaks, e.g., Thai, orthographic groups with multiple possible word segmentations are another major cause of ambiguous pronunciations. This paper therefore proposes the "ConPro" corpus, a context-dependent pronunciation corpus of Thai heteronyms with systematic collection and context information for evaluating the accuracy of text-to-phoneme conversion. The keys of the corpus design are: 1) the multiple-word orthographic group as the basic unit; 2) pragmatic, compact contextual texts as evaluation texts; 3) Categorial Matrix tags for representing the orthographic types and usage domains of orthographic groups and for investigating problem categories in text-to-phoneme conversion; and 4) pronunciation-and-meaning-prioritized heteronym collection to extend the coverage of heteronyms and contexts.
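To make the corpus design concrete, here is a minimal sketch of what a single ConPro entry might look like. The schema, the field names, and the Thai example are illustrative assumptions; the paper does not publish its data format.

```python
from dataclasses import dataclass, field

@dataclass
class ConProEntry:
    """Hypothetical schema for one ConPro record; all field names are illustrative."""
    orthographic_group: str   # multiple-word orthographic group, the basic unit (key 1)
    context_text: str         # pragmatic, compact contextual text for evaluation (key 2)
    pronunciation: str        # target phoneme string in this context
    meaning: str              # gloss that disambiguates the heteronym (key 4)
    categorial_matrix: dict = field(default_factory=dict)  # orthographic type / usage domain tags (key 3)

# Illustrative entry for the commonly cited Thai heteronym เพลา ('axle' vs. 'time');
# the transcription is simplified and not taken from the paper.
entry = ConProEntry(
    orthographic_group="เพลา",
    context_text="รถคันนี้เพลาหัก",  # "this car's axle is broken"
    pronunciation="phlaw",
    meaning="axle",
    categorial_matrix={"orthographic_type": "native Thai", "usage_domain": "general"},
)
print(entry.orthographic_group, "->", entry.pronunciation)
```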
Citations: 0
M2ASR: Ambitions and first year progress
Dong Wang, T. Zheng, Zhiyuan Tang, Ying Shi, Lantian Li, Shiyue Zhang, Hongzhi Yu, Guanyu Li, Shipeng Xu, A. Hamdulla, Mijit Ablimit, Gulnigar Mahmut
In spite of the rapid development of speech techniques, most of the present achievements are for a few major languages, e.g., English and Chinese. Unfortunately, most of the languages in the world are 'minority languages', in the sense that they are spoken by a small population and have limited resource accumulation. Since present speech technologies are mostly based on big data, partly due to the profound impact of deep learning, they are not directly applicable to minority languages. However, minority languages are so numerous and important that if we want to break the language barrier, they must be taken seriously into account. Recently, the Chinese government approved a fundamental research project on minority languages in China: Multilingual Minorlingual Automatic Speech Recognition (M2ASR). Although the initial goal was speech recognition, the ambition of the project goes further: it intends to publish all of its achievements and make them free for the research community, including speech and text corpora, phone sets, lexicons, tools, recipes, and prototype systems. In this paper, we describe the project, report the first-year progress, and present the future plan.
Citations: 14
Developing a speech corpus from web news for Myanmar (Burmese) language
Aye Nyein Mon, Win Pa Pa, Ye Kyaw Thu, Y. Sagisaka
A speech corpus is essential for statistical-model-based automatic speech recognition, and its quality is reflected in the performance of a speech recognizer. Although speech corpora for resource-rich languages such as English are widely available and easy to use, no Myanmar speech corpus is freely available for automatic speech recognition (ASR) research, since Myanmar is a low-resource language. This paper presents the design and development of a Myanmar speech corpus for the news domain, to be applied to convolutional neural network (CNN)-based Myanmar continuous speech recognition research. The corpus consists of 20 hours of read speech data collected from online web news, spoken by 178 speakers (126 female and 52 male). The corpus is evaluated on two test sets: TestSet1 (web data) and TestSet2 (news recordings by 10 native speakers). Using a CNN-based model, the word error rate (WER) is 24.73% on TestSet1 and 22.95% on TestSet2.
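For readers who want to reproduce the evaluation metric, the reported WER is the word-level Levenshtein distance normalized by the reference length. A minimal sketch of that standard computation follows; it is not the authors' evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words = 0.33
```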
Citations: 2
How prosodic cues could lead to information center in speech - An alternative to ASR
Chao-yu Su, Chiu-yu Tseng
It has been reported in the ASR literature that prosody helps retrieve important textual information word by word. We therefore believe that prosodic information in the speech signal could be used to facilitate speech processing more directly. The prosodic word, a perceptually identifiable unit usually slightly larger than the lexical word, is a possible alternative for locating important information in speech. We compare acoustic analyses across labels of perceived prosodic highlights within prosodic words and semantic foci within words. The results demonstrate that prosodic highlights occur before targeted key information and function as advance prompts that outline upcoming semantic foci ahead of time. The semantic saliency of targeted words is thus enhanced beforehand, and correct anticipation is facilitated prior to detailed lexical processing. A further approach to automatic identification of key content by prosodic features also shows the possibility of retrieving important information through prosodic words. We believe these results demonstrate that not all information in speech is equally important, that locating the information center is the key to speech communication, and that the contribution of prosody is critical.
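As a rough illustration of how prosodic highlights might be flagged automatically from the signal, the sketch below scores frames by z-scored F0 and intensity. This is one plausible approach under stated assumptions, not the authors' method; the features and the threshold are made up.

```python
import numpy as np
import librosa

def prominence_scores(wav_path: str) -> np.ndarray:
    """Frame-level prominence: mean of z-scored F0 and z-scored RMS energy (illustrative)."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                            frame_length=2048, hop_length=512)
    rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
    n = min(len(f0), len(rms))
    f0, rms = f0[:n], rms[:n]
    f0 = np.where(np.isnan(f0), np.nanmedian(f0), f0)   # fill unvoiced frames
    z = lambda v: (v - v.mean()) / (v.std() + 1e-12)
    return (z(f0) + z(rms)) / 2                          # high values ~ prosodic highlight

# Frames whose score exceeds, say, 1.0 could be flagged as candidate highlights.
```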
Citations: 1
Acoustic analysis of vowels in five low resource north East Indian languages of Nagaland
Joyanta Basu, T. Basu, Soma Khan, Madhab Pal, Rajib Roy, M. S. Bepari, Sushmita Nandi, T. Basu, Swanirbhar Majumder, S. Chatterjee
This paper describes an acoustic analysis of vowels in five low-resource languages of Nagaland in North-Eastern India, namely Nagamese, Ao, Lotha, Sumi, and Angami. Six major vowels (/u/, /o/, /a/, /a/, /e/, /i/) are studied for these languages to establish their characteristic features from read-out speech. Vowel duration and the first three formants (F1, F2, and F3) are investigated and analyzed for these languages. Using this vowel knowledge, a small language identification module has been developed and tested with unseen samples of the above languages. The results show that, instead of considering only F1, F2, and vowel duration, the inclusion of F3 markedly improves identification performance for the Nagaland languages except Nagamese. This initial study unveils the importance of vowel characteristics. The language identification results are also encouraging for these low-resource languages.
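Formant measurements of the kind reported here can be approximated with classic LPC root-finding. A rough sketch assuming a WAV file and librosa follows; the authors' actual tooling and analysis settings are not stated in the abstract.

```python
import numpy as np
import librosa

def estimate_formants(wav_path: str, t: float, order: int = 12):
    """Estimate the lowest three formants (Hz) around time t via LPC root-finding."""
    y, sr = librosa.load(wav_path, sr=10000)   # resample: formants of interest lie below 5 kHz
    center = int(t * sr)
    frame = y[max(0, center - 128):center + 128]
    frame = frame * np.hamming(len(frame))
    a = librosa.lpc(frame, order=order)        # LPC polynomial coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    formants = [f for f in freqs if f > 90]    # discard implausibly low candidates
    return formants[:3]                        # approximate F1, F2, F3

# Example (path and time are placeholders):
# print(estimate_formants("vowel_u.wav", t=0.25))
```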
Citations: 3
Methods and challenges for creating an emotional audio-visual database
Meghna Pandharipande, Rupayan Chakraborty, Sunil Kumar Kopparapu
Emotion plays a very important role in human communication and can be expressed either verbally through speech (e.g., pitch, intonation, prosody) or through facial expressions, gestures, etc. Most contemporary human-computer interaction systems are deficient in interpreting this information and hence lack emotional intelligence. In other words, these systems are unable to identify the human's emotional state and therefore cannot react appropriately. To overcome these limitations, machines need to be trained on annotated emotional data samples. Motivated by this, we have attempted to collect and create an audio-visual emotional corpus. Audio-visual signals of multiple subjects were recorded while they watched either a presentation (with background music) or emotional video clips. After recording, subjects were asked to express how they felt and to read out sentences that appeared on the screen. Self-annotation by the subjects, as well as annotation by others, was carried out to annotate the recorded data.
Citations: 1
O-MARC: A multilingual online speech data acquisition for Indian languages
S. Sinha, S. Sharan, S. Agrawal
Growing efforts in speech resource development will facilitate advances in speech technology for spoken languages. Acquiring speech data is an arduous task due to high cost and the non-availability of suitable speakers. Access to online digital tools greatly helps with speaker availability and easy collection of speech samples. This paper describes an online multilingual audio resource collection interface (O-MARC) for speech samples, used for three Indian languages: Hindi, Punjabi, and Manipuri. The interface works in a distributed environment and provides fast and easy collection of speech samples for prompted text messages in a variety of recording environments. Metadata and the recorded samples are automatically saved to a centralized server and stored in base64 format. The application is accessible on smartphones, desktops/laptops, or PDAs running any operating system. To address internet connectivity issues, recorded samples are temporarily kept in local storage that is continuously synchronized with the centralized server. Participants' feedback on the tool is also included in the paper.
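As a rough sketch of the storage flow described above (base64-encoded samples plus metadata posted to a centralized server), using only the Python standard library. The endpoint URL and the metadata field names are hypothetical, not from the paper.

```python
import base64
import json
import urllib.request

def upload_recording(wav_path: str, speaker_id: str, language: str) -> None:
    """Encode a recorded sample as base64 and POST it with its metadata."""
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = json.dumps({
        "speaker_id": speaker_id,   # hypothetical metadata fields
        "language": language,       # e.g. "Hindi", "Punjabi", "Manipuri"
        "audio_base64": audio_b64,
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://example.org/omarc/upload",   # placeholder endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
    # In the real tool, a failed upload would stay queued in local storage
    # and be retried when connectivity returns.
```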
Citations: 2
Linear-scale filterbank for deep neural network-based voice activity detection
Youngmoon Jung, Younggwan Kim, Hyungjun Lim, Hoirin Kim
Voice activity detection (VAD) is an important preprocessing module in many speech applications. Choosing appropriate features and model structures is a significant challenge and an active area of current VAD research. Mel-scale features such as Mel-frequency cepstral coefficients (MFCCs) and log Mel-filterbank (LMFB) energies have been widely used in VAD as well as speech recognition. Feature extraction on the Mel-frequency scale is one of the most popular methods because it mimics how human ears process sound. However, for certain types of sound whose important characteristics lie more in the high-frequency range, a linear frequency scale may provide more information than the Mel scale. Therefore, in this paper, we propose a deep neural network (DNN)-based VAD system using linear-scale features. This study shows that linear-scale features, especially log linear-filterbank (LLFB) energies, can be used in a DNN-based VAD system and outperform LMFB for certain types of noise. Moreover, a combination of LMFB and LLFB can integrate the advantages of both features.
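The difference between LMFB and LLFB comes down to how the triangular filter centers are spaced: on the Mel scale versus linearly in Hz. Below is a minimal sketch of a linear-scale triangular filterbank, assuming standard triangular filters; the paper's exact filter settings are not given in the abstract.

```python
import numpy as np

def linear_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters with linearly spaced centers (Mel-spaced centers give LMFB)."""
    # n_filters + 2 edge frequencies, linearly spaced from 0 to Nyquist
    edges_hz = np.linspace(0, sr / 2, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges_hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising slope of triangle m
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope of triangle m
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

# Log filterbank energies for one FFT frame:
# power_spec has shape (n_fft // 2 + 1,)
# llfb = np.log(linear_filterbank(40, 512, 16000) @ power_spec + 1e-10)
```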
Citations: 9
Rhythm and disfluency: Interactions in Chinese L2 English speech
Jue Yu, Lu Zhang, Shengyi Wu, Bei Zhang
This paper focuses on rhythm patterns in Chinese L2 English speech, in both read and spontaneous speech styles. The main purpose is to investigate the rhythmic differences between Chinese L2 and English L1 speakers, the possibility of rhythmic variation between spontaneous and read speech styles, and, last but not least, the effects of disfluency on Chinese L2 English rhythm. We find that Chinese L2 learners can successfully acquire discourse rhythm patterns in a more natural speech style, but that manipulating vocalic duration variability remains a major challenge. Compared with English natives, Chinese L2 learners are considerably more disfluent in time-related and performance-related aspects; moreover, they apply planning strategies different from those of English natives. Temporal fluency has a large impact on Chinese L2 speech rhythm.
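Vocalic duration variability of the kind discussed here is commonly quantified with the normalized pairwise variability index (nPVI). A sketch of that standard metric follows; the abstract does not state which metrics the authors used, so this is illustrative.

```python
def npvi(durations: list[float]) -> float:
    """Normalized pairwise variability index over successive vocalic interval durations."""
    pairs = zip(durations, durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

# Higher nPVI indicates stronger alternation of vocalic durations,
# characteristic of stress-timed rhythm; lower values suggest syllable timing.
print(npvi([0.12, 0.08, 0.15, 0.07]))  # durations in seconds
```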
Citations: 2
Feature selection method for real-time speech emotion recognition
Reda Elbarougy, M. Akagi
Feature selection is a very important step in improving the accuracy of speech emotion recognition for many applications, such as speech-to-speech translation systems. Thousands of features can be extracted from the speech signal, but it is unclear which are most related to the speaker's emotional state; most of the features related to emotional states have yet to be found. The purpose of this paper is to propose a feature selection method that can find the most related features whether their relationship with the emotional state is linear or non-linear. Most previous studies used either the correlation between acoustic features and emotions for feature selection or principal component analysis (PCA) for feature reduction. These traditional methods do not reflect all types of relations between acoustic features and emotional state; they can only find features that have a linear relationship. However, the relationship between any two variables can be linear, nonlinear, or fuzzy, so a feature selection method should account for all of these kinds of relationship between acoustic features and emotional state. Therefore, a feature selection method based on a fuzzy inference system (FIS) is proposed. The proposed method can find all features that have any of the above-mentioned relationships. A second FIS is then used to estimate the emotion dimensions valence and activation, and a third FIS maps the estimated valence and activation values to an emotional category. The experimental results reveal that the proposed feature selection method outperforms the traditional methods.
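To illustrate the kind of inference a FIS performs, the sketch below runs one Mamdani-style rule from a fuzzified acoustic input to a defuzzified valence estimate. The membership functions, the rule, and the input degree are made up for illustration; the paper's actual rule base is not given in the abstract.

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function on grid x with feet at a, c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-12),
                                 (c - x) / (c - b + 1e-12)), 0.0)

# Universe of discourse for valence, normalized to [-1, 1]
valence = np.linspace(-1, 1, 201)
low_pitch_degree = 0.8   # hypothetical fuzzified input: "pitch is low" to degree 0.8

# One illustrative rule: IF pitch is low THEN valence is negative (Mamdani min-implication)
negative_valence = trimf(valence, -1.0, -0.6, 0.0)
clipped = np.minimum(low_pitch_degree, negative_valence)

# Centroid defuzzification of the (single-rule) aggregated output
estimate = np.sum(valence * clipped) / np.sum(clipped)
print(f"estimated valence: {estimate:.2f}")
```

A full system would aggregate many such rules (over pitch, energy, spectral features, and so on) before defuzzifying, and would chain one FIS per stage as the abstract describes.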
Citations: 4