2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA): Latest Publications
Indian Languages Corpus for Speech Recognition
Joyanta Basu, Soma Khan, Rajib Roy, Babita Saxena, Dipankar Ganguly, Sunita Arora, K. Arora, S. Bansal, S. Agrawal
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041171
Robust speech recognition systems for various languages have moved beyond research labs into commercial products, owing largely to major developments in machine learning, especially deep learning. However, advanced speech recognition systems can be developed only when specially curated speech data is available, and systems of usable quality are yet to be built for most Indian languages. This paper describes the design and development of a standard speech corpus that can be used for developing and benchmarking general-purpose ASR systems. The database has been developed for three Indian languages: Hindi, Bengali, and Indian English. The corpus design incorporates important parameters such as phonetic coverage and distribution. The data was recorded by 1500 speakers per language, male and female, across different age groups and in varying environments. Recordings were collected on a server through an online recording system and transcribed using semi-automatic tools. The paper describes the corpus design methodology, the challenges faced, and the approaches adopted to overcome them. The whole process of designing the speech database is generic enough to be applied to other languages as well.
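The phonetic-coverage criterion mentioned in this abstract can be illustrated with a small sketch. The greedy sentence selection below is an assumption about how such coverage might be computed when designing recording prompts, not the authors' actual pipeline; the toy word-to-phoneme `lexicon` is hypothetical.

```python
def sentence_phonemes(sentence, lexicon):
    """Phonemes of a sentence, via a word-to-phonemes lexicon."""
    return [p for word in sentence.split() for p in lexicon.get(word, [])]

def greedy_select(sentences, lexicon, target):
    """Greedily pick sentences that add the most not-yet-covered phonemes."""
    selected, covered, pool = [], set(), list(sentences)
    while pool and not target <= covered:
        best = max(pool, key=lambda s: len(set(sentence_phonemes(s, lexicon)) - covered))
        gain = set(sentence_phonemes(best, lexicon)) - covered
        if not gain:  # no remaining sentence improves coverage
            break
        selected.append(best)
        covered |= gain
        pool.remove(best)
    return selected, covered

# Toy example: two sentences suffice to cover all five phonemes.
lexicon = {"ka": ["k", "a"], "ta": ["t", "a"], "pi": ["p", "i"]}
sel, cov = greedy_select(["ka", "ka ta", "pi"], lexicon, {"k", "a", "t", "p", "i"})
```

Real corpus designs usually optimize over phoneme bigrams or triphones rather than single phonemes, but the greedy set-cover idea is the same.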
Acquisition of English retroflex vowel [3] by EFL learners from Chinese dialectal regions: A case study of Beijing and Changsha
Bin Li, Yuan Jia
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060843
This paper investigates, through an extensive acoustic analysis, the acquisition of the English retroflex vowel [3] by learners of English as a Foreign Language (EFL) from Beijing (BJ) and Changsha (CS), representative dialectal regions of northern and southern China, respectively. Formant values and duration were selected as analysis parameters. The results demonstrate that all the EFL learners in the study produced the onset and offset targets of [3] with a more backward tendency. In formant patterns, both native speakers and EFL learners show a similar tendency, namely a fall in F3 and a rise in F2. For CS speakers, F3 falls more slowly due to the influence of their mother tongue. Moreover, from the spectral perspective, the F3 change rate of male CS learners is significantly smaller than that of native speakers, whereas BJ learners, especially female learners, show more pronounced changes in F3 than native speakers. We further speculate that language background and gender can affect the acquisition of retroflex vowels.
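The "F3 change rate" compared in this study is, in essence, the slope of the F3 trajectory over time. A minimal sketch follows (a plain least-squares slope over a sampled formant track; the paper's exact measurement procedure may differ):

```python
def formant_slope(times_s, formant_hz):
    """Least-squares slope of a formant track, in Hz per second."""
    n = len(times_s)
    mt = sum(times_s) / n
    mf = sum(formant_hz) / n
    num = sum((t - mt) * (f - mf) for t, f in zip(times_s, formant_hz))
    den = sum((t - mt) ** 2 for t in times_s)
    return num / den

# A falling F3 track sampled every 50 ms: slope is -2000 Hz/s.
rate = formant_slope([0.0, 0.05, 0.10], [2700, 2600, 2500])
```

A smaller absolute slope, as reported for the CS male learners, means F3 descends more gradually toward the retroflex target.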
Phoneme-level speaking rate variation on waveform generation using GAN-TTS
Mayuko Okamato, S. Sakti, Satoshi Nakamura
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060845
The development of text-to-speech synthesis (TTS) systems continues to advance, and the naturalness of their generated speech has improved significantly. However, most current TTS systems learn from data within a deep learning framework and generate output at a monotonous speaking rate. In contrast, humans vary their speaking rate and tend to slow down to emphasize words, distinguishing the elements of focus in an utterance.
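The contrast drawn in this abstract, uniform machine pacing versus selective human slow-down, can be sketched at the level of phoneme durations. The representation below (phoneme/duration pairs, a set of emphasized phonemes, a stretch factor) is purely illustrative and is not the paper's GAN-TTS mechanism:

```python
def stretch_emphasis(phoneme_durations, emphasized, factor=1.5):
    """Lengthen the durations (seconds) of phonemes inside emphasized words."""
    return [(p, d * factor) if p in emphasized else (p, d)
            for p, d in phoneme_durations]

# Slow down the vowel of an emphasized word by 50%.
out = stretch_emphasis([("k", 0.06), ("a", 0.10)], emphasized={"a"})
```

In a neural TTS pipeline such per-phoneme duration targets would condition the duration model or vocoder rather than being applied after the fact.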
A Great Reduction of WER by Syllable Toneme Prediction for Thai Grapheme to Phoneme Conversion
S. Saychum, A. Rugchatjaroen, C. Wutiwiwatchai
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041212
Thai toneme prediction has been one of the greatest difficulties in Thai grapheme-to-phoneme conversion (G2P). This paper presents an improvement in the prediction of linguistic features in terms of tone rules. There will always be exceptions to these rules, for example the tones of loan words and transliterated words, which are usually adopted from the original language. This paper does not address the transliteration problem, but aims to show the success of a method that uses an automatic toneme predictor, based on the tone rules of Thai pronunciation, in a machine learning model; the predictor is attached to the final stage of grapheme-to-phoneme conversion. Furthermore, this work also explores end-to-end prediction using Long Short-Term Memory (LSTM) networks that take their input sequence from the National Electronic and Computer Technology Center's pseudo-syllable segmentation and alignment tool. An evaluation was conducted to show the success of the proposed system and to compare the results with our traditional end-to-end sequence-to-sequence G2P. The comparison shows that sequence-to-sequence modeling obtains the lowest word error rate, at 1.6%, and that the proposed system runs well on a small 2018-era device (Raspberry Pi).
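The tone rules such a predictor encodes map a syllable's initial-consonant class, live/dead status, and tone mark to a toneme. The table below is a simplified illustration covering only a subset of the standard Thai tone rules, from general reference knowledge rather than the paper itself:

```python
# (initial consonant class, syllable type, tone mark) -> toneme
# Partial table of standard Thai tone rules, for illustration only.
TONE_RULES = {
    ("mid",  "live",       None):      "mid",
    ("mid",  "dead",       None):      "low",
    ("high", "live",       None):      "rising",
    ("high", "dead",       None):      "low",
    ("low",  "live",       None):      "mid",
    ("low",  "dead-short", None):      "high",
    ("low",  "dead-long",  None):      "falling",
    ("mid",  "live",       "mai ek"):  "low",
    ("mid",  "live",       "mai tho"): "falling",
}

def predict_toneme(consonant_class, syllable_type, tone_mark=None):
    """Look up the toneme; None signals a case outside this toy table."""
    return TONE_RULES.get((consonant_class, syllable_type, tone_mark))
```

Loan words and transliterations are exactly the cases where such table lookups fail, which is why the paper combines rules with a learned model.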
Sequence-to-Sequence Models for Grapheme to Phoneme Conversion on Large Myanmar Pronunciation Dictionary
Aye Mya Hlaing, Win Pa Pa
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041225
Grapheme-to-phoneme (G2P) conversion produces the pronunciation of a given word. Neural sequence-to-sequence models have recently been applied to G2P conversion. This paper analyzes the effectiveness of neural sequence-to-sequence models for G2P conversion in the Myanmar language. The first large Myanmar pronunciation dictionary is introduced and applied to building sequence-to-sequence models. The performance of four G2P conversion models, a joint-sequence model, a Transformer, a simple encoder-decoder, and an attention-enabled encoder-decoder, is evaluated in terms of phoneme error rate (PER) and word error rate (WER). Analyses of three word classes and six phoneme error types are presented and discussed in detail. According to the evaluations, the Transformer achieves results comparable to the traditional joint-sequence model.
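PER and WER, the metrics used throughout these G2P comparisons, are both normalized edit distances: the minimum number of token substitutions, insertions, and deletions between hypothesis and reference, divided by the reference length. A sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """PER when tokens are phonemes, WER when tokens are words."""
    return edit_distance(ref, hyp) / len(ref)

# One deleted phoneme out of three reference phonemes: PER = 1/3.
per = error_rate(["k", "a", "t"], ["k", "t"])
```

For word-level WER on a G2P task, a word counts as wrong if any phoneme in its predicted pronunciation differs from the reference.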
index
Pub Date: 2019-10-01 | DOI: 10.1109/o-cocosda46868.2019.9041241
Annotation and preliminary analysis of utterance decontextualization in a multiactivity
Haruka Amatani, Yayoi Tanaka
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041203
How are conversations decontextualized from the here-and-now situation during a daily joint activity? More specifically, how are such (de)contextualized utterances associated with movements within the activity? Applying Cloran's [1] Rhetorical Units, we identified the degree of decontextualization of utterances with regard to their temporal and spatial distance from the ongoing situation. For the annotation of hand and body movements, we employed Kendon's [2] gesture phases. The association between speech and movement was then examined using the degrees of decontextualization and the movement phases. The results of this preliminary analysis suggest that participants tended to produce utterances with higher degrees of decontextualization while pausing their movements than while moving.
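The reported association, higher decontextualization while movements pause, amounts to comparing the distribution of decontextualization degrees across gesture phases. A toy aggregation over annotated pairs (the numeric degree scale and the phase labels here are assumptions, not the paper's coding scheme):

```python
from collections import defaultdict

def mean_degree_by_phase(annotations):
    """annotations: (movement_phase, decontextualization_degree) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for phase, degree in annotations:
        totals[phase][0] += degree
        totals[phase][1] += 1
    return {phase: s / n for phase, (s, n) in totals.items()}

# Utterances during holds score higher than those during strokes.
means = mean_degree_by_phase([("hold", 3), ("hold", 5), ("stroke", 1), ("stroke", 1)])
```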
Three-year-old children's production of native mandarin Chinese lexical tones
Ao Chen, Hintat Cheung, Yuchen Li, Liqun Gao
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9060851
The current study investigated native Mandarin Chinese children's production of native lexical tones, in particular the low-rising tone (T2) and the low-dipping tone (T3), which are acoustically the most similar of the Mandarin lexical tones. Using a picture-naming task, ten 3-year-old children produced fourteen monosyllabic and disyllabic familiar words. Ten female adults performed the same task as a control group. Acoustic measurements of pitch values and pitch alignment were conducted to analyze whether the children used acoustic cues to distinguish T2 and T3 in an adult-like way, and whether the presence of tonal context in the disyllabic words influenced the acoustic implementation of T2 and T3. The results showed that, overall, the children exhibited adult-like pitch contours for T2 and T3; yet unlike the adults, who maintained the low feature of T3 for both pitch minimum and pitch maximum, the children tended to raise the pitch maximum, and consequently widen the pitch range, to allow for implementation of the complex pitch contour of T3. This increase is more evident for the disyllabic than for the monosyllabic words. These findings suggest that the presence of tonal context and tonal carry-over effects make it more demanding for children to realize the complex pitch contour of T3, and that they widen the pitch range to achieve this goal.
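A widened pitch range, as described for the children's T3, is conventionally measured in semitones between the pitch minimum and maximum of the contour. A sketch (the semitone formula is standard; the contour values below are invented):

```python
import math

def pitch_range_semitones(f0_track_hz):
    """Pitch range of an F0 contour, in semitones: 12 * log2(max / min)."""
    voiced = [f for f in f0_track_hz if f and f > 0]  # drop unvoiced frames (F0 = 0)
    return 12 * math.log2(max(voiced) / min(voiced))

# A contour dipping from 300 Hz to 150 Hz spans one octave: 12 semitones.
rng = pitch_range_semitones([300, 150, 0, 200])
```

Measuring range in semitones rather than Hz makes child and adult voices comparable despite their different absolute pitch levels.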
Characteristics of everyday conversation derived from the analysis of dialog act annotation
Yuriko Iseki, Keisuke Kadota, Yasuharu Den
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041235
This paper describes an attempt to identify the characteristics of everyday conversation through dialog act (DA) information. Although several earlier studies have discussed how to annotate DA information, few have used the resulting annotation as a clue for deriving the characteristics of conversation. We report on work annotating dialog act information on utterances in Japanese everyday conversation, and on the possibility of extracting interactional characteristics from that annotation. The analysis found that the annotation reflects differences in behaviour depending on the type of conversation and the participants' age. Moreover, even in conversations with similar settings, differences were found in the distribution of tags concerning interactional management. This suggests that the annotation may also reflect information that is difficult to capture objectively, such as the conversational atmosphere.
Recent Progress of Mandrain Spontaneous Speech Recognition on Mandrain Conversation Dialogue Corpus
Yu-Chih Deng, Yih-Ru Wang, Sin-Horng Chen, Chen-Yu Chiang
Pub Date: 2019-10-01 | DOI: 10.1109/O-COCOSDA46868.2019.9041223
This paper presents a progress report on a relatively difficult ASR task on a spontaneous speech corpus, the Mandarin Conversational Dialogue Corpus (MCDC). A DNN-based acoustic model is constructed using the CLDNN structure and a large dataset comprising two spontaneous-speech corpora and one read-speech corpus. The study uses a large text dataset formed from seven corpora to train an efficient general language model (LM); two LMs adapted specifically for spontaneous speech recognition are also constructed. Experimental results show that the best performance on MCDC reached a character error rate (CER) of 26.3% and a word error rate (WER) of 32.5%, representing relative CER and WER reductions of 27.9% and 22.2% compared with the previous best HMM-based method. This confirms that the proposed method is promising for tackling Mandarin spontaneous speech recognition.
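The relative reductions quoted here (27.9% CER, 22.2% WER) relate the new error rates to the HMM baseline as (old - new) / old; working backwards, the implied HMM baseline is roughly 26.3 / (1 - 0.279) ≈ 36.5% CER. That baseline figure is inferred arithmetic, not a number stated in the abstract. A check:

```python
def relative_reduction(old_rate, new_rate):
    """Relative error reduction: the fraction of the old error removed."""
    return (old_rate - new_rate) / old_rate

def implied_baseline(new_rate, reduction):
    """Baseline implied by a new rate and its quoted relative reduction."""
    return new_rate / (1 - reduction)

# Recover the implied HMM CER, then confirm it reproduces the quoted 27.9%.
base = implied_baseline(26.3, 0.279)
```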