
Latest publications from the 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)

Automatic Pronunciation Generator for Indonesian Speech Recognition System Based on Sequence-to-Sequence Model
Devin Hoesen, Fanda Yuliana Putri, D. Lestari
A pronunciation dictionary plays an important role in a speech recognition system. Obtaining an accurate dictionary by manually specifying the pronunciation of each word requires expert knowledge. Given the continually increasing vocabulary size, especially for Indonesian, it is impractical to transcribe every word by hand. Indonesian spelling-to-pronunciation rules are relatively regular; it is therefore plausible to produce a word's pronunciation from predefined rules. Nevertheless, the rules still contain a few irregularities for some spellings, and they cannot handle code-mixed words and abbreviations. In this paper, we employ a sequence-to-sequence (seq2seq) approach to generate a pronunciation for each word in an Indonesian dictionary. We demonstrate that this approach yields a similar speech-recognition error rate while requiring only a fraction of the resources. Our cross-validation experiment for validating the resulting phonetic sequences achieves a 4.15-6.24% phone error rate (PER). When the automatically produced dictionary is used in a speech recognition system, word accuracy degrades by only 2.22 percentage points compared to the manually produced one. Creating a new large pronunciation dictionary with the proposed model is therefore more efficient and does not significantly degrade recognition accuracy.
DOI: 10.1109/O-COCOSDA46868.2019.9041182
Citations: 3
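The phone error rate (PER) reported in the abstract above is an edit-distance metric over phone sequences. A minimal sketch of how PER is typically computed (standard Levenshtein dynamic programming; this is an illustration, not the authors' evaluation code):

```python
def phone_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / len(ref),
    computed by standard Levenshtein dynamic programming over
    two phone sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# e.g. reference phone string "saja" vs. hypothesis "saj": one deletion
print(phone_error_rate("saja", "saj"))  # 0.25
```

In practice the sequences would be lists of phone symbols rather than characters; the recursion is identical.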
RSL2019: A Realistic Speech Localization Corpus
R. Sheelvant, Bidisha Sharma, Maulik C. Madhavi, Rohan Kumar Das, S. Prasanna, Haizhou Li
In this work, we present a new database for speech localization, the Realistic Speech Localization 2019 (RSL2019) corpus, designed for the study of sound source localization in real-world applications. The RSL2019 corpus is a continuing effort; it presently contains 22.60 hours of speech data recorded with a four-channel microphone array and played over a loudspeaker from different directions of arrival (DOA). We consider 180 speech utterances spoken by 6 speakers, selected from the RSR2015 database, played over a loudspeaker positioned at different angles and distances from the microphone array. We vary the DOA from 0 to 360 degrees in 5-degree steps, at distances of 1 metre and 1.5 metres. From each position and DOA, we also record white noise to study robustness, and a time-stretched pulse to derive the transfer function for the speech localization algorithm. Furthermore, we present experimental results and analysis of a state-of-the-art sound source localization algorithm, using the open-source HARK toolkit, on the created RSL2019 database. The database will be provided for research purposes upon request to the authors.
DOI: 10.1109/O-COCOSDA46868.2019.9060842
Citations: 4
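Localization from a microphone array, as studied above, usually starts from time differences of arrival (TDOA) between channel pairs. A minimal sketch of delay estimation via the peak of the cross-correlation (illustrative only; HARK's localizers are far more elaborate):

```python
import numpy as np

def estimate_tdoa(x, y, fs):
    """Estimate how much signal y lags signal x, in seconds, from the
    peak of their full cross-correlation (a basic TDOA estimator)."""
    corr = np.correlate(y, x, mode="full")
    # index (len(x) - 1) of the full correlation corresponds to zero lag
    lag = int(np.argmax(corr)) - (len(x) - 1)
    return lag / fs

fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                  # reference channel
y = np.concatenate([np.zeros(25), x])[:4096]   # same signal, 25 samples later
print(estimate_tdoa(x, y, fs))  # 0.0015625  (25 / 16000 s)
```

Given TDOAs for several microphone pairs and the array geometry, the DOA follows from simple trigonometry.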
An analysis of voice quality of Chinese patients with depression
Yuan Jia, Yuzhu Liang, T. Zhu
In the present study, we empirically explore how the voice quality of depression patients (the experimental group) differs from that of healthy people (the control group) in terms of jitter, shimmer, harmonics-to-noise ratio (HNR) and pitch. Our analysis reveals that the shimmer, maximum HNR and minimum HNR of patients differ significantly from those of the control group. Specifically, patients tend to have higher shimmer and lower maximum and mean HNR. To determine to what extent emotion influenced these results, we further investigate whether voice quality differs significantly across the emotions (positive, neutral and negative) embedded in the read texts. No significant differences in voice hoarseness are found, showing that the voice-quality measures are immune to emotion. We can therefore conclude that, in general, the voice of depression patients is hoarser than that of non-depressed people.
DOI: 10.1109/O-COCOSDA46868.2019.9060848
Citations: 2
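The jitter and shimmer measures analyzed above are perturbation ratios over consecutive pitch periods and peak amplitudes. A sketch of the common "local" definitions (Praat-style; not the authors' exact extraction pipeline, which starts from the raw audio):

```python
def local_jitter(periods):
    """'Local' jitter: mean absolute difference between consecutive
    pitch periods, divided by the mean period."""
    diffs = [abs(p - q) for p, q in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """'Local' shimmer: the same ratio computed on the peak amplitudes
    of consecutive periods."""
    diffs = [abs(p - q) for p, q in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# four hypothetical pitch periods, in seconds
print(local_jitter([0.0100, 0.0102, 0.0099, 0.0101]))  # ≈ 0.023
```

Both measures are 0 for a perfectly periodic voice and grow with cycle-to-cycle irregularity, which is why elevated shimmer reads as hoarseness.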
XDF-REPA: A Densely Labeled Dataset toward Refined Pronunciation Assessment for English Learning
Yun Gao, Zhigang Ou, Jianfeng Cheng, Yong Ruan, Xiangdong Wang, Yueliang Qian
Currently, most computer-assisted pronunciation training (CAPT) systems focus on overall scoring or mispronunciation detection. In this paper, we address refined pronunciation assessment (RPA), which aims to provide more fine-grained feedback to L2 learners. To meet the major challenge of the lack of densely labeled data, we present the XDF-REPA dataset, which is freely available to the public. The dataset contains 19,213 English word utterances by 18 Chinese adults, among which 4,200 audio clips from 9 speakers are densely labeled by 3 linguists with the intended phoneme, the actually uttered phoneme, a score for each phoneme, and an overall score for the word. To reduce differences between annotators, scoring rules combining subjective and objective criteria are defined. To demonstrate the use of the dataset and provide a baseline for other researchers, we develop and describe a prototype RPA system, which adopts a DNN-HMM acoustic model and a variant of Goodness of Pronunciation (GOP) to yield all the corrective feedback RPA requires. Experimental results show error-detection accuracy of 80.1% to 85.1% and actually-uttered-phoneme recognition accuracy of 70.9% to 80.8%, varying with the subset and the linguist.
DOI: 10.1109/O-COCOSDA46868.2019.9041154
Citations: 1
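Goodness of Pronunciation (GOP) scores compare the acoustic evidence for the intended phone against the best competing phone. A sketch of one simple posterior-based formulation (the paper uses its own GOP variant on DNN-HMM outputs; this is only illustrative):

```python
import numpy as np

def gop(frame_log_posteriors, phone_idx):
    """Posterior-based GOP for one phone segment: the mean log-posterior
    of the intended phone minus that of the best-scoring phone over the
    segment's frames. Near 0 means the intended phone is also the most
    probable one; large negative values flag a likely mispronunciation."""
    seg = np.asarray(frame_log_posteriors)   # shape (n_frames, n_phones)
    intended = seg[:, phone_idx].mean()
    best = seg.mean(axis=0).max()
    return intended - best

# 5 frames, 3 phone classes; phone 0 dominates every frame
logp = np.log(np.full((5, 3), [0.7, 0.2, 0.1]))
print(gop(logp, 0))   # ≈ 0.0  (well pronounced)
print(gop(logp, 1))   # ≈ -1.25 (phone 1 scores far below the best phone)
```

Thresholding such scores per phone is the usual route from GOP values to the accept/reject decisions an RPA system reports.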
voisTUTOR corpus: A speech corpus of Indian L2 English learners for pronunciation assessment
Chiranjeevi Yarra, Aparna Srinivasan, Chandana Srinivasa, Ritu Aggarwal, P. Ghosh
This paper describes the voisTUTOR corpus, a pronunciation-assessment corpus of Indian second-language (L2) learners of English. The corpus consists of 26,529 utterances totalling approximately 14 hours. The data were collected from 16 Indian L2 learners with six native languages: Kannada, Telugu, Tamil, Malayalam, Hindi and Gujarati. A total of 1,676 unique stimuli were used for the recording, ranging from single words to multi-word stimuli containing simple, complex and compound sentences. Every utterance carries a rating of overall quality on a scale of 0 to 10. In addition to the overall rating, and unlike existing corpora, a binary decision (0 or 1) indicates the quality of seven factors on which overall pronunciation typically depends: 1) intelligibility, 2) phoneme quality, 3) phoneme mispronunciation, 4) syllable stress quality, 5) intonation quality, 6) correctness of pauses and 7) mother-tongue influence. A spoken-English expert provides the ratings and binary decisions for all utterances. The corpus also includes recordings of all stimuli by a male and a female spoken-English expert. With its factor-wise binary decisions and expert recordings, the voisTUTOR corpus is unique among existing corpora. To the best of our knowledge, no such corpus exists for pronunciation assessment of Indian speakers.
DOI: 10.1109/O-COCOSDA46868.2019.9041162
Citations: 4
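The per-utterance annotation scheme described above, one 0-10 overall score plus seven binary factor decisions, maps naturally onto a small record type. A sketch, with field names that are illustrative rather than the corpus's actual file format:

```python
from dataclasses import dataclass

@dataclass
class PronunciationRating:
    """One expert judgement of one utterance: an overall 0-10 score
    plus seven binary quality decisions."""
    overall: int                   # 0..10
    intelligibility: int           # each factor: 1 = acceptable, 0 = not
    phoneme_quality: int
    phoneme_mispronunciation: int
    syllable_stress: int
    intonation: int
    pause_correctness: int
    mother_tongue_influence: int

    def __post_init__(self):
        if not 0 <= self.overall <= 10:
            raise ValueError("overall rating must be 0-10")
        for name in ("intelligibility", "phoneme_quality",
                     "phoneme_mispronunciation", "syllable_stress",
                     "intonation", "pause_correctness",
                     "mother_tongue_influence"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")

r = PronunciationRating(8, 1, 1, 0, 1, 1, 1, 0)
print(r.overall)  # 8
```

Validating the ranges at construction time keeps downstream analysis code from silently mixing the 0-10 scale with the binary factors.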
The effect of focus on trisyllabic syllable duration in Mandarin
Ziyu Xiong, Q. Lin, Maolin Wang, Zhouyu Chen
In this study, the temporal pattern of trisyllabic sequences in Mandarin is investigated, in particular its interaction with focus. Mandarin has neutral tone syllables, which are metrically weak (W), while non-neutral tone syllables are metrically strong (S). Four types of trisyllabic sequences are investigated: sequences with no neutral tone syllable (SSS), with a neutral tone in final position (SSW), with a neutral tone on the second syllable (SWS), and with neutral tones on both the second and final syllables (SWW). It is found that if a sequence contains a neutral tone syllable, the last non-neutral tone syllable is the longest. Under focus, all syllables in the trisyllabic sequences lengthen: strong syllables lengthen more than weak ones, and among strong syllables, later syllables lengthen more than earlier ones.
DOI: 10.1109/O-COCOSDA46868.2019.9041173
Citations: 0
An acoustic-articulatory database of VCV sequences and words in Toda at different speaking rates
Shankar Narayanan, Aravind Illa, Nayan Anand, Ganesh Sinisetty, Karthick Narayanan, P. Ghosh
We present a database comprising simultaneous acoustic and articulatory recordings of thirty V₁CV₂ nonsense words and forty-two Toda words, recorded with an electromagnetic articulograph and spoken by six Toda speakers (two males and four females) at four speaking rates: slow, normal, fast and very fast. The vowels in V₁CV₂ (V₁ ≠ V₂) come from a set of six vowels, /a/, /e/, /i/, /o/, /u/ and /y/, the last of which is a front rounded vowel in Toda. The consonant in the V₁CV₂ stimuli is /p/ for all recordings. The articulatory data comprise the movements of five articulatory points in the midsagittal plane: upper lip, lower lip, jaw, tongue tip and tongue dorsum. The acoustic and articulatory recordings are made available at 16 kHz and 100 Hz respectively. Vowel and consonant boundaries in the V₁CV₂ stimuli are provided along with the database. Basic acoustic and articulatory analyses of the V₁CV₂ recordings are presented, showing how the acoustic and articulatory spaces, as well as coarticulation, change with speaking rate. The database is suited to a number of research studies, including the effect of speaking rate on the acoustic and articulatory aspects of coarticulation in Toda, the analysis of labial kinematics during consonant production at different speaking rates, and acoustic-articulatory analysis of the front rounded vowel in Toda.
DOI: 10.1109/O-COCOSDA46868.2019.9041190
Citations: 0
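Since the two streams are released at different rates (16 kHz audio, 100 Hz articulatory trajectories), acoustic-articulatory analyses typically resample one stream onto the other's time axis. A minimal linear-interpolation sketch (illustrative; spline interpolation is also common for EMA trajectories):

```python
import numpy as np

def resample_trajectory(traj, fs_in, fs_out):
    """Linearly resample a 1-D articulatory trajectory (e.g. tongue-tip
    position sampled at 100 Hz) onto a new frame rate, so it can be
    aligned with acoustic feature frames."""
    traj = np.asarray(traj, dtype=float)
    t_in = np.arange(len(traj)) / fs_in           # original sample times
    n_out = int(round(t_in[-1] * fs_out)) + 1
    t_out = np.arange(n_out) / fs_out             # target sample times
    return np.interp(t_out, t_in, traj)

# 3 EMA samples at 100 Hz (t = 0, 10, 20 ms) resampled to 200 Hz
print(resample_trajectory([0.0, 1.0, 0.0], 100, 200))  # values 0, 0.5, 1, 0.5, 0
```

Each of the five articulatory points has x and y midsagittal coordinates, so the same call would be applied per channel.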
An acoustic study of affricates produced by L2 English learners in Harbin
Chenyang Zhao, Ai-jun Li, Zhiqiang Li, Ying Tang
The present study focuses on the acquisition of English affricates by L2 learners in Harbin, where a major variety of Mandarin is spoken, and explores possible interference from L1 leading to deviations in L2 learners’ pronunciation. Specifically, features of L2 learners’ productions are compared with those of native speakers using several acoustic parameters, so that differences between the two groups can be identified. The results help determine whether the differences or the similarities between L1 and L2 sounds contribute more to L2 speech acquisition. Independent-samples t-tests and affricate acoustic patterns are used in the comparisons. The results show that English affricates are not fully acquired by L2 learners from Harbin. Specifically, L2 learners’ production of /tʃ/ has a longer duration of frication (DOF) and a weaker plosion than native speakers’, and their production of /dʒ/ is longer and stronger in its frication part than native speakers’. The similar GAP durations of the two groups indicate that the articulatory precision of /tʃ/ and /dʒ/ is well acquired. /tr/ and /dr/ are produced by L2 learners in a longer and more tense manner. According to the analysis, the unsatisfactory acquisition is caused by both similarities and differences between the linguistic features of L1 and L2. The Transfer Theory and the Speech Learning Model (SLM) are adopted to explain the results.
DOI: 10.1109/O-COCOSDA46868.2019.9060844
Citations: 0
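The group comparison described in the abstract above — an independent-samples t-test on acoustic measures such as Duration of Frication — can be sketched as follows. This is a minimal illustration, not the authors' code: the measurement values and the `welch_t` helper are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's independent-samples t statistic and degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch–Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical Duration-of-Frication measurements (seconds) for /tʃ/
l2_dof = [0.112, 0.124, 0.131, 0.118, 0.127, 0.121]
native_dof = [0.095, 0.101, 0.088, 0.104, 0.092, 0.099]

t, df = welch_t(l2_dof, native_dof)
print(f"t = {t:.2f}, df = {df:.1f}")  # positive t: L2 frication is longer on average
```

In practice one would compare the t statistic against the t distribution with `df` degrees of freedom to decide significance.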
Statistical studies on Japanese sonority by using loudness calibration scores 用响度校正分数对日语响度进行统计研究
Takayuki Kagomiya
This study aimed to examine Japanese sonority by using a quantitative method and contribute to Japanese phonetics and phonology. Thus, loudness calibration scores from the NTT-Tohoku University Speech Dataset for a Word Intelligibility Test based on Word Familiarity (FW03) were analyzed. The intensity of each monosyllable sound stored in FW03 was equalized, and thus, perceptual sound levels varied with the difference in the original sound intensity of syllables. To adjust the difference in perceptual sound levels, calibration scores were estimated using a series of psychometric experiments. These scores reflected the difference in sound intensity and perceptual levels and can be considered subjective sonority scores for Japanese monosyllables. The results of the statistical analysis of the scores revealed that the sonority level of Japanese vowels was primarily accounted for its openness. The sonority of consonants was affected by its articulation and voicing, whereas that of monosyllables can be clustered based on the openness of vowels.
{"title":"Statistical studies on Japanese sonority by using loudness calibration scores","authors":"Takayuki Kagomiya","doi":"10.1109/O-COCOSDA46868.2019.9041186","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041186","url":null,"abstract":"This study aimed to examine Japanese sonority by using a quantitative method and contribute to Japanese phonetics and phonology. Thus, loudness calibration scores from the NTT-Tohoku University Speech Dataset for a Word Intelligibility Test based on Word Familiarity (FW03) were analyzed. The intensity of each monosyllable sound stored in FW03 was equalized, and thus, perceptual sound levels varied with the difference in the original sound intensity of syllables. To adjust the difference in perceptual sound levels, calibration scores were estimated using a series of psychometric experiments. These scores reflected the difference in sound intensity and perceptual levels and can be considered subjective sonority scores for Japanese monosyllables. The results of the statistical analysis of the scores revealed that the sonority level of Japanese vowels was primarily accounted for its openness. The sonority of consonants was affected by its articulation and voicing, whereas that of monosyllables can be clustered based on the openness of vowels.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132661479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
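The intensity equalization applied to FW03 syllables, and the calibration score that compensates for the resulting shift in perceptual level, can be illustrated with a small RMS sketch. This is an assumption-laden toy: the waveforms, target level, and helper names are hypothetical, and the study's actual calibration scores were estimated psychometrically rather than from RMS alone.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a waveform."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def equalize(samples, target_rms):
    """Scale a syllable so its RMS matches the shared target level."""
    gain = target_rms / rms(samples)
    return [s * gain for s in samples]

def calibration_db(samples, target_rms):
    """How far (in dB) the original syllable sat above or below the target."""
    return 20 * math.log10(rms(samples) / target_rms)

# Two hypothetical syllable waveforms with different native intensities
loud = [0.5, -0.4, 0.45, -0.5]
quiet = [0.05, -0.04, 0.045, -0.05]
target = 0.2

for name, syl in (("loud", loud), ("quiet", quiet)):
    print(name, round(calibration_db(syl, target), 2), round(rms(equalize(syl, target)), 3))
```

After equalization both syllables share the same physical RMS, so the signed dB offsets are exactly what a listener-level calibration would need to undo.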
An Investigation of Prosodic Features Related to Next Speaker Selection in Spontaneous Japanese Conversation 日语自发会话中选择下一个说话人的韵律特征研究
Y. Ishimoto, Takehiro Teraoka, M. Enomoto
This study aims to reveal prosodic features related to next speaker selection in spontaneous Japanese conversation. A turn-taking system in the field of conversation analysis in sociology was proposed as a systematic basis of speaker-change for conversation. In a previous study, we demonstrated that the prosody of Japanese utterance is relevant to one component of the system, namely, a turn-constructional component. However, it is unclear whether the prosody is also relevant to another component, that is, a turn-allocation component. In this paper, while focusing on next speaker selection as one of the turn-allocation techniques, we investigated relationships between the prosodic features and types of next speaker selection in utterances. The results showed that the difference of the F0s between the penultimate and the final accent phrase in utterance differs whether the current utterance selects the next speaker. Also, the power and the mora duration at the final accent phrase differ whether the current speaker does self-selection in the current utterance. These suggest that the prosodic features are clues for selecting the next speaker in the current utterance.
{"title":"An Investigation of Prosodic Features Related to Next Speaker Selection in Spontaneous Japanese Conversation","authors":"Y. Ishimoto, Takehiro Teraoka, M. Enomoto","doi":"10.1109/O-COCOSDA46868.2019.9041205","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041205","url":null,"abstract":"This study aims to reveal prosodic features related to next speaker selection in spontaneous Japanese conversation. A turn-taking system in the field of conversation analysis in sociology was proposed as a systematic basis of speaker-change for conversation. In a previous study, we demonstrated that the prosody of Japanese utterance is relevant to one component of the system, namely, a turn-constructional component. However, it is unclear whether the prosody is also relevant to another component, that is, a turn-allocation component. In this paper, while focusing on next speaker selection as one of the turn-allocation techniques, we investigated relationships between the prosodic features and types of next speaker selection in utterances. The results showed that the difference of the F0s between the penultimate and the final accent phrase in utterance differs whether the current utterance selects the next speaker. Also, the power and the mora duration at the final accent phrase differ whether the current speaker does self-selection in the current utterance. These suggest that the prosodic features are clues for selecting the next speaker in the current utterance.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131781773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
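The F0 comparison reported above — the difference between the penultimate and the final accent phrase — can be sketched as a small pitch-track computation. The track, the phrase boundaries, and the `phrase_f0_delta` helper are hypothetical; a real analysis would start from an extracted F0 contour aligned with accent-phrase labels.

```python
from statistics import mean

def phrase_f0_delta(f0_track, phrase_bounds):
    """Mean-F0 difference (Hz) between the penultimate and final accent phrase.

    f0_track: list of (time, f0) points; phrase_bounds: list of (start, end)
    intervals, one per accent phrase, in utterance order.
    """
    def phrase_mean(start, end):
        # Keep only voiced frames (f0 > 0) inside the phrase interval
        vals = [f0 for t, f0 in f0_track if start <= t < end and f0 > 0]
        return mean(vals)

    penult = phrase_mean(*phrase_bounds[-2])
    final = phrase_mean(*phrase_bounds[-1])
    return penult - final

# Hypothetical utterance: F0 declines toward the final accent phrase
track = [(0.0, 220), (0.2, 215), (0.4, 200), (0.6, 190), (0.8, 170), (1.0, 160)]
bounds = [(0.0, 0.4), (0.4, 0.8), (0.8, 1.2)]
print(phrase_f0_delta(track, bounds))
```

A positive delta means F0 falls from the penultimate into the final accent phrase; the study asks whether the size of this kind of difference depends on whether the current utterance selects the next speaker.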