Speech Prosody 2022: Latest Articles

The Role of Rhythm and Vowel Space in Speech Recognition

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-87

Li-Fang Lai, J. G. Hell, John M. Lipski

This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analyses were conducted. First, we computed the word error rates between the ground truth and the transcripts generated by DARLA. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and the polygon area. We noticed that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that a more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between the polygon area, speech intelligibility, and error rates was found. These results, albeit suggestive, point out some promising directions for improving acoustic modeling in speech recognition.
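The abstract does not say how the word error rates were computed, so the following is only a minimal sketch of the conventional definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper or from DARLA.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which real evaluations usually refine
    (case folding, punctuation stripping, and so on).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.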

Citations: 1

Listener adjustment of stress cue use to fit language vocabulary structure

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-54

Laurence Bruggeman, Jenny Yu, A. Cutler

In lexical stress languages, phonemically identical syllables can differ suprasegmentally (in duration, amplitude, F0). Such stress cues allow listeners to speed spoken-word recognition by rejecting mismatching competitors (e.g., unstressed set- in settee rules out stressed set- in setting, setter, settle). Such processing effects have indeed been observed in Spanish, Dutch and German, but English listeners are known to largely ignore stress cues. Dutch and German listeners even outdo English listeners in distinguishing stressed versus unstressed English syllables. This has been attributed to the relative frequency across the stress languages of unstressed syllables with full vowels; in English most unstressed syllables contain schwa instead, and stress cues on full vowels are thus least often informative in this language. If only informativeness matters, would English listeners who encounter situations where such cues would pay off for them (e.g., learning one of those other stress languages) then shift to using stress cues? Likewise, would stress cue users with English as L2, if mainly using English, shift away from using the cues in English? Here we report tests of these two questions, with each receiving a yes answer. We propose that English listeners' disregard of stress cues is purely pragmatic.

Citations: 0

Prosody of contrastive adjectives in mono- and bilingual speakers of English and Russian: a corpus study

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-165

Sabine Zerbian, Marlene Böttcher, Yulia Zuban

The study reports on the frequency of occurrence and prosodic realization of adjective-noun phrases in which the adjective is contrastively focused. The productions of bilingual speakers are investigated in both their languages, Heritage Russian and majority English. The data are extracted from a corpus of semi-spontaneous speech which was collected in a comparable way from mono- and bilingual speakers in the U.S. and Russia. Results of the analysis show that there is a language-specific difference in that Russian speakers use ADJ_CF+N combinations less frequently than English speakers despite a reported parallel between the languages in terms of semantics and prosody. Moreover, English and Russian seem to differ in their accentuation pattern in ADJ_CF+N. Speakers of Russian as a Heritage Language frequently use double accents in ADJ_CF+N. Across English and Russian, double accents in ADJ_CF+N occur more frequently in formal than in informal situations, and more frequently in bilingual than in monolingual speakers. The results are discussed in light of the often-reported tendency in heritage language grammars to avoid ambiguity.

Citations: 1

The effect of musicality on the development of Mandarin prosody

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-143

Nari Rhee, Jianjing Kuang, Aoju Chen

Past work has shown a link between children's musicality and language learning. But research is still sparse on the effect of musicality on the development of prosody, which uses tonal and temporal cues also relevant for processing music. In particular, the questions of when and how musicality affects the development of various aspects of the prosodic grammar remain largely unanswered. In this study, we investigate the effect of musicality on the development of focus-marking in Mandarin-speaking 4- to 6-year-olds using speech data elicited in a controlled but interactive setting. We have found that the development of focus-marking in Mandarin is only weakly affected by the learner's musicality. Specifically, children produce adult-like distinctions between on-focus and pre-focus positions, regardless of musicality. A musicality effect is observed in the contrast between on-focus and post-focus positions only in the 4-year-olds. The limited musicality effect on focus-marking is in contrast with our previous work, in which we found that musicality has a salient effect on lexical tone production by children younger than 6 years. Together, the current results suggest that the musicality advantage in the development of prosody depends on aspects of the prosodic grammar and the stage of development.

Citations: 0

Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-91

Àlex Peiró-Lilja, Guillermo Cámbara, M. Farrús, J. Luque

Current text-to-speech (TTS) systems are deep learning-based models capable of learning phonetic articulation and intelligibility, as well as prosodic attributes that model speaking style, providing naturalness to synthetic voices. However, the performance of these models depends heavily on their training hyper-parameters and iterations. Besides, a conventional loss function does not reflect correct voice modeling; thus, we believe a dedicated training assessment on TTS is needed. To this end, we monitor intelligibility and naturalness during training of a Tacotron2 model in a 2-step process. First, we report the analysis of a method to follow up the intelligibility of the TTS in terms of character-level token error rate (TER) by using five different automatic speech recognition (ASR) systems. Second, we extend this work with a recently published TTS naturalness predictor that estimates this aspect in terms of mean opinion scores (MOS). Finally, we unify predicted MOS with TER measurements to return, over each training checkpoint, a single score that we name Full Assessment Score (FAS). We report a notable preference of our listeners for the checkpoint with maximum FAS rather than the one with minimum validation loss, both in intelligibility and naturalness, up to 62.3% in the latter.

Citations: 2

The effects of prosodic prominence on the acquisition of L2 phonological features

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-77

Fabián Santiago, Paolo Mairano, Bianca De Paolis

Mainstream L2 phonology models do not include predictions concerning how the prosodic structure interacts with the acquisition of segments. However, many studies have shown that the realization of pitch accents or melodic contours associated with prosodic boundaries results in the hyper-articulation of segments at such prosodic boundaries. Our goal is to provide empirical evidence for the positive effects of prosodic prominence on the acquisition of challenging L2 French sounds. The prosodic-phonetic interface has been largely underestimated in second language acquisition. Few studies have investigated whether prosodic prominence may serve as an optimal context for learners to extract information on the acoustic properties of new sounds, which may then be reflected in more accurate productions. In this paper, we report the acoustic patterns of L2 French vowels produced in two different prosodic conditions: (1) in word-internal position (unaccented), (2) at initial and final boundaries of Accentual Phrases and Intonation Phrases. We analyzed oral productions by 40 participants: 10 French native speakers and 30 L2 French learners with L1 Spanish, L1 English and L1 Italian (10 each). We extracted acoustic parameters for ~15k vowels and calculated the degree of acoustic overlap via Pillai scores for the following triplets: /i/~/y/~/u/, /e/~/ø/~/o/. Our results show that prosodic prominence results in a smaller acoustic overlap of some L2 French vowel contrasts.
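For readers unfamiliar with Pillai scores, the sketch below computes Pillai's trace from a one-way MANOVA on two vowel categories, using the textbook formula trace(H (H + E)^-1), where H is the between-category and E the within-category sum-of-squares-and-cross-products matrix. It assumes the acoustic parameters are two formants (F1, F2) per token, which the abstract does not specify; the function name `pillai_score` and the formant values are invented for illustration only.

```python
import numpy as np


def pillai_score(tokens_a, tokens_b):
    """Pillai's trace for two groups of acoustic measurements.

    Each argument is an (n_tokens, n_features) array, e.g. columns
    F1 and F2 in Hz (an assumption made for this sketch). With two
    groups the value lies between 0 (complete overlap of the category
    means) and 1 (no overlap).
    """
    groups = [np.asarray(tokens_a, dtype=float),
              np.asarray(tokens_b, dtype=float)]
    grand_mean = np.vstack(groups).mean(axis=0)
    n_features = grand_mean.shape[0]

    between = np.zeros((n_features, n_features))  # H matrix
    within = np.zeros((n_features, n_features))   # E matrix
    for group in groups:
        group_mean = group.mean(axis=0)
        shift = group_mean - grand_mean
        between += len(group) * np.outer(shift, shift)
        residuals = group - group_mean
        within += residuals.T @ residuals

    # Pillai's trace: trace(H (H + E)^-1)
    return float(np.trace(between @ np.linalg.inv(between + within)))


if __name__ == "__main__":
    # Made-up F1/F2 tokens for two well-separated vowel categories.
    front = [[300, 2200], [320, 2250], [280, 2300], [310, 2150]]
    back = [[310, 800], [290, 850], [330, 900], [300, 750]]
    print(pillai_score(front, back))
```

A lower score for an L2 contrast than for the same contrast in native speech would indicate more acoustic overlap, which is the sense in which the abstract uses the measure.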

Citations: 0

The many shapes of H*

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-153

Stella Gryllia, K. Marcoux, Kathleen Jepson, A. Arvaniti

We examined individual and task-related variability in the realization of Greek nuclear H* followed by L-L% edge tones. The accents (N = 748) were elicited from native speakers of Greek, producing scripted and unscripted speech, and examined using functional Principal Components Analysis. The accented vowel onset was used for landmark registration to capture accent shape and the alignment of the fall. The resulting PCs were analysed using LMEMs (fixed factors: speaker; task type (scripted, unscripted); accented syllable distance from the analysis window offset, to examine the effects of tonal crowding). Tonal scaling and the steepness of the fall (reflected in PC1 and PC2 respectively) changed by task in ways that differed across speakers. PC3, which captured accent shape, also varied by speaker, reflecting shape differences between a rise-fall and (the expected) plateau-plus-fall realization. Tonal crowding did not have consistent effects. In short, the overall accent shape and the alignment of the accentual fall varied by speaker and task. These results hint at substantial variability in tonal realization. At the same time, they indicate that tonal alignment is not as consistent as is sometimes portrayed and thus it should not be the sole criterion for tone categorization.
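As a rough illustration of how principal components can separate scaling from shape in pitch contours, the sketch below runs an ordinary PCA on contours that have already been sampled at the same normalized time points. This is a discretized stand-in for functional PCA, not the authors' pipeline: it omits the smoothing and landmark registration described in the abstract, and the function name `contour_pca` and the synthetic contours are assumptions made for the example.

```python
import numpy as np


def contour_pca(contours, n_components=3):
    """PCA on equally sampled F0 contours (rows = tokens, cols = time points).

    Returns the mean contour, the principal component curves, the
    per-token scores on those curves, and the proportion of variance
    each component explains. The contours must not all be identical.
    """
    data = np.asarray(contours, dtype=float)
    mean_curve = data.mean(axis=0)
    centered = data - mean_curve

    # SVD of the centered data: rows of vt are the component curves.
    u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(singular_values))
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    scores = u[:, :k] * singular_values[:k]
    return mean_curve, vt[:k], scores, explained[:k]


if __name__ == "__main__":
    # Synthetic rise-fall contours that differ only in overall height,
    # so a single component should capture nearly all the variance.
    time = np.linspace(0.0, 1.0, 20)
    base = np.sin(np.pi * time)
    contours = [base + offset for offset in (-2.0, -1.0, 0.0, 1.0, 2.0)]
    mean_curve, components, scores, explained = contour_pca(contours)
    print(explained)
```

In an analysis like the one described, the per-token scores would then serve as dependent variables in mixed-effects models, one model per component.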

Citations: 1

A Hierarchical Predictive Processing Approach to Modelling Prosody

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-86

J. Šimko, Adaeze Adigwe, A. Suni, M. Vainio

Prosodic patterns, and linguistic structures in general, are hierarchical in nature, providing efficient means for encoding information in temporally constrained situations where communicative events occur. However, there are no theoretical frameworks that are capable of representing the full extent of linguistic behaviour in a cohesive way that could capture the paradigmatic and syntagmatic links between the organizational levels present in everyday speech. Here we propose a novel theoretical and modelling account of perception and production of prosodic patterns in speech communication, derived from the influential Predictive Processing theory of neural implementation of perception and action based on a hierarchical system of generative models producing progressively more detailed probabilistic predictions of future events. The framework provides a conceptualization of the hierarchical organization of speech prosody as well as a principled way of unifying speech perception and production by postulating a single processing hierarchy shared by both modalities. We discuss the possible implications of the theory for prosodic analysis of speech communication, including conversational settings. In addition, we outline a viable computational implementation in the form of a machine learning architecture that can be used as a testbed for generating and evaluating predictions brought forth by the theory.

Citations: 1

Effects of Gender and Language Proficiency on Phonetic Accommodation in Chinese EFL Learners

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-156

Xiaoqing Wang, Wentao Gu

Phonetic accommodation is ubiquitous in cross-linguistic/cultural speech communication. The present study examined the effects of gender and language proficiency on phonetic accommodation in Chinese EFL learners. Five vowels /i/, /u/, /æ/, /ɑ/ and /ʌ/ were embedded in a pair of syllables /hVt/ and /hVd/ to compose ten target words. Three groups of Chinese EFL learners differing in the level of English language proficiency (i.e., elementary, intermediate, and advanced) participated in the experiment. To elicit spontaneous conversational speech, a Diapix task embedded with all ten target words was conducted between each participant and a model talker who was a native speaker of American English. Also, each participant read aloud the ten words before and after the Diapix task. Phonetic accommodation was measured by acoustic analysis of vowel duration and formants. For vowel duration, the higher-proficiency learners converged more than the lower-proficiency ones. For vowel formants, a significant interaction effect was found between gender and language proficiency, i.e., females converged less than males in the advanced learners, whereas females converged more than males in the lower-proficiency learners.
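
The abstract does not state how convergence was quantified. One common way to express it, shown here only as an illustrative sketch, is a difference-in-distance score: the learner's distance from the model talker before the task minus the distance after it, so that a positive value indicates convergence. The function names and all numeric values below are hypothetical and are not taken from the study.

```python
from math import hypot


def duration_did(pre_ms, post_ms, model_ms):
    """Difference in distance for vowel duration (milliseconds).

    Positive: the learner moved toward the model talker (convergence).
    Negative: the learner moved away (divergence).
    """
    return abs(pre_ms - model_ms) - abs(post_ms - model_ms)


def formant_did(pre, post, model):
    """Same idea in the F1-F2 plane, using Euclidean distance.

    Each argument is an (F1, F2) pair in Hz.
    """
    dist_pre = hypot(pre[0] - model[0], pre[1] - model[1])
    dist_post = hypot(post[0] - model[0], post[1] - model[1])
    return dist_pre - dist_post


# Hypothetical measurements for one learner and one vowel.
print(duration_did(pre_ms=180, post_ms=150, model_ms=140))
print(formant_did(pre=(700, 1700), post=(730, 1760), model=(760, 1800)))
```

Scores of this kind could then be compared across gender and proficiency groups; in practice, formants would usually be normalized across speakers before distances are computed.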



Citations: 0

On the Predictability of the Prosody of Dialog Markers from the Prosody of the Local Context

Speech Prosody 2022

Pub Date : 2022-05-23 DOI: 10.21437/speechprosody.2022-135

Anindita Nath, Nigel G. Ward

Dialog markers, such as yeah and okay, generally seem to fit smoothly into the flow of dialog, with prosody that is natural and appropriate for the local context. Here we examine this effect, specifically the predictability of the prosody of dialog markers from the prosody of the local context. Using 72 prosodic features representing the local context, we built simple models able to predict the average pitch, log energy, cepstral flux, and harmonic ratio for the 12 most common dialog markers of American English. The models' predictions accounted for over a third of the variance in the observed prosody, showing a modest but meaningful context dependence.
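
"Accounted for over a third of the variance" can be read as a coefficient of determination (R²) above 1/3, although the abstract does not name the exact statistic. The sketch below assumes that reading and shows how such a score is computed from observed and predicted values of one prosodic feature. The function name and the toy numbers are hypothetical and do not come from the paper.

```python
def r_squared(observed, predicted):
    """Proportion of variance in `observed` explained by `predicted`.

    1.0 means perfect prediction; 0.0 means no better than always
    predicting the mean of the observed values.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy values standing in for, e.g., normalized average pitch of a dialog marker.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
print(r_squared(observed, predicted))
```

Under this reading, the reported result corresponds to a score above roughly 0.33 for models predicting each of the four prosodic features from the 72 context features.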


Citations: 1