Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-87
Li-Fang Lai, J. G. Hell, John M. Lipski
This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analyses were conducted. First, we computed the word error rates between the ground truth and the transcripts generated by DARLA. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and polygon area. We found that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that a more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between polygon area, speech intelligibility, and error rates was found. These results, albeit suggestive, point to some promising directions for improving acoustic modeling in speech recognition.
Title: The Role of Rhythm and Vowel Space in Speech Recognition (Speech Prosody 2022)
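The word error rate mentioned in the abstract above is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name is an assumption for illustration and this is not the DARLA pipeline or the authors' own scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); whitespace tokenisation is an
    assumption, real evaluations usually normalise case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.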
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-54
Laurence Bruggeman, Jenny Yu, A. Cutler
In lexical stress languages, phonemically identical syllables can differ suprasegmentally (in duration, amplitude, F0). Such stress cues allow listeners to speed spoken-word recognition by rejecting mismatching competitors (e.g., unstressed set- in settee rules out stressed set- in setting, setter, settle). Such processing effects have indeed been observed in Spanish, Dutch and German, but English listeners are known to largely ignore stress cues. Dutch and German listeners even outdo English listeners in distinguishing stressed versus unstressed English syllables. This has been attributed to the relative frequency across the stress languages of unstressed syllables with full vowels; in English most unstressed syllables contain schwa instead, and stress cues on full vowels are thus least often informative in this language. If only informativeness matters, would English listeners who encounter situations where such cues would pay off for them (e.g., learning one of those other stress languages) then shift to using stress cues? Likewise, would stress cue users with English as L2, if mainly using English, shift away from using the cues in English? Here we report tests of these two questions, with each receiving a yes answer. We propose that English listeners’ disregard of stress cues is purely pragmatic.
Title: Listener adjustment of stress cue use to fit language vocabulary structure (Speech Prosody 2022)
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-165
Sabine Zerbian, Marlene Böttcher, Yulia Zuban
The study reports on the frequency of occurrence and prosodic realization of adjective-noun phrases in which the adjective is contrastively focused. The productions of bilingual speakers are investigated in both their languages, Heritage Russian and majority English. The data are extracted from a corpus of semi-spontaneous speech which was collected in a comparable way from mono- and bilingual speakers in the U.S. and Russia. Results of the analysis show that there is a language-specific difference in that Russian speakers use ADJ_CF+N combinations less frequently than English speakers despite a reported parallel between the languages in terms of semantics and prosody. Moreover, English and Russian seem to differ in their accentuation pattern in ADJ_CF+N. Speakers of Russian as a Heritage Language frequently use double accents in ADJ_CF+N. Across English and Russian, double accents in ADJ_CF+N occur more frequently in formal than in informal situations, and more frequently in bilingual than in monolingual speakers. The results are discussed in light of the often reported tendency in heritage language grammars to avoid ambiguity.
Title: Prosody of contrastive adjectives in mono- and bilingual speakers of English and Russian: a corpus study (Speech Prosody 2022)
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-143
Nari Rhee, Jianjing Kuang, Aoju Chen
Past work has shown a link between children’s musicality and language learning. But research is still sparse on the effect of musicality on the development of prosody, which uses tonal and temporal cues also relevant for processing music. In particular, the questions of when and how musicality affects the development of various aspects of the prosodic grammar remain largely unanswered. In this study, we investigate the effect of musicality on the development of focus-marking in Mandarin-speaking 4- to 6-year-olds using speech data elicited in a controlled but interactive setting. We have found that the development of focus-marking in Mandarin is only weakly affected by the learner’s musicality. Specifically, children produce adult-like distinctions between on-focus and pre-focus positions, regardless of musicality. A musicality effect is observed in the contrast between on-focus and post-focus positions only in the 4-year-olds. The limited musicality effect on focus-marking is in contrast with our previous work, in which we found that musicality has a salient effect on lexical tone production by children younger than 6 years. Together, the current results suggest that the musicality advantage in the development of prosody depends on aspects of the prosodic grammar and the stage of development.
Title: The effect of musicality on the development of Mandarin prosody (Speech Prosody 2022)
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-91
Àlex Peiró-Lilja, Guillermo Cámbara, M. Farrús, J. Luque
Current text-to-speech (TTS) systems are deep learning-based models capable of learning phonetic articulation and intelligibility, as well as prosodic attributes that model speaking style, providing naturalness to synthetic voices. However, the performance of these models depends heavily on their training hyperparameters and number of iterations. Besides, a conventional loss function does not reflect how well the voice is modeled; thus, we believe a dedicated training assessment for TTS is needed. To this end, we monitor intelligibility and naturalness during training of a Tacotron2 model in a two-step process. First, we report the analysis of a method to track the intelligibility of the TTS in terms of character-level token error rate (TER) by using five different automatic speech recognition (ASR) systems. Second, we extend this work with a recently published TTS naturalness predictor that estimates this aspect in terms of mean opinion scores (MOS). Finally, we unify predicted MOS with TER measurements to return, for each training checkpoint, a single score that we name Full Assessment Score (FAS). Our listeners preferred the checkpoint with maximum FAS over the one with minimum validation loss in both intelligibility and naturalness, with a preference of up to 62.3% for the latter.
Title: Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation (Speech Prosody 2022)
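The abstract above describes merging a predicted MOS and a TER into one score per checkpoint, but it does not state the formula. The sketch below shows one hypothetical way such a combination could work: rescale MOS to [0, 1], turn TER into an accuracy-like term, and average the two. The function names, the equal weighting, and the checkpoint values are all assumptions for illustration; this is not the authors' FAS definition and the numbers are not results from the paper.

```python
def combined_score(mos: float, ter: float,
                   mos_min: float = 1.0, mos_max: float = 5.0) -> float:
    """Hypothetical single score from a predicted MOS and a token error rate.

    Assumptions: MOS lies on the usual 1-5 scale, TER is a fraction that is
    clipped to [0, 1], and both aspects are weighted equally.
    """
    naturalness = (mos - mos_min) / (mos_max - mos_min)
    intelligibility = 1.0 - min(max(ter, 0.0), 1.0)
    return 0.5 * (naturalness + intelligibility)


def best_checkpoint(measurements: dict) -> str:
    """Return the checkpoint name whose (mos, ter) pair scores highest."""
    return max(measurements, key=lambda name: combined_score(*measurements[name]))


# Toy values invented for this example only: name -> (predicted MOS, TER).
toy_measurements = {
    "ckpt_a": (3.0, 0.10),
    "ckpt_b": (4.0, 0.20),
    "ckpt_c": (3.5, 0.50),
}
selected = best_checkpoint(toy_measurements)
```

The point of the sketch is only that checkpoint selection can be driven by a perceptually motivated score instead of validation loss; any real weighting would have to come from the paper itself.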
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-77
Fabián Santiago, Paolo Mairano, Bianca De Paolis
Mainstream L2 phonology models do not include predictions concerning how prosodic structure interacts with the acquisition of segments. However, many studies have shown that the realization of pitch accents or melodic contours associated with prosodic boundaries results in the hyper-articulation of segments at such prosodic boundaries. Our goal is to provide empirical evidence for the positive effects of prosodic prominence on the acquisition of challenging L2 French sounds. The prosodic-phonetic interface has been largely underestimated in second language acquisition. Few studies have investigated whether prosodic prominence may serve as an optimal context for learners to extract information on the acoustic properties of new sounds, which may then be reflected in more accurate productions. In this paper, we report the acoustic patterns of L2 French vowels produced in two different prosodic conditions: (1) in word-internal position (unaccented), (2) at initial and final boundaries of Accentual Phrases and Intonation Phrases. We analyzed oral productions by 40 participants: 10 French native speakers and 30 L2 French learners with L1 Spanish, L1 English and L1 Italian (10 each). We extracted acoustic parameters for ~15k vowels and calculated the degree of acoustic overlap via Pillai scores for the following triplets: /i/~/y/~/u/, /e/~/ø/~/o/. Our results show that prosodic prominence results in a smaller acoustic overlap of some L2 French vowel contrasts.
Title: The effects of prosodic prominence on the acquisition of L2 phonological features (Speech Prosody 2022)
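A Pillai score, as used in the abstract above to quantify acoustic overlap, is the Pillai-Bartlett trace from a MANOVA: values near 0 mean two vowel clouds overlap heavily and values near 1 mean they are well separated. Below is a minimal sketch for two vowel categories measured on two dimensions (for instance F1 and F2). The function name, the restriction to two groups and two dimensions, and the absence of any covariates are assumptions for illustration; the paper's actual model specification is not given in the abstract.

```python
def pillai_score(group_a, group_b):
    """Pillai-Bartlett trace for two groups of 2-D points, e.g. (F1, F2) pairs.

    Computes trace(H @ inv(H + E)), where H is the between-group and E the
    within-group sum-of-squares-and-cross-products matrix.
    """
    groups = [list(group_a), list(group_b)]
    all_points = groups[0] + groups[1]

    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    grand = mean(all_points)
    # 2x2 symmetric matrices stored as [xx, xy, yy].
    h = [0.0, 0.0, 0.0]
    e = [0.0, 0.0, 0.0]
    for points in groups:
        centre = mean(points)
        dx, dy = centre[0] - grand[0], centre[1] - grand[1]
        n = len(points)
        h[0] += n * dx * dx
        h[1] += n * dx * dy
        h[2] += n * dy * dy
        for x, y in points:
            ex, ey = x - centre[0], y - centre[1]
            e[0] += ex * ex
            e[1] += ex * ey
            e[2] += ey * ey

    t_xx, t_xy, t_yy = h[0] + e[0], h[1] + e[1], h[2] + e[2]
    det = t_xx * t_yy - t_xy * t_xy
    if det == 0:
        raise ValueError("total SSCP matrix is singular; need more varied data")
    # Diagonal of H @ inv(T), using the closed-form inverse of a 2x2 matrix.
    return ((h[0] * t_yy - h[1] * t_xy) + (h[2] * t_xx - h[1] * t_xy)) / det
```

For a vowel triplet, one option is to apply this pairwise to each contrast; a multi-group MANOVA would need a more general implementation.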
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-153
Stella Gryllia, K. Marcoux, Kathleen Jepson, A. Arvaniti
We examined individual and task-related variability in the realization of Greek nuclear H* followed by L-L% edge tones. The accents (N = 748) were elicited from native speakers of Greek, producing scripted and unscripted speech, and examined using functional Principal Components Analysis. The accented vowel onset was used for landmark registration to capture accent shape and the alignment of the fall. The resulting PCs were analysed using LMEMs (fixed factors: speaker; task type (scripted, unscripted); accented syllable distance from the analysis window offset, to examine the effects of tonal crowding). Tonal scaling and the steepness of the fall (reflected in PC1 and PC2 respectively) changed by task in ways that differed across speakers. PC3, which captured accent shape, also varied by speaker, reflecting shape differences between a rise-fall and (the expected) plateau-plus-fall realization. Tonal crowding did not have consistent effects. In short, the overall accent shape and the alignment of the accentual fall varied by speaker and task. These results hint at substantial variability in tonal realization. At the same time, they indicate that tonal alignment is not as consistent as is sometimes portrayed and thus it should not be the sole criterion for tone categorization.
Title: The many shapes of H* (Speech Prosody 2022)
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-86
J. Šimko, Adaeze Adigwe, A. Suni, M. Vainio
Prosodic patterns, and linguistic structures in general, are hierarchical in nature, providing efficient means for encoding information in the temporally constrained situations where communicative events occur. However, there are no theoretical frameworks capable of representing the full extent of linguistic behaviour in a cohesive way that could capture the paradigmatic and syntagmatic links between the organizational levels present in everyday speech. Here we propose a novel theoretical and modelling account of perception and production of prosodic patterns in speech communication, derived from the influential Predictive Processing theory of neural implementation of perception and action, which is based on a hierarchical system of generative models producing progressively more detailed probabilistic predictions of future events. The framework provides a conceptualization of the hierarchical organization of speech prosody as well as a principled way of unifying speech perception and production by postulating a single processing hierarchy shared by both modalities. We discuss the possible implications of the theory for prosodic analysis of speech communication, including conversational settings. In addition, we outline a viable computational implementation in the form of a machine learning architecture that can be used as a testbed for generating and evaluating predictions brought forth by the theory.
Title: A Hierarchical Predictive Processing Approach to Modelling Prosody (Speech Prosody 2022)
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-156
Xiaoqing Wang, Wentao Gu
Phonetic accommodation is ubiquitous in cross-linguistic/cultural speech communication. The present study examined the effects of gender and language proficiency on phonetic accommodation in Chinese EFL learners. Five vowels /i/, /u/, /æ/, /ɑ/ and /ʌ/ were embedded in a pair of syllables /hVt/ and /hVd/ to compose ten target words. Three groups of Chinese EFL learners differing in the level of English language proficiency (i.e., elementary, intermediate, and advanced) participated in the experiment. To elicit spontaneous conversational speech, a Diapix task embedded with all ten target words was conducted between each participant and a model talker who was a native speaker of American English. Also, each participant read aloud the ten words before and after the Diapix task. Phonetic accommodation was measured by acoustic analysis of vowel duration and formants. For vowel duration, the higher-proficiency learners converged more than the lower-proficiency ones. For vowel formants, a significant interaction effect was found between gender and language proficiency, i.e., females converged less than males in the advanced learners, whereas females converged more than males in the lower-proficiency learners.
Title: Effects of Gender and Language Proficiency on Phonetic Accommodation in Chinese EFL Learners (Speech Prosody 2022)
Pub Date: 2022-05-23; DOI: 10.21437/speechprosody.2022-135
Anindita Nath, Nigel G. Ward
Dialog markers, such as yeah and okay, generally seem to fit smoothly in the flow of dialog, with prosody that is natural and appropriate for the local context. We here examine this effect, specifically looking at the predictability of the prosody of dialog markers from the prosody of the local context. Using 72 prosodic features representing the local context, we built simple models able to predict the average pitch, log energy, cepstral flux, and harmonic ratio for the 12 most common dialog markers of American English. The models’ predictions accounted for over a third of the variance in the observed prosody, showing a modest but meaningful context dependence.
Title: On the Predictability of the Prosody of Dialog Markers from the Prosody of the Local Context (Speech Prosody 2022)
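The phrase "accounted for over a third of the variance" in the abstract above corresponds to a coefficient of determination (R²) above roughly 0.33. As a minimal sketch of how that quantity is usually computed from observed and predicted values: the function name is an assumption, and the abstract does not say which models or evaluation protocol the authors used.

```python
def variance_explained(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total.

    Illustrative helper (hypothetical name). A value of 1.0 means perfect
    prediction; 0.0 means no better than always predicting the mean.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    if ss_total == 0:
        raise ValueError("observed values have zero variance")
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total
```

To be meaningful for a claim about predictability, the score should be computed on held-out data rather than on the data used to fit the model.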