Pub Date : 2025-12-13 | DOI: 10.1016/j.specom.2025.103345
Jia Ying
This study investigates articulatory-acoustic relationships in Australian English /l/ using simultaneous 3D electromagnetic articulography (EMA) and acoustic recordings from six speakers producing /l/ in onset and coda positions with /æ/ and /ɪ/ vowels. Linear mixed-effects models revealed significant relationships between tongue lateralization and all three formants, with F3 emerging as the primary acoustic correlate of lateralization (β = 0.081, p < 0.001). Acoustic properties of /l/ were strongly influenced by vowel context, with significant vowel-lateralization interactions for F1 and F2, indicating that the acoustic consequences of lateralization vary by vowel environment. Temporal analysis revealed position-dependent timing relationships: F3 preceded articulatory peaks in coda position but showed near-synchronous timing in onset position, while F1 and F2 consistently lagged behind articulatory peaks across all conditions. These findings suggest distinct articulatory-acoustic coupling mechanisms for onset versus coda /l/, with F3 serving as an anticipatory cue in coda position. The results highlight the complex, context-dependent nature of /l/'s articulatory-acoustic relationships and underscore the importance of considering both spectral and temporal dimensions in understanding liquid consonant production.
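For readers who want to see what such an analysis looks like in practice, the following is a minimal sketch of a linear mixed-effects specification of the kind described above, written in Python with statsmodels: formant values modelled from lateralization, vowel, and syllable position, with speakers as random intercepts. The data frame is synthetic and the column names are illustrative assumptions, not the study's actual variables or code.

```python
# Hedged sketch: a mixed-effects model of the type described in the abstract.
# Data are synthetic stand-ins; column names are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 240  # stand-in tokens: 6 speakers x 2 vowels x 2 positions x 10 repetitions
df = pd.DataFrame({
    "speaker": np.repeat([f"S{i}" for i in range(1, 7)], 40),
    "vowel": np.tile(np.repeat(["ae", "I"], 20), 6),
    "position": np.tile(["onset", "coda"], 120),
    "lateralization": rng.normal(0, 1, n),
})
# toy relationship between lateralization and F3, with noise
df["F3"] = 2600 + 80 * df["lateralization"] + rng.normal(0, 60, n)

# fixed effects with a vowel-by-lateralization interaction,
# random intercept per speaker
model = smf.mixedlm("F3 ~ lateralization * vowel + position",
                    data=df, groups=df["speaker"])
print(model.fit().summary())
```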
{"title":"Lateral channel dynamics and F3 modulation: Quantifying para-sagittal articulation in Australian English /l/","authors":"Jia Ying","doi":"10.1016/j.specom.2025.103345","DOIUrl":"10.1016/j.specom.2025.103345","url":null,"abstract":"<div><div>This study investigates articulatory-acoustic relationships in Australian English /l/ using simultaneous 3D electromagnetic articulography (EMA) and acoustic recordings from six speakers producing /l/ in onset and coda positions with /æ/ and /ɪ/ vowels. Linear mixed-effects models revealed significant relationships between tongue lateralization and all three formants, with F3 emerging as the primary acoustic correlate of lateralization (β = 0.081, p < 0.001). Acoustic properties of /l/ were strongly influenced by vowel context, with significant vowel-lateralization interactions for F1 and F2, indicating that the acoustic consequences of lateralization vary by vowel environment. Temporal analysis revealed position-dependent timing relationships: F3 preceded articulatory peaks in coda position but showed near-synchronous timing in onset position, while F1 and F2 consistently lagged behind articulatory peaks across all conditions. These findings suggest distinct articulatory-acoustic coupling mechanisms for onset versus coda /l/, with F3 serving as an anticipatory cue in coda position. The results highlight the complex, context-dependent nature of /l/'s articulatory-acoustic relationships and underscore the importance of considering both spectral and temporal dimensions in understanding liquid consonant production.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103345"},"PeriodicalIF":3.0,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-09 | DOI: 10.1016/j.specom.2025.103342
Himashi Rathnayake , Jesin James , Gianna Leoni , Ake Nicholas , Catherine Watson , Peter Keegan
Speech emotion recognition (SER) is an emerging field in human–computer interaction. Although numerous studies have focused on SER for well-resourced languages, the literature reveals a significant gap in research on low-resource and Indigenous (LRI) languages. This paper presents a comprehensive review of the existing literature on SER in the context of LRI languages, analysing critical factors to consider at each stage of designing an SER system. The review indicates that most studies on SER for LRI languages adopt emotion categories established for well-resourced languages, often assuming the universality of emotions. However, the literature suggests that this approach may be limited due to emotional disparities influenced by cultural variations. Additionally, the review underscores that current SER systems typically lack community-oriented methodologies in the development of technology for LRI languages. The importance of feature selection is highlighted, with evidence suggesting that a combination of traditional machine learning methods and carefully selected acoustic features may offer viable options for SER in these languages. Furthermore, the review identifies a need for further exploration of semi-supervised and unsupervised approaches to enhance SER capabilities in LRI contexts. Overall, current SER systems for LRI languages lag behind state-of-the-art standards due to the lack of resources, indicating that there is still much work to be done in this area.
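As a concrete illustration of the "traditional machine learning with carefully selected acoustic features" recipe the review identifies as viable for low-resource settings, here is a minimal, hypothetical sketch using librosa and scikit-learn. The feature set, labels, and synthetic audio are assumptions for illustration only and do not correspond to any corpus discussed in the review.

```python
# Hedged sketch: hand-crafted acoustic features + a classical classifier for SER.
# The "utterances" below are synthetic stand-ins; real use would load wav files.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def acoustic_features(y, sr=16000):
    """Mean/std-pooled MFCCs plus simple pitch and energy statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [np.nanmean(f0), np.nanstd(f0), float(rms.mean())]])

rng = np.random.default_rng(0)
utterances = [rng.normal(0, 0.1, 16000) for _ in range(20)]  # 1 s of noise each
labels = ["angry", "neutral"] * 10                           # hypothetical labels

X = np.stack([acoustic_features(u) for u in utterances])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, labels)
print(clf.predict(X[:4]))
```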
{"title":"A review on speech emotion recognition for low-resource and Indigenous languages","authors":"Himashi Rathnayake , Jesin James , Gianna Leoni , Ake Nicholas , Catherine Watson , Peter Keegan","doi":"10.1016/j.specom.2025.103342","DOIUrl":"10.1016/j.specom.2025.103342","url":null,"abstract":"<div><div>Speech emotion recognition (SER) is an emerging field in human–computer interaction. Although numerous studies have focused on SER for well-resourced languages, the literature reveals a significant gap in research on low-resource and Indigenous (LRI) languages. This paper presents a comprehensive review of the existing literature on SER in the context of LRI languages, analysing critical factors to consider at each stage of designing an SER system. The review indicates that most studies on SER for LRI languages adopt emotion categories established for well-resourced languages, often assuming the universality of emotions. However, the literature suggests that this approach may be limited due to emotional disparities influenced by cultural variations. Additionally, the review underscores that current SER systems typically lack community-oriented methodologies in the development of technology for LRI languages. The importance of feature selection is highlighted, with evidence suggesting that a combination of traditional machine learning methods and carefully selected acoustic features may offer viable options for SER in these languages. Furthermore, the review identifies a need for further exploration of semi-supervised and unsupervised approaches to enhance SER capabilities in LRI contexts. Overall, current SER systems for LRI languages lag behind state-of-the-art standards due to the lack of resources, indicating that there is still much work to be done in this area.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103342"},"PeriodicalIF":3.0,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-04 | DOI: 10.1016/j.specom.2025.103343
Frank Lihui Tan, Youngah Do
This study investigates the emergence and development of universal phonetic sensitivity during early phonological learning using an unsupervised modeling approach. Autoencoder models were trained on raw acoustic input from English and Mandarin to simulate bottom-up perceptual development, with a focus on phoneme contrast learning. The results show that phoneme-like categories and feature-aligned representational spaces can emerge from context-free acoustic exposure alone. Crucially, the model exhibits universal phonetic sensitivity as a transient developmental stage that varies across contrasts and gradually gives way to language-specific perception—a trajectory that parallels infant perceptual development. Different featural contrasts remain universally discriminable for varying durations over the course of learning. These findings support the view that universal sensitivity is not innately fixed but emerges through learning, and that early phonological development proceeds along a mosaic, feature-dependent trajectory.
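A minimal sketch of the kind of autoencoder setup described above is given below, in PyTorch: acoustic frames are compressed to a low-dimensional code and reconstructed, and the code space is what one would later probe for phoneme-like structure. The frame dimensionality, layer sizes, and training step are illustrative assumptions rather than the authors' configuration.

```python
# Hedged sketch: a frame-level autoencoder over acoustic features.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, n_features=39, bottleneck=16):
        super().__init__()
        # encoder compresses each acoustic frame into a low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, bottleneck),
        )
        # decoder reconstructs the frame from the code
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = FrameAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(256, 39)              # stand-in for real MFCC/filterbank frames
recon, codes = model(frames)
loss = nn.functional.mse_loss(recon, frames)
loss.backward()
opt.step()
# Phoneme-like categories would then be probed by clustering the codes or by
# measuring discriminability of contrasts (e.g. ABX) in the learned code space.
```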
{"title":"Bottom-up modeling of phoneme learning: Universal sensitivity and language-specific transformation","authors":"Frank Lihui Tan, Youngah Do","doi":"10.1016/j.specom.2025.103343","DOIUrl":"10.1016/j.specom.2025.103343","url":null,"abstract":"<div><div>This study investigates the emergence and development of universal phonetic sensitivity during early phonological learning using an unsupervised modeling approach. Autoencoder models were trained on raw acoustic input from English and Mandarin to simulate bottom-up perceptual development, with a focus on phoneme contrast learning. The results show that phoneme-like categories and feature-aligned representational spaces can emerge from context-free acoustic exposure alone. Crucially, the model exhibits universal phonetic sensitivity as a transient developmental stage that varies across contrasts and gradually gives way to language-specific perception—a trajectory that parallels infant perceptual development. Different featural contrasts remain universally discriminable for varying durations over the course of learning. These findings support the view that universal sensitivity is not innately fixed but emerges through learning, and that early phonological development proceeds along a mosaic, feature-dependent trajectory.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103343"},"PeriodicalIF":3.0,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-29 | DOI: 10.1016/j.specom.2025.103331
Dong Yang , Yuki Saito , Takaaki Saeki , Tomoki Koriyama , Wataru Nakata , Detai Xin , Hiroshi Saruwatari
This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related characteristics solely from the phrasing task. In addition, we explore the potential of pre-trained speaker embeddings for unseen speakers through a few-shot adaptation method. Furthermore, we pioneer the application of phoneme-level pre-trained language models to this TTS front-end task, which significantly boosts the accuracy of the phrasing model. Our methods are rigorously assessed through both objective and subjective evaluations, demonstrating their effectiveness.
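To make the speaker-conditioning idea concrete, the following is a hypothetical PyTorch sketch in which per-phoneme encoder states are concatenated with a speaker embedding and classified as break or no-break. The small Transformer here is only a stand-in for the phoneme-level pre-trained language model, and all dimensions and module choices are assumptions, not the paper's system.

```python
# Hedged sketch: speaker-conditioned phrase break prediction at the phoneme level.
import torch
import torch.nn as nn

class SpeakerConditionedPhraser(nn.Module):
    def __init__(self, n_phonemes=100, n_speakers=50, d_model=256, d_spk=64):
        super().__init__()
        # stand-in for a (pre-trained) phoneme-level language model encoder
        self.phoneme_encoder = nn.Sequential(
            nn.Embedding(n_phonemes, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2),
        )
        self.speaker_emb = nn.Embedding(n_speakers, d_spk)
        self.classifier = nn.Linear(d_model + d_spk, 2)  # break vs. no-break per phoneme

    def forward(self, phoneme_ids, speaker_id):
        h = self.phoneme_encoder(phoneme_ids)                        # (B, T, d_model)
        s = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, h.size(1), -1)
        return self.classifier(torch.cat([h, s], dim=-1))            # (B, T, 2)

model = SpeakerConditionedPhraser()
logits = model(torch.randint(0, 100, (2, 20)), torch.tensor([3, 7]))
print(logits.shape)  # torch.Size([2, 20, 2])
```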
{"title":"Speaker-conditioned phrase break prediction for text-to-speech with phoneme-level pre-trained language model","authors":"Dong Yang , Yuki Saito , Takaaki Saeki , Tomoki Koriyama , Wataru Nakata , Detai Xin , Hiroshi Saruwatari","doi":"10.1016/j.specom.2025.103331","DOIUrl":"10.1016/j.specom.2025.103331","url":null,"abstract":"<div><div>This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related characteristics solely from the phrasing task. Besides, we explore the potential of pre-trained speaker embeddings for unseen speakers through a few-shot adaptation method. Furthermore, we pioneer the application of phoneme-level pre-trained language models to this TTS front-end task, which significantly boosts the accuracy of the phrasing model. Our methods are rigorously assessed through both objective and subjective evaluations, demonstrating their effectiveness.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103331"},"PeriodicalIF":3.0,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-27 | DOI: 10.1016/j.specom.2025.103335
Hikaru Yanagida , Yusuke Ijima , Naohiro Tawara
This study aims to identify individual characteristics such as age, gender, personality traits, and values that influence the perception of one’s own recorded voice. While previous studies have shown that the perception of one’s own recorded voice is different from that of others, and that these differences are influenced by individual characteristics, only a limited number of individual characteristics were examined in past research. In our study, we conducted a large-scale subjective experiment with 141 Japanese participants that covered multiple individual characteristics. Participants evaluated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each of the individual characteristics and the voice impressions. Our findings showed that individual characteristics such as the frequency of listening to one’s own recorded voice (which had not been examined in previous studies) influenced the perception of one’s own recorded voice. We further analyzed combinations of multiple individual characteristics, including those that influenced impressions when used individually, to predict impressions of one’s own recorded voice, and found that these impressions were predicted better by a combination of multiple individual characteristics than by any single individual characteristic.
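The single-versus-combined comparison can be sketched as a simple regression exercise. The hypothetical Python snippet below compares cross-validated R² for one characteristic against a combination of characteristics; the data are synthetic and all column names are assumptions for illustration, not the study's variables.

```python
# Hedged sketch: predicting an impression rating of one's own recorded voice
# from a single characteristic vs. a combination of characteristics.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 141  # same number of participants as the experiment; the data are synthetic
df = pd.DataFrame({
    "listening_frequency": rng.integers(0, 5, n),
    "age": rng.integers(20, 60, n),
    "gender_coded": rng.integers(0, 2, n),
    "extraversion": rng.normal(0, 1, n),
    "neuroticism": rng.normal(0, 1, n),
})
# toy impression rating loosely driven by several characteristics
y = (0.4 * df["listening_frequency"] + 0.2 * df["extraversion"]
     - 0.2 * df["neuroticism"] + rng.normal(0, 1, n))

single = ["listening_frequency"]
combined = ["listening_frequency", "age", "gender_coded", "extraversion", "neuroticism"]
for cols in (single, combined):
    r2 = cross_val_score(LinearRegression(), df[cols], y, cv=5, scoring="r2").mean()
    print(cols, round(r2, 3))
```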
{"title":"Effect of individual characteristics on impressions of one’s own recorded voice","authors":"Hikaru Yanagida , Yusuke Ijima , Naohiro Tawara","doi":"10.1016/j.specom.2025.103335","DOIUrl":"10.1016/j.specom.2025.103335","url":null,"abstract":"<div><div>This study aims to identify individual characteristics such as age, gender, personality traits, and values that influence the perception of one’s own recorded voice. While previous studies have shown that the perception of one’s own recorded voice is different from that of others, and that these differences are influenced by individual characteristics, only a limited number of individual characteristics were examined in past research. In our study, we conducted a large-scale subjective experiment with 141 Japanese participants using multiple individual characteristics. Participants evaluated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each of the individual characteristics and the voice impressions. Our findings showed that individual characteristics such as the frequency of listening to one’s own recorded voice (which had not been examined in the previous studies) influenced the perception of one’s own recorded voice. We further analyzed the use of combinations of multiple individual characteristics, including those that influenced impressions in a single use, to predict impressions of one’s own recorded voice and found that they were better predicted by the combination of multiple individual characteristics than by the use of a single individual characteristic.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103335"},"PeriodicalIF":3.0,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-24 | DOI: 10.1016/j.specom.2025.103333
Theo Lepage, Reda Dehak
Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
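As an example of the instance-invariance objectives discussed, the following is a minimal sketch of a SimCLR-style NT-Xent loss applied to speaker projections of two augmented views of the same utterances. It illustrates the general objective only; the encoder, augmentations, and hyperparameters of the systems reviewed here are not represented.

```python
# Hedged sketch: NT-Xent (SimCLR-style) contrastive loss over speaker projections.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.07):
    """z1, z2: (N, D) projections of two augmented views of the same N utterances."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D)
    sim = z @ z.t() / temperature                              # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    # the positive for item i is its other view, at index (i + n) mod 2n
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# toy usage: 8 utterances, 192-dim speaker projections of two augmented views
z_view1, z_view2 = torch.randn(8, 192), torch.randn(8, 192)
print(nt_xent(z_view1, z_view2))
```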
{"title":"Self-Supervised Learning for Speaker Recognition: A study and review","authors":"Theo Lepage, Reda Dehak","doi":"10.1016/j.specom.2025.103333","DOIUrl":"10.1016/j.specom.2025.103333","url":null,"abstract":"<div><div>Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103333"},"PeriodicalIF":3.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-24 | DOI: 10.1016/j.specom.2025.103332
Weijie Lu , Yunfeng Xu , Jintan Gu
Multimodal Dialogue Emotion Recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, researchers have invested considerable effort in integrating feature information across modalities, but detailed analysis of each modality's features remains insufficient, and differences in how strongly each modality influences the recognition results have not been fully considered. To address this problem, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model effectively captures deep inter-modal correlations in multimodal emotion recognition tasks, mitigating the limitations of unimodal representations. We observe that incorporating specific conversational contexts and dynamically allocating weights to each modality not only fully leverages the model’s capabilities but also enables more accurate capture of emotional information embedded in the features. We conducted extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD. Experimental results demonstrate that TIAWFM exhibits significant advantages in dynamically integrating multimodal information, leading to notable improvements in both the accuracy and robustness of emotion recognition.
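A minimal sketch of adaptive weighted fusion is shown below: per-utterance weights over the audio, text, and visual features are predicted from the features themselves and used to form a weighted sum before emotion classification. The gating design, dimensions, and module names are assumptions for illustration and do not reproduce the TIAWFM architecture.

```python
# Hedged sketch: adaptive per-sample weighting of modality features before fusion.
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, d=256, n_modalities=3, n_emotions=7):
        super().__init__()
        self.gate = nn.Linear(n_modalities * d, n_modalities)  # one weight per modality
        self.classifier = nn.Linear(d, n_emotions)

    def forward(self, feats):                     # feats: list of (B, d) modality vectors
        stacked = torch.stack(feats, dim=1)       # (B, M, d)
        w = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)   # (B, M), sums to 1
        fused = (w.unsqueeze(-1) * stacked).sum(dim=1)             # weighted sum, (B, d)
        return self.classifier(fused), w

fusion = AdaptiveWeightedFusion()
audio, text, vision = (torch.randn(4, 256) for _ in range(3))
logits, weights = fusion([audio, text, vision])
print(logits.shape, weights[0])                   # per-sample modality weights
```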
{"title":"Adaptive weighting in a transformer framework for multimodal emotion recognition","authors":"Weijie Lu , Yunfeng Xu , Jintan Gu","doi":"10.1016/j.specom.2025.103332","DOIUrl":"10.1016/j.specom.2025.103332","url":null,"abstract":"<div><div>Multimodal Dialogue Emotion Recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, researchers have invested a lot of effort in the integration of modal feature information, but the detailed analysis of each modal feature information is still insufficient, and the difference in the influence of each modal feature information on the recognition results has not been fully considered. In order to solve this problem, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model effectively captures deep inter-modal correlations in multimodal emotion recognition tasks, mitigating the limitations of unimodal representations. We observe that incorporating specific conversational contexts and dynamically allocating weights to each modality not only fully leverages the model’s capabilities but also enables more accurate capture of emotional information embedded in the features. We conducted extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD. Experimental results demonstrate that TIAWFM exhibits significant advantages in dynamically integrating multimodal information, leading to notable improvements in both the accuracy and robustness of emotion recognition.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103332"},"PeriodicalIF":3.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145594745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-15 | DOI: 10.1016/j.specom.2025.103330
Junrui Ni , Liming Wang , Yang Zhang , Kaizhi Qian , Heting Gao , Mark Hasegawa-Johnson , James Glass , Chang D. Yoo
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20%–23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
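The following is a hedged sketch of masked token-infilling over discretized speech units, the type of objective described above, written in PyTorch. The unit vocabulary, mask rate, and model size are illustrative assumptions, not the authors' configuration, and the joint speech/text training is reduced here to a single masked-prediction step.

```python
# Hedged sketch: masked token-infilling over discrete speech units.
import torch
import torch.nn as nn

VOCAB, MASK_ID, MASK_RATE = 512, 0, 0.3

def mask_tokens(tokens, mask_rate=MASK_RATE, mask_id=MASK_ID):
    """Randomly replace a fraction of unit tokens with a mask symbol."""
    mask = torch.rand(tokens.shape) < mask_rate
    return tokens.masked_fill(mask, mask_id), mask

# tiny infilling model: embed corrupted units, encode, predict the original units
encoder = nn.Sequential(
    nn.Embedding(VOCAB, 128),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(128, nhead=4, batch_first=True), num_layers=2),
    nn.Linear(128, VOCAB),
)

units = torch.randint(1, VOCAB, (8, 100))           # stand-in for discretized speech
corrupted, mask = mask_tokens(units)
logits = encoder(corrupted)                         # (8, 100, VOCAB)
loss = nn.functional.cross_entropy(logits[mask], units[mask])  # loss on masked slots only
loss.backward()
```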
{"title":"Towards unsupervised speech recognition without pronunciation models","authors":"Junrui Ni , Liming Wang , Yang Zhang , Kaizhi Qian , Heting Gao , Mark Hasegawa-Johnson , James Glass , Chang D. Yoo","doi":"10.1016/j.specom.2025.103330","DOIUrl":"10.1016/j.specom.2025.103330","url":null,"abstract":"<div><div>Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20%–23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103330"},"PeriodicalIF":3.0,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-11 | DOI: 10.1016/j.specom.2025.103329
Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse
This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers we extracted vowel formant data and segmentally-based MFCCs of all phonemes. Data were taken from three versions of the corpus: 10 min and 3 min samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3 min samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.
The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and Cllr values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.
While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high dimensional data relative to vowel formants taken at a segment’s midpoint. Misalignment of automated labelling and tracking is thus potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments. A degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment by segment basis, the fused system worked very well. Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.
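Two of the evaluation steps mentioned above, computing an equal error rate (EER) from same- and different-speaker comparison scores and fusing formant- and MFCC-based scores at the score level, can be sketched as follows. The score distributions are synthetic stand-ins, not data from the study, and logistic-regression fusion is only one common choice of fusion method.

```python
# Hedged sketch: EER from comparison scores, plus simple score-level fusion.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def eer(scores, labels):
    """labels: 1 = same-speaker pair, 0 = different-speaker pair."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.argmin(np.abs(fnr - fpr))        # point where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)
labels = np.repeat([1, 0], 500)
formant_scores = np.concatenate([rng.normal(1.5, 1, 500), rng.normal(-1.5, 1, 500)])
mfcc_scores = np.concatenate([rng.normal(1.0, 1, 500), rng.normal(-1.0, 1, 500)])

print("formant EER:", eer(formant_scores, labels))
print("MFCC EER:   ", eer(mfcc_scores, labels))

# simple score-level fusion: learn weights for the two systems' scores
X = np.column_stack([formant_scores, mfcc_scores])
fused = LogisticRegression().fit(X, labels).decision_function(X)
print("fused EER:  ", eer(fused, labels))
```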
{"title":"The discriminative capacity of English segments in forensic speaker comparison","authors":"Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse","doi":"10.1016/j.specom.2025.103329","DOIUrl":"10.1016/j.specom.2025.103329","url":null,"abstract":"<div><div>This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers we extracted vowel formant data and segmentally-based MFCCs of all phonemes. Data were taken from three versions of the corpus: 10 min and 3 min samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3 min samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.</div><div>The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and <em>C</em><sub>llr</sub> values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.</div><div>While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high dimensional data relative to vowel formants taken at a segment’s midpoint. Misalignment of automated labelling and tracking is thus potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments. A degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment by segment basis, the fused system worked very well. 
Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103329"},"PeriodicalIF":3.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-01 | DOI: 10.1016/j.specom.2025.103324
Eija M.A. Aalto , Hana Ben Asker , Lucie Ménard , Walcir Cardoso , Catherine Laporte
Several publications have explored second language (L2) articulation through lingual ultrasound imaging technology. This systematic review and thematic analysis collate and evaluate these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82 % as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed significant variation in session numbers and length, including 16 PICO studies (i.e. study designs with participants, intervention, controls/comparison group, and outcome) where ultrasound biofeedback was compared to auditory feedback and/or control conditions. Data analysis in biofeedback studies often included acoustic or perceptual assessments in addition to, or instead of, ultrasound data analysis. Analysis of the results indicates that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by showing articulatory and coarticulatory patterns, particularly in English sounds like /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to use dynamic movement analysis in biofeedback studies as well, to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.
{"title":"Ultrasound imaging in second language research: Systematic review and thematic analysis","authors":"Eija M.A. Aalto , Hana Ben Asker , Lucie Ménard , Walcir Cardoso , Catherine Laporte","doi":"10.1016/j.specom.2025.103324","DOIUrl":"10.1016/j.specom.2025.103324","url":null,"abstract":"<div><div>Several publications have explored second language (L2) articulation through lingual ultrasound imaging technology. This systematic review and thematic analysis collate and evaluate these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82 % as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed significant variation in session numbers and length, including 16 PICO studies (i.e. study design with participants, intervention, controls/comparison group, outcome) where ultrasound biofeedback was compared to auditory feedback and/or control conditions. Data analysis of biofeedback studies often included acoustic or perceptual assessments in addition or instead of ultrasound data analysis. Analysis of results indicate that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by showing articulatory and coarticulatory patterns, particularly in English sounds like /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to use dynamic movement analysis also in biofeedback studies to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103324"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145474243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}