Pub Date: 2024-09-10; DOI: 10.1016/j.specom.2024.103139
Yuhang Xue, Ning Chen, Yixin Luo, Hongqing Zhu, Zhiying Zhu
One-shot voice conversion (VC) has attracted increasing attention because of its broad prospects for practical application. The task hinges on two things: the representation ability of the speech features and the model’s generalization. This paper proposes CLESSR-VC, a model that enhances pre-trained self-supervised learning (SSL) representations through contrastive learning for one-shot VC. First, to ensure the model’s generalization, SSL features from the 23rd and 9th layers of the pre-trained WavLM are used to extract the content embedding and the SSL speaker embedding, respectively. Then, conventional mel-spectrogram acoustic features and contrastive learning are introduced to enhance the representation ability of the speech features. Specifically, contrastive learning combined with pitch-shift augmentation is applied to disentangle content information from the SSL features accurately, and mel-spectrograms are used to extract a mel speaker embedding. AM-Softmax and cross-architecture contrastive learning are applied between the SSL and mel speaker embeddings to obtain a fused speaker embedding that helps improve speech quality and speaker similarity. Both objective and subjective evaluation results on the VCTK corpus confirm that the proposed VC model performs strongly with few trainable parameters.
CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion. Yuhang Xue, Ning Chen, Yixin Luo, Hongqing Zhu, Zhiying Zhu. Speech Communication, vol. 165, Article 103139. Published 2024-09-10. DOI: 10.1016/j.specom.2024.103139
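The CLESSR-VC abstract names AM-Softmax as one of the objectives applied to the speaker embeddings. As a rough illustration of what that loss computes, here is a minimal sketch for a single embedding. It follows the standard additive-margin formulation (subtract a margin from the target-class cosine, then scale), not the paper's actual implementation. The function name and the `scale` and `margin` defaults are assumptions chosen for illustration.

```python
import math


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive-margin softmax loss for one embedding (illustrative sketch).

    embedding:     list of floats, the speaker embedding
    class_weights: list of weight vectors, one per speaker class
    target:        index of the true speaker class
    scale, margin: hypothetical defaults, not taken from the paper
    """
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    unit_embedding = normalize(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [
        sum(a * b for a, b in zip(unit_embedding, normalize(weights)))
        for weights in class_weights
    ]
    # The margin is subtracted only from the target-class cosine.
    logits = [
        scale * (cosine - margin if index == target else cosine)
        for index, cosine in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp, then cross-entropy for the target.
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]
```

The margin makes the target class harder to satisfy: an embedding sitting exactly between two classes is penalized more with the margin than without it, which pushes speaker classes further apart on the unit sphere.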
Pub Date: 2024-09-02; DOI: 10.1016/j.specom.2024.103131
Jiasong Wu, Xuan Li, Taotao Li, Fanman Meng, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu
Previous audio-visual speech separation methods synchronize the speaker's facial movement and speech in the video to self-supervise the speech separation. In this paper, we propose a model to solve the speech separation problem assisted by both face and sign language, which we call the extended speech separation problem. We design a general deep learning network to learn the combination of three modalities, audio, face, and sign language information, to solve the speech separation problem better. We introduce a large-scale dataset named the Chinese Sign Language News Speech (CSLNSpeech) dataset to train the model, in which three modalities coexist: audio, face, and sign language. Experimental results show that the proposed model performs better and is more robust than the usual audio-visual system. In addition, the sign language modality can also be used alone to supervise speech separation tasks, and introducing sign language helps hearing-impaired people learn and communicate. Last, our model is a general speech separation framework and can achieve very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech
CSLNSpeech: Solving the extended speech separation problem with the help of Chinese sign language. Jiasong Wu, Xuan Li, Taotao Li, Fanman Meng, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu. Speech Communication, vol. 165, Article 103131. Published 2024-09-02. DOI: 10.1016/j.specom.2024.103131
Non-intrusive speech quality predictors evaluate speech quality without the use of a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective targets and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset, and used to predict the intrusive ViSQOL score. The evaluation is done on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora, and common voice over IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution, and overall good generalization to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that a pervasiveness of ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.
Comparing neural network architectures for non-intrusive speech quality prediction. Leif Førland Schill, Tobias Piechowiak, Clément Laroche, Pejman Mowlaee. Speech Communication, vol. 165, Article 103123. Published 2024-08-30. DOI: 10.1016/j.specom.2024.103123. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0167639324000943/pdfft?md5=5812564c5b5fd37eb77c86b9c56fb655&pid=1-s2.0-S0167639324000943-main.pdf
Pub Date: 2024-08-10; DOI: 10.1016/j.specom.2024.103112
Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.
This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels.
To evaluate the effectiveness of the synthesized training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a relative Word Error Rate (WER) improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthricness and similarity of the synthesized speech. It shows that the perceived dysarthricness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthria. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
Accurate synthesis of dysarthric Speech for ASR data augmentation. Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry. Speech Communication, vol. 164, Article 103112. Published 2024-08-10. DOI: 10.1016/j.specom.2024.103112
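The dysarthric-speech abstract reports its gains as a relative WER improvement. For readers less familiar with the metric, the sketch below shows how WER is conventionally computed (word-level edit distance divided by reference length) and how a relative improvement is derived from two WER values. The function names are illustrative, and the numbers in the usage note are invented to show the arithmetic. They are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


def relative_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer
```

With a hypothetical baseline WER of 0.50 and a hypothetical augmented-system WER of 0.439, `relative_improvement(0.50, 0.439)` is about 0.122, i.e. a 12.2% relative improvement, even though the absolute drop is only 6.1 percentage points.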
Pub Date: 2024-08-08; DOI: 10.1016/j.specom.2024.103122
Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xinrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Ruibo Fu
Fake audio detection is a growing concern, and some relevant datasets have been designed for research. However, there is no standard public Chinese dataset under complex conditions. In this paper, we aim to fill the gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for adding noise at five different signal-to-noise ratios, and six codecs are considered for audio transcoding (format conversion). The CFAD dataset can be used not only for fake audio detection but also for identifying the algorithms behind fake utterances for audio forensics. Baseline results are presented with analysis. The results show that generalizable fake audio detection remains challenging. The CFAD dataset is publicly available.
CFAD: A Chinese dataset for fake audio detection. Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xinrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Ruibo Fu. Speech Communication, vol. 164, Article 103122. Published 2024-08-08. DOI: 10.1016/j.specom.2024.103122
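The CFAD abstract describes adding noise to utterances at several signal-to-noise ratios. The abstract does not list the SNR values or the mixing procedure, so the following is only a generic sketch of the usual approach: scale the noise so that the speech-to-noise power ratio matches a target in decibels, then add it to the speech. The function name `mix_at_snr` is an assumption.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    speech, noise: equal-length lists of float samples
    snr_db:        target signal-to-noise ratio in decibels
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")

    # SNR_dB = 10 * log10(speech_power / scaled_noise_power), solved for the
    # noise power that yields the target SNR.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling this once per target SNR gives a set of progressively noisier copies of the same utterance. Lower `snr_db` values produce louder noise relative to the speech.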
Pub Date: 2024-07-31; DOI: 10.1016/j.specom.2024.103113
Georgios Karakasidis, Mikko Kurimo, Peter Bell, Tamás Grósz
Traditionally, teaching a human and training a Machine Learning (ML) model have been quite different, but organized and structured learning can enable faster and better understanding of the underlying concepts. For example, when humans learn to speak, they first learn how to utter basic phones and then slowly move towards more complex structures such as words and sentences. Motivated by this observation, researchers have started to adapt this approach for training ML models. Since the main concept, the gradual increase in difficulty, resembles the notion of the curriculum in education, the methodology became known as Curriculum Learning (CL). In this work, we design and test new CL approaches to train Automatic Speech Recognition systems, specifically focusing on the so-called end-to-end models. These models consist of a single, large-scale neural network that performs the recognition task, in contrast to the traditional way of having several specialized components focusing on different subtasks (e.g., acoustic and language modeling). We demonstrate that end-to-end models can achieve better performance if they are provided with an organized training set consisting of examples that exhibit an increasing level of difficulty. To impose structure on the training set and to define the notion of an easy example, we explored multiple solutions that use either external, static scoring methods or incorporate feedback from the model itself. In addition, we examined the effect of pacing functions that control how much data is presented to the network during each training epoch. Our proposed curriculum learning strategies were tested on the task of speech recognition on two data sets, one containing spontaneous Finnish speech where volunteers were asked to speak about a given topic, and one containing planned English speech. Empirical results showed that a good curriculum strategy can yield performance improvements and speed up convergence. After a given number of epochs, our best strategy reduced the test-set word error rate by 5.6% and 3.4% for the Finnish and English data sets, respectively.
Comparison and analysis of new curriculum criteria for end-to-end ASR. Georgios Karakasidis, Mikko Kurimo, Peter Bell, Tamás Grósz. Speech Communication, vol. 163, Article 103113. Published 2024-07-31. DOI: 10.1016/j.specom.2024.103113. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0167639324000840/pdfft?md5=60eaa8c29b9e0afde3f299e6bfeb1d10&pid=1-s2.0-S0167639324000840-main.pdf
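The curriculum-learning abstract combines two ingredients: a scoring method that ranks training examples from easy to hard, and a pacing function that decides how much of the ranked data the network sees in each epoch. The sketch below illustrates that combination in its simplest form. The linear pacing schedule, the `start_fraction` default, and the function names are assumptions made for illustration. The paper's own criteria and pacing functions are not given in the abstract.

```python
def linear_pacing(epoch, total_epochs, start_fraction=0.2):
    """Fraction of the training set made available at a 0-indexed epoch.

    Grows linearly from start_fraction at the first epoch to 1.0 at the last.
    """
    if total_epochs <= 1:
        return 1.0
    progress = epoch / (total_epochs - 1)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress)


def curriculum_subset(examples, difficulty, epoch, total_epochs, start_fraction=0.2):
    """Return the easiest examples allowed by the pacing function this epoch.

    difficulty: callable mapping an example to a score (lower means easier).
    """
    ordered = sorted(examples, key=difficulty)
    fraction = linear_pacing(epoch, total_epochs, start_fraction)
    count = max(1, round(fraction * len(ordered)))
    return ordered[:count]
```

As a usage example, transcript length in words can stand in for a static, external difficulty score: `curriculum_subset(transcripts, lambda t: len(t.split()), epoch, total_epochs)` starts with the shortest utterances and admits longer ones as training proceeds. A model-feedback criterion would instead recompute `difficulty` from the current model's loss on each example.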
Pub Date: 2024-07-31; DOI: 10.1016/j.specom.2024.103121
Weiyi Kang, Yi Xu
Recent research has shown evidence based on a minimal contrast paradigm that consonants and vowels are articulatorily synchronized at the onset of the syllable. What remains less clear is the laryngeal dimension of the syllable, for which evidence of tone synchrony with the consonant-vowel syllable has been circumstantial. The present study assesses the precise tone-vowel alignment in Mandarin Chinese by applying the minimal contrast paradigm. The vowel onset is determined by detecting divergence points of F2 trajectories between a pair of disyllabic sequences with two contrasting vowels, and the onsets of tones are determined by detecting divergence points of f0 trajectories in contrasting disyllabic tone pairs, using generalized additive mixed models (GAMMs). The alignment of the divergence-determined vowel and tone onsets is then evaluated with linear mixed effect models (LMEMs) and their synchrony is validated with Bayes factors. The results indicate that tone and vowel onsets are fully synchronized. There is therefore evidence for strict alignment of consonant, vowel and tone as hypothesized in the synchronization model of the syllable. Also, with the newly established tone onset, the previously reported ‘anticipatory raising’ effect of tone now appears to occur within rather than before the articulatory syllable. Implications of these findings will be discussed.
Tone-syllable synchrony in Mandarin: New evidence and implications. Weiyi Kang, Yi Xu. Speech Communication, vol. 163, Article 103121. Published 2024-07-31. DOI: 10.1016/j.specom.2024.103121. Open access PDF: https://www.sciencedirect.com/science/article/pii/S016763932400092X/pdfft?md5=d240d5edd58b402ead4372ec1ec2baa9&pid=1-s2.0-S016763932400092X-main.pdf
This paper provides a structured examination of Arabic Automatic Speech Recognition (ASR), focusing on the complexity posed by the language’s diverse forms and dialectal variations. We first explore the forms of the Arabic language, delineating the challenges encountered with Dialectal Arabic, including code-switching, non-standardized orthography, and the resulting scarcity of large annotated datasets. Subsequently, we survey the landscape of Arabic resources, distinguishing between Modern Standard Arabic (MSA) and Dialectal Arabic (DA) speech resources and highlighting the disparities in available data between these two categories. Finally, we analyze both traditional and modern approaches in Arabic ASR, assessing their effectiveness in addressing the unique challenges inherent to the language. Through this comprehensive examination, we aim to provide insights into the current state and future directions of Arabic ASR research and development.
Arabic Automatic Speech Recognition: Challenges and Progress. Fatma Zahra Besdouri, Inès Zribi, Lamia Hadrich Belguith. Speech Communication, vol. 163, Article 103110. Published 2024-07-31. DOI: 10.1016/j.specom.2024.103110
Pub Date : 2024-07-30 DOI: 10.1016/j.specom.2024.103111
Xiaohua Yu , Sunghye Cho , Yong-cheol Lee
This study examined the vowel merger between the two vowels /e/ and /ɛ/ in Yanbian Korean. This sound change has already spread to Seoul Korean, particularly among speakers born after the 1970s. The aim of this study was to determine whether close exposure to Seoul Korean speakers leads to the neutralization of the distinction between the two vowels /e/ and /ɛ/. We recruited 20 Yanbian Korean speakers and asked them about their frequency of exposure to Seoul Korean. The exposure level of each participant was also recorded using a Likert scale. The results revealed that speakers with limited in-person interactions with Seoul Korean speakers exhibited distinct vowel productions within the vowel space. In contrast, those with frequent in-person interactions with Seoul Korean speakers tended to neutralize the two vowels, displaying considerably overlapping patterns in the vowel space. The relationship between the level of exposure to Seoul Korean and speakers’ vowel production was statistically confirmed by a linear regression analysis. Based on the results of this study, we speculate that the sound change in Yanbian Korean may become more widespread as Yanbian Korean speakers are increasingly exposed to Seoul Korean.
Title: Yanbian Korean speakers tend to merge /e/ and /ɛ/ when exposed to Seoul Korean. Journal: Speech Communication, vol. 164, Article 103111. Publication date: 2024-07-30. DOI: 10.1016/j.specom.2024.103111
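The regression reported in the Yanbian Korean abstract above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the separation between /e/ and /ɛ/ is measured as the Euclidean distance between mean (F1, F2) values and that exposure is scored on a 1 to 5 scale; the abstract specifies neither. The speaker values below are hypothetical and are not data from the study.

```python
import math

def vowel_distance(e_formants, eh_formants):
    """Euclidean distance in Hz between two (F1, F2) points (assumed measure)."""
    return math.dist(e_formants, eh_formants)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical speakers: (exposure score, mean /e/ formants, mean /ɛ/ formants).
speakers = [
    (1, (450, 2100), (600, 1900)),
    (2, (460, 2080), (580, 1920)),
    (3, (470, 2050), (560, 1950)),
    (4, (480, 2030), (530, 1990)),
    (5, (490, 2010), (510, 2000)),
]

exposure = [s[0] for s in speakers]
distance = [vowel_distance(s[1], s[2]) for s in speakers]
slope, intercept = fit_line(exposure, distance)
print(f"slope = {slope:.1f} Hz per scale point, intercept = {intercept:.1f} Hz")
```

A negative slope in such a fit would correspond to the pattern the abstract describes: the two vowels move closer together as exposure increases. A real analysis would also report a significance test and would likely normalize formants across speakers.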
Pub Date : 2024-07-26 DOI: 10.1016/j.specom.2024.103107
Paola Zanchi , Alessandra Provera , Gaia Silibello , Paola Francesca Ajmone , Elena Altamore , Faustina Lalatta , Maria Antonella Costantino , Paola Giovanna Vizziello , Laura Zampini
Although language delays are common in children with sex chromosome trisomies [SCT], no studies have analysed their prosodic abilities. Considering the importance of prosody in communication, this exploratory study aims to analyse the prosodic features of the narratives of 4-year-old children with SCT.
Participants included 22 children with SCT and 22 typically developing [TD] children. The Narrative Competence Task was administered to elicit the child's narrative. Each utterance was prosodically analysed considering pitch and timing variables.
Considering pitch, the only difference was the number of movements since the utterances of children with SCT were characterised by a lower speech modulation. However, considering the timing variables, children with SCT produced a faster speech rate and a shorter final syllable duration than TD children.
Since both speech modulation and duration measures have important syntactic and pragmatic functions, further investigations should deeply analyse the prosodic skills of children with SCT in interaction with syntax and pragmatics.
Title: Prosody in narratives: An exploratory study with children with sex chromosomes trisomies. Journal: Speech Communication, vol. 163, Article 103107. Publication date: 2024-07-26. DOI: 10.1016/j.specom.2024.103107. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167639324000797/pdfft?md5=0db7a9636fbd49fbec0c9533ae5f4537&pid=1-s2.0-S0167639324000797-main.pdf