
Speech Communication: Latest Publications

Comparison and analysis of new curriculum criteria for end-to-end ASR
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-31 | DOI: 10.1016/j.specom.2024.103113

Traditionally, teaching a human and a Machine Learning (ML) model is quite different, but organized and structured learning can enable faster and better understanding of the underlying concepts. For example, when humans learn to speak, they first learn how to utter basic phones and then slowly move towards more complex structures such as words and sentences. Motivated by this observation, researchers have started to adapt this approach for training ML models. Since the main concept, the gradual increase in difficulty, resembles the notion of the curriculum in education, the methodology became known as Curriculum Learning (CL). In this work, we design and test new CL approaches to train Automatic Speech Recognition systems, specifically focusing on the so-called end-to-end models. These models consist of a single, large-scale neural network that performs the recognition task, in contrast to the traditional way of having several specialized components focusing on different subtasks (e.g., acoustic and language modeling). We demonstrate that end-to-end models can achieve better performance if they are provided with an organized training set consisting of examples that exhibit an increasing level of difficulty. To impose structure on the training set and to define the notion of an easy example, we explored multiple solutions that use either external, static scoring methods or incorporate feedback from the model itself. In addition, we examined the effect of pacing functions that control how much data is presented to the network during each training epoch. Our proposed curriculum learning strategies were tested on the task of speech recognition on two data sets, one containing spontaneous Finnish speech where volunteers were asked to speak about a given topic, and one containing planned English speech. Empirical results showed that a good curriculum strategy can yield performance improvements and speed up convergence. After a given number of epochs, our best strategy achieved a 5.6% and 3.4% decrease in terms of test set word error rate for the Finnish and English data sets, respectively.
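
A minimal, illustrative Python sketch of the two ingredients described above, assuming a static external difficulty score and a linear pacing schedule; the function names, the schedule, and the toy difficulty measure are assumptions for illustration, not the authors' implementation.

```python
import random

def linear_pacing(epoch, total_epochs, start_frac=0.2):
    """Fraction of the difficulty-sorted training set visible at a given epoch."""
    frac = start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1)
    return min(1.0, frac)

def curriculum_batches(examples, difficulty, epoch, total_epochs, batch_size=8):
    """Yield batches drawn from the easiest examples first.

    `difficulty` is any external, static score (lower = easier), e.g. utterance
    length or the loss of a small scorer model.
    """
    ordered = sorted(examples, key=difficulty)
    n_visible = max(batch_size, int(len(ordered) * linear_pacing(epoch, total_epochs)))
    visible = ordered[:n_visible]
    random.shuffle(visible)                      # shuffle inside the visible subset
    for i in range(0, len(visible), batch_size):
        yield visible[i:i + batch_size]

# toy usage: difficulty = number of words in the transcript
utterances = ["hi", "how are you", "speech recognition is fun",
              "a rather long spontaneous sentence about a given topic"]
for epoch in range(3):
    for batch in curriculum_batches(utterances, lambda u: len(u.split()),
                                    epoch, total_epochs=3, batch_size=2):
        pass  # the ASR model would be trained on `batch` here
```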

Citations: 0
Tone-syllable synchrony in Mandarin: New evidence and implications
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-31 | DOI: 10.1016/j.specom.2024.103121

Recent research has shown evidence based on a minimal contrast paradigm that consonants and vowels are articulatorily synchronized at the onset of the syllable. What remains less clear is the laryngeal dimension of the syllable, for which evidence of tone synchrony with the consonant-vowel syllable has been circumstantial. The present study assesses the precise tone-vowel alignment in Mandarin Chinese by applying the minimal contrast paradigm. The vowel onset is determined by detecting divergence points of F2 trajectories between a pair of disyllabic sequences with two contrasting vowels, and the onsets of tones are determined by detecting divergence points of f0 trajectories in contrasting disyllabic tone pairs, using generalized additive mixed models (GAMMs). The alignment of the divergence-determined vowel and tone onsets is then evaluated with linear mixed effect models (LMEMs) and their synchrony is validated with Bayes factors. The results indicate that tone and vowel onsets are fully synchronized. There is therefore evidence for strict alignment of consonant, vowel and tone as hypothesized in the synchronization model of the syllable. Also, with the newly established tone onset, the previously reported ‘anticipatory raising’ effect of tone now appears to occur within rather than before the articulatory syllable. Implications of these findings will be discussed.
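
The paper locates divergence points with GAMMs; as a rough illustration of the same idea under a simpler method, the sketch below finds the first time point at which two sets of trajectories (e.g. f0 tracks for two contrasting tones, or F2 tracks for two contrasting vowels) reliably diverge, using a bootstrap confidence interval of the mean difference. All names, toy numbers, and thresholds are assumptions.

```python
import numpy as np

def divergence_onset(traj_a, traj_b, times, n_boot=2000, alpha=0.05, seed=0):
    """First time point where the two trajectory sets reliably diverge.

    traj_a, traj_b: (n_repetitions, n_timepoints) arrays. A bootstrap CI of the
    mean difference is built per time point; the onset is the first point whose
    CI excludes zero.
    """
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        a = traj_a[rng.integers(0, len(traj_a), len(traj_a))].mean(axis=0)
        b = traj_b[rng.integers(0, len(traj_b), len(traj_b))].mean(axis=0)
        diffs.append(a - b)
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    significant = (lo > 0) | (hi < 0)
    return times[int(np.argmax(significant))] if significant.any() else None

# toy example: two f0 contours that start identical and split halfway through
times = np.linspace(0, 0.4, 41)
base = 120 + 30 * np.sin(np.pi * times / 0.4)
tone_a = base + np.random.default_rng(1).normal(0, 2, (20, 41))
tone_b = base + np.where(times > 0.2, 25.0, 0.0) + np.random.default_rng(2).normal(0, 2, (20, 41))
print(divergence_onset(tone_a, tone_b, times))  # roughly 0.2 s
```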

Citations: 0
Arabic Automatic Speech Recognition: Challenges and Progress
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-31 | DOI: 10.1016/j.specom.2024.103110

This paper provides a structured examination of Arabic Automatic Speech Recognition (ASR), focusing on the complexity posed by the language’s diverse forms and dialectal variations. We first explore the Arabic language forms, delimiting the challenges encountered with Dialectal Arabic, including issues such as code-switching and non-standardized orthography and, thus, the scarcity of large annotated datasets. Subsequently, we delve into the landscape of Arabic resources, distinguishing between Modern Standard Arabic (MSA) and Dialectal Arabic (DA) Speech Resources and highlighting the disparities in available data between these two categories. Finally, we analyze both traditional and modern approaches in Arabic ASR, assessing their effectiveness in addressing the unique challenges inherent to the language. Through this comprehensive examination, we aim to provide insights into the current state and future directions of Arabic ASR research and development.

Citations: 0
Yanbian Korean speakers tend to merge /e/ and /ɛ/ when exposed to Seoul Korean
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-30 | DOI: 10.1016/j.specom.2024.103111

This study examined the vowel merger between the two vowels /e/ and /ɛ/ in Yanbian Korean. This sound change has already spread to Seoul Korean, particularly among speakers born after the 1970s. The aim of this study was to determine whether close exposure to Seoul Korean speakers leads to the neutralization of the distinction between the two vowels /e/ and /ɛ/. We recruited 20 Yanbian Korean speakers and asked them about their frequency of exposure to Seoul Korean. The exposure level of each participant was also recorded using a Likert scale. The results revealed that speakers with limited in-person interactions with Seoul Korean speakers exhibited distinct vowel productions within the vowel space. In contrast, those with frequent in-person interactions with Seoul Korean speakers tended to neutralize the two vowels, displaying considerably overlapping patterns in the vowel space. The relationship between the level of exposure to Seoul Korean and speakers’ vowel production was statistically confirmed by a linear regression analysis. Based on the results of this study, we speculate that the sound change in Yanbian Korean may become more widespread as Yanbian Korean speakers are increasingly exposed to Seoul Korean.
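
As a hedged illustration of the kind of analysis reported (the study itself fits a linear regression between exposure level and vowel production), the sketch below measures per-speaker /e/-/ɛ/ separation as the Euclidean distance between F1-F2 centroids and regresses it on a Likert exposure score; the distance measure and the toy numbers are assumptions, not the study's data.

```python
import numpy as np
from scipy import stats

def vowel_separation(f1_e, f2_e, f1_eh, f2_eh):
    """Euclidean distance (Hz) between the /e/ and /ɛ/ centroids in F1-F2 space."""
    centroid_e = np.array([np.mean(f1_e), np.mean(f2_e)])
    centroid_eh = np.array([np.mean(f1_eh), np.mean(f2_eh)])
    return float(np.linalg.norm(centroid_e - centroid_eh))

# one separation value and one Likert exposure score per speaker (toy numbers)
separation = np.array([180.0, 150.0, 95.0, 60.0, 35.0])
exposure = np.array([1, 2, 3, 4, 5])

fit = stats.linregress(exposure, separation)
print(f"slope = {fit.slope:.1f} Hz per exposure step, p = {fit.pvalue:.4f}")
```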

Citations: 0
Prosody in narratives: An exploratory study with children with sex chromosomes trisomies
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-26 | DOI: 10.1016/j.specom.2024.103107

Although language delays are common in children with sex chromosome trisomies [SCT], no studies have analysed their prosodic abilities. Considering the importance of prosody in communication, this exploratory study aims to analyse the prosodic features of the narratives of 4-year-old children with SCT.

Participants included 22 children with SCT and 22 typically developing [TD] children. The Narrative Competence Task was administered to elicit the child's narrative. Each utterance was prosodically analysed considering pitch and timing variables.

Considering pitch, the only difference was the number of movements since the utterances of children with SCT were characterised by a lower speech modulation. However, considering the timing variables, children with SCT produced a faster speech rate and a shorter final syllable duration than TD children.

Since both speech modulation and duration measures have important syntactic and pragmatic functions, further investigations should deeply analyse the prosodic skills of children with SCT in interaction with syntax and pragmatics.

Citations: 0
Progressive channel fusion for more efficient TDNN on speaker verification
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-23 | DOI: 10.1016/j.specom.2024.103105

ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most of the updates pay attention to building precisely designed auxiliary modules, the depth-first principle has shown promising performance recently. However, empirical experiments show that one-dimensional convolution (Conv1D) based TDNNs suffer from performance degradation by simply adding massive vanilla basic blocks. Since Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build local RF and fusing the subbands progressively. Instead of reducing the group number in convolution layers used in the previous work, a novel channel permutation strategy is introduced to build information flow between groups so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is simultaneously solved by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. The experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.
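
A rough PyTorch sketch of the two mechanisms named above, group (subband) convolution plus a channel permutation that lets information flow between groups; the block sizes, residual layout, and permutation placement are assumptions for illustration rather than the paper's released architecture.

```python
import torch
import torch.nn as nn

def channel_permute(x, groups):
    """ShuffleNet-style permutation so the next grouped conv mixes channels across groups."""
    b, c, t = x.shape
    return x.view(b, groups, c // groups, t).transpose(1, 2).reshape(b, c, t)

class GroupedTDNNBlock(nn.Module):
    """Grouped 1-D conv block with channel permutation (a sketch of one PCF-style stage)."""
    def __init__(self, channels=512, kernel_size=3, groups=8):
        super().__init__()
        self.groups = groups
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=groups)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, frames)
        y = self.act(self.bn(self.conv(x)))
        y = channel_permute(y, self.groups)
        return x + y                      # residual connection

x = torch.randn(4, 512, 200)
print(GroupedTDNNBlock()(x).shape)        # torch.Size([4, 512, 200])
```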

Citations: 0
Decoupled structure for improved adaptability of end-to-end models
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-23 | DOI: 10.1016/j.specom.2024.103109

Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.
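
A toy PyTorch sketch of the decoupling idea, assuming a decoder whose linguistic component is an explicit, swappable module; the real Decoupled-AED and Decoupled-Transducer decoders are more involved, and every class name and dimension here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class InternalLM(nn.Module):
    """Linguistic component: predicts the next token from the token history only."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):             # (batch, len) -> (batch, len, dim)
        out, _ = self.rnn(self.emb(tokens))
        return out

class DecoupledDecoder(nn.Module):
    """Acoustic and linguistic parts are kept separate, so `lm` is replaceable."""
    def __init__(self, lm, vocab=1000, dim=256):
        super().__init__()
        self.lm = lm                        # internal LM (source domain by default)
        self.acoustic_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, enc, tokens):         # enc: (batch, len, dim) acoustic context
        return self.out(self.acoustic_proj(enc) + self.lm(tokens))

decoder = DecoupledDecoder(InternalLM())
# Domain adaptation with text-only data: train a new InternalLM on target-domain
# text, then swap it in at inference without touching the acoustic part:
decoder.lm = InternalLM()                   # stands in for a target-domain LM
logits = decoder(torch.randn(2, 5, 256), torch.randint(0, 1000, (2, 5)))
print(logits.shape)                         # torch.Size([2, 5, 1000])
```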

Citations: 0
Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-18 | DOI: 10.1016/j.specom.2024.103106

Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.
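
The paper's Expectation-One-Hot and HuBERT label-generation policies are not reproduced here; the sketch below only shows the mechanical part in PyTorch, assuming a placeholder policy that repeats the utterance-level label a few times as the CTC target and then applies torch.nn.CTCLoss to frame-level posteriors.

```python
import torch
import torch.nn as nn

# classes: 0 = CTC blank, 1 = non-depressive, 2 = depressive
ctc_loss = nn.CTCLoss(blank=0)

def make_targets(segment_label, repeats=3):
    """Placeholder label policy: repeat the utterance-level label a few times and
    let CTC decide where in the segment the supporting frames lie."""
    return torch.full((repeats,), segment_label, dtype=torch.long)

frames, batch, classes = 120, 4, 3
logits = torch.randn(frames, batch, classes, requires_grad=True)   # stand-in acoustic model output
log_probs = logits.log_softmax(dim=-1)                             # (T, N, C), as CTCLoss expects

targets = torch.stack([make_targets(2), make_targets(1), make_targets(2), make_targets(1)])
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), targets.shape[1], dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                                    # gradients flow back to the acoustic model
print(float(loss))
```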

Citations: 0
Whisper-SV: Adapting Whisper for low-data-resource speaker verification
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-14 | DOI: 10.1016/j.specom.2024.103103

Trained on 680,000 h of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
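
A conceptual PyTorch sketch of the multi-layer aggregation step, assuming the top-k Whisper encoder layer outputs are already available (random tensors stand in for them here, since the actual layer extraction and the representation-selection criterion are not reproduced); the learned softmax weights approximate the attention-style aggregation described, and all shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class MultiLayerAggregation(nn.Module):
    """Collapse k selected encoder layers into one representation with learned softmax weights."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_states):        # (k, batch, frames, dim)
        w = torch.softmax(self.layer_logits, dim=0)          # per-layer weights
        fused = (w[:, None, None, None] * layer_states).sum(dim=0)
        return self.proj(fused)              # (batch, frames, dim), ready for pooling + classifier

# stand-ins for the top-k Whisper encoder layers picked by the selection module
k, batch, frames, dim = 4, 2, 150, 512
layer_states = torch.randn(k, batch, frames, dim)
speaker_features = MultiLayerAggregation(k, dim)(layer_states)
print(speaker_features.shape)                # torch.Size([2, 150, 512])
```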

Citations: 0
Advancing speaker embedding learning: Wespeaker toolkit for research and production
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2024-07-01 | DOI: 10.1016/j.specom.2024.103104

Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at https://github.com/wenet-e2e/wespeaker.
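
Independent of Wespeaker's own API (which is not reproduced here), the sketch below illustrates the generic verification step that such speaker embeddings feed into: cosine scoring of enrollment/test trial pairs and an equal error rate read off the ROC curve. The embeddings and trials are toy stand-ins, not Wespeaker code or data.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    i = int(np.nanargmin(np.abs(fnr - fpr)))
    return float((fpr[i] + fnr[i]) / 2)

rng = np.random.default_rng(0)
speakers = {name: rng.normal(size=256) for name in ["A", "B", "C"]}
# trials: (enrollment speaker, test speaker, same-speaker label)
trials = [("A", "A", 1), ("A", "B", 0), ("B", "B", 1), ("B", "C", 0), ("C", "C", 1), ("A", "C", 0)]
scores = [cosine_score(speakers[e] + rng.normal(scale=0.3, size=256), speakers[t]) for e, t, _ in trials]
labels = [y for _, _, y in trials]
print("EER:", equal_error_rate(labels, scores))
```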

Citations: 0