
Speech Communication: Latest Publications

A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-10-09, DOI: 10.1016/j.specom.2024.103141
A typical speech-in-noise experiment in a research and development setting can easily contain as many as 20 conditions, or even more, and often requires at least two test points per condition. A sentence test with enough sentences to make this amount of testing possible without repetition does not yet exist in Danish. Thus, a new corpus has been developed to facilitate the creation of a sentence test that is large enough to address this need. The corpus itself is made up of audio and audio-visual recordings of 1200 linguistically balanced sentences, all of which are spoken by two female and two male talkers. The sentences were constructed using a novel, template-based method that facilitated control over both word frequency and sentence structure. The sentences were evaluated linguistically in terms of phonemic distributions, naturalness, and connotation, and thereafter, recorded, postprocessed, and rated on their audio, visual, and pronunciation qualities. This paper describes in detail the methodology employed to create and characterize this corpus.
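As a quick illustration of the sentence budget implied by such a design, the sketch below multiplies conditions, test points, and an assumed list length; the 20-sentence list length is a hypothetical value for illustration, not a figure from the paper.

```python
# Back-of-the-envelope estimate of how many unique sentences a speech-in-noise
# study consumes when no sentence is reused. The list length is an assumed,
# illustrative value; conditions and test points follow the abstract.
def sentences_needed(conditions: int, test_points_per_condition: int,
                     sentences_per_list: int) -> int:
    """Unique sentences required when no sentence is repeated across lists."""
    return conditions * test_points_per_condition * sentences_per_list

if __name__ == "__main__":
    needed = sentences_needed(conditions=20, test_points_per_condition=2,
                              sentences_per_list=20)
    print(f"Unique sentences required: {needed}")  # 800 for this hypothetical design
```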
Citations: 0
Forms, factors and functions of phonetic convergence: Editorial
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-09-30, DOI: 10.1016/j.specom.2024.103142
This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers’ acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory and perceptual evaluations of convergence, and combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.
Citations: 0
Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-09-28, DOI: 10.1016/j.specom.2024.103144
Assessment of upper airway dimensions has shown great promise in understanding the pathogenesis of obstructive sleep apnea (OSA). However, the current screening system for OSA does not include an objective assessment of the upper airway. The upper airway can be assessed accurately by MRI or CT scans, which are costly and not easily accessible. Acoustic pharyngometry or ultrasonography could be less expensive technologies, but these require trained personnel, which makes them difficult to deploy, especially when assessing the upper airway in a clinic environment or before surgery. In this study, we aimed to investigate the utility of vowel articulation in assessing the upper airway dimension during normal breathing. To accomplish that, we measured the upper airway cross-sectional area (UA-XSA) by acoustic pharyngometry and then asked the participants to produce 5 vowels for 3 s and recorded them with a microphone. We extracted 710 acoustic features from all vowels, compared these features with UA-XSA, and developed regression models to estimate the UA-XSA. Our results showed that Mel frequency cepstral coefficients (MFCC) were the most dominant features of vowels, as 7 of the 9 features in the main feature set were MFCC-based. The multiple regression analysis showed that the combination of the acoustic features with the anthropometric features achieved an R² of 0.80 in estimating UA-XSA. An important advantage of acoustic analysis of vowel sounds is that it is simple and can be easily implemented in wearable devices or mobile applications. Such acoustic-based technologies can be accessible in different clinical settings, such as the intensive care unit, and can be used in remote areas. Thus, these results could be used to develop user-friendly applications that use the acoustic features and demographic information to estimate the UA-XSA.
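A minimal sketch of the general pipeline described in this abstract, assuming librosa-style MFCC extraction and a plain linear regression on acoustic plus anthropometric covariates; the paper's full 710-feature set and model selection are not reproduced here.

```python
# Sketch only: extract MFCC-based features from a vowel recording and regress
# them (with anthropometric covariates) onto the measured cross-sectional area.
import numpy as np
import librosa
from sklearn.linear_model import LinearRegression

def vowel_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Mean MFCCs over a sustained vowel recording."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # (n_mfcc,)

def fit_ua_xsa_model(X_acoustic, X_anthro, y):
    """X_acoustic: per-participant vowel features; X_anthro: e.g. height, weight,
    neck circumference; y: UA-XSA measured by acoustic pharyngometry."""
    X = np.hstack([X_acoustic, X_anthro])
    model = LinearRegression().fit(X, y)
    print(f"R^2 on training data: {model.score(X, y):.2f}")
    return model
```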
Citations: 0
Zero-shot voice conversion based on feature disentanglement
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-09-27, DOI: 10.1016/j.specom.2024.103143
Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention in the task of VC because it can achieve conversion for speakers who did not appear during the training stage. Despite the significant progress made by previous methods in zero-shot VC, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. The experiments demonstrate that the performance of the proposed model is superior to several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of existing state-of-the-art models.
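The sketch below shows one common way to condition layer normalization on a speaker embedding, as a rough illustration of the kind of conditioning the abstract mentions; the paper's mixed speaker layer normalization and adaptive attention weight normalization are not specified here, so all module names and dimensions are assumptions.

```python
# Hedged sketch: a speaker embedding predicts the scale and shift applied to
# normalized content features, injecting speaker identity into the decoder.
import torch
import torch.nn as nn

class SpeakerConditionalLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(speaker_dim, hidden_dim)  # per-speaker scale
        self.to_beta = nn.Linear(speaker_dim, hidden_dim)   # per-speaker shift

    def forward(self, content: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # content: (batch, time, hidden_dim); spk: (batch, speaker_dim)
        normed = self.norm(content)
        gamma = self.to_gamma(spk).unsqueeze(1)             # (batch, 1, hidden_dim)
        beta = self.to_beta(spk).unsqueeze(1)
        return gamma * normed + beta

x = torch.randn(2, 100, 256)   # content encoding
spk = torch.randn(2, 128)      # embedding from a speaker encoder
out = SpeakerConditionalLayerNorm(256, 128)(x, spk)
```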
Citations: 0
Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-09-12, DOI: 10.1016/j.specom.2024.103140

Silent speech recognition (SSR) is an essential task in human–computer interaction, aiming to recognize speech from non-acoustic modalities. A key challenge in SSR is inherent input ambiguity due to the partial absence of speech information in non-acoustic signals. This ambiguity leads to homophenes: words with similar inputs yet different pronunciations. Current approaches address this issue either by utilizing richer additional inputs or by training extra models for cross-modal embedding compensation. In this paper, we propose an effective multi-modal co-learning framework promoting the discriminative ability of silent speech representations via multi-stage training. We first construct the backbone of SSR using ultrasound tongue imaging (UTI) as the main modality and then introduce two auxiliary modalities: lip video and audio signals. Utilizing modality dropout, the model learns shared/specific features from all available streams, creating a common semantic space for better generalization of the UTI representation. Given cross-modal unbalanced optimization, we highlight the importance of hyperparameter settings and modulation strategies in enabling modality-specific co-learning for SSR. Experimental results show that the modality-agnostic models with single UTI input outperform state-of-the-art modality-specific models. Confusion analysis based on phonemes/articulatory features confirms that co-learned UTI representations contain valuable information for distinguishing homophenes. Additionally, our model performs well on two unseen testing sets, achieving cross-modal generalization for the uni-modal SSR task.
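A minimal sketch of modality dropout, assuming additive fusion and a 0.5 drop probability: auxiliary lip and audio streams are randomly zeroed during training so that the ultrasound-only path still works at test time. The fusion rule and probabilities are illustrative assumptions, not the paper's configuration.

```python
# Sketch: randomly drop auxiliary modalities so the shared representation
# generalizes to the UTI-only test condition.
import torch

def modality_dropout(uti, lip, audio, p_drop: float = 0.5, training: bool = True):
    """uti/lip/audio: (batch, time, dim) features already projected to one space."""
    if training:
        if torch.rand(1).item() < p_drop:
            lip = torch.zeros_like(lip)
        if torch.rand(1).item() < p_drop:
            audio = torch.zeros_like(audio)
    # UTI is the main modality and is never dropped.
    return uti + lip + audio   # simple additive fusion into a shared space

fused = modality_dropout(torch.randn(4, 50, 256),
                         torch.randn(4, 50, 256),
                         torch.randn(4, 50, 256))
```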

Citations: 0
CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-09-10, DOI: 10.1016/j.specom.2024.103139

One-shot voice conversion (VC) has attracted increasing attention due to its broad prospects for practical application. In this task, the representation ability of speech features and the model's generalization are the main concerns. This paper proposes a model called CLESSR-VC, which enhances pre-trained self-supervised learning (SSL) representations through contrastive learning for one-shot VC. First, SSL features from the 23rd and 9th layers of the pre-trained WavLM are adopted to extract the content embedding and the SSL speaker embedding, respectively, to ensure the model's generalization. Then, the conventional acoustic feature mel-spectrogram and contrastive learning are introduced to enhance the representation ability of speech features. Specifically, contrastive learning combined with pitch-shift augmentation is applied to accurately disentangle content information from SSL features. Mel-spectrograms are adopted to extract the mel speaker embedding. AM-Softmax and cross-architecture contrastive learning are applied between the SSL and mel speaker embeddings to obtain a fused speaker embedding that helps improve speech quality and speaker similarity. Both objective and subjective evaluation results on the VCTK corpus confirm that the proposed VC model achieves outstanding performance with few trainable parameters.
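The sketch below illustrates the contrastive idea in a generic InfoNCE form: the content embedding of an utterance and that of its pitch-shifted copy form a positive pair, with other utterances in the batch as negatives. The temperature and loss formulation are assumptions for illustration, not the paper's AM-Softmax or cross-architecture setup.

```python
# Sketch: InfoNCE over (original, pitch-shifted) pairs, which encourages the
# content embedding to ignore pitch/speaker cues.
import torch
import torch.nn.functional as F

def info_nce(z_orig: torch.Tensor, z_shifted: torch.Tensor, tau: float = 0.1):
    """z_orig, z_shifted: (batch, dim) content embeddings of the two views."""
    z1 = F.normalize(z_orig, dim=-1)
    z2 = F.normalize(z_shifted, dim=-1)
    logits = z1 @ z2.t() / tau              # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```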

Citations: 0
CSLNSpeech: Solving the extended speech separation problem with the help of Chinese sign language
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-09-02, DOI: 10.1016/j.specom.2024.103131

Previous audio-visual speech separation methods synchronize the speaker's facial movement and speech in the video to self-supervise the speech separation. In this paper, we propose a model to solve the speech separation problem assisted by both face and sign language, which we call the extended speech separation problem. We design a general deep learning network to learn the combination of three modalities, audio, face, and sign language information, to solve the speech separation problem better. We introduce a large-scale dataset named the Chinese Sign Language News Speech (CSLNSpeech) dataset to train the model, in which three modalities coexist: audio, face, and sign language. Experimental results show that the proposed model performs better and is more robust than the usual audio-visual system. In addition, the sign language modality can also be used alone to supervise speech separation tasks, and introducing sign language helps hearing-impaired people learn and communicate. Last, our model is a general speech separation framework and can achieve very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech
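A hedged sketch of the three-stream idea: audio, face, and sign-language features are encoded separately, concatenated, and used to predict a mask over the mixture spectrogram. All dimensions and the single-layer encoders are illustrative assumptions; the released code at the repository above defines the actual model.

```python
# Sketch: tri-modal fusion for mask-based speech separation.
import torch
import torch.nn as nn

class TriModalSeparator(nn.Module):
    def __init__(self, freq_bins=257, face_dim=512, sign_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = nn.Linear(freq_bins, hidden)
        self.face_enc = nn.Linear(face_dim, hidden)
        self.sign_enc = nn.Linear(sign_dim, hidden)
        self.mask_head = nn.Sequential(nn.Linear(3 * hidden, freq_bins), nn.Sigmoid())

    def forward(self, mix_spec, face_feat, sign_feat):
        # All inputs: (batch, time, feature_dim), frame-aligned.
        h = torch.cat([self.audio_enc(mix_spec),
                       self.face_enc(face_feat),
                       self.sign_enc(sign_feat)], dim=-1)
        return self.mask_head(h) * mix_spec   # masked magnitude spectrogram

sep = TriModalSeparator()
out = sep(torch.rand(2, 100, 257), torch.randn(2, 100, 512), torch.randn(2, 100, 512))
```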

Citations: 0
Comparing neural network architectures for non-intrusive speech quality prediction
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-08-30, DOI: 10.1016/j.specom.2024.103123

Non-intrusive speech quality predictors evaluate speech quality without the use of a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective targets and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset, and used to predict the intrusive ViSQOL score. The evaluation is done on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora, and common voice over IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution, and overall good generalization to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that a pervasiveness of ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.
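A minimal Quality-Net-style sketch, assuming a single BiLSTM over log-mel frames whose per-frame scores are averaged into one utterance-level estimate (here intended to be trained against ViSQOL targets); layer sizes are illustrative and not the configurations evaluated in the paper.

```python
# Sketch: BiLSTM-based non-intrusive quality predictor over spectrogram frames.
import torch
import torch.nn as nn

class BLSTMQualityNet(nn.Module):
    def __init__(self, n_mels=80, hidden=100):
        super().__init__()
        self.blstm = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.frame_score = nn.Linear(2 * hidden, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) log-mel spectrogram of the degraded signal
        h, _ = self.blstm(mel)
        per_frame = self.frame_score(h).squeeze(-1)   # (batch, frames)
        return per_frame.mean(dim=1)                  # utterance-level score

score = BLSTMQualityNet()(torch.randn(4, 300, 80))
```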

Citations: 0
Accurate synthesis of dysarthric speech for ASR data augmentation
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-08-10, DOI: 10.1016/j.specom.2024.103112

Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.

This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels.

To evaluate the effectiveness of the synthesized training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a relative Word Error Rate (WER) improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthricness and similarity of the synthesized speech. This evaluation shows that the perceived dysarthricness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthria. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
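Since the reported gains are relative rather than absolute, the snippet below shows how a relative WER improvement is computed; the example WER values are hypothetical placeholders, not numbers from the paper.

```python
# Relative (not absolute) WER improvement, as reported in the abstract.
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

baseline, augmented = 40.0, 35.12   # hypothetical WERs in percent
print(f"{relative_wer_improvement(baseline, augmented):.1f}% relative improvement")
```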

Citations: 0
CFAD: A Chinese dataset for fake audio detection
IF 2.4, Tier 3 (Computer Science), Q2 (Acoustics), Pub Date: 2024-08-08, DOI: 10.1016/j.specom.2024.103122

Fake audio detection is a growing concern, and some relevant datasets have been designed for research. However, there is no standard public Chinese dataset covering complex conditions. In this paper, we aim to fill this gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for adding noise at five different signal-to-noise ratios, and six codecs are considered for audio transcoding (format conversion). The CFAD dataset can be used not only for fake audio detection but also for detecting the algorithms used to generate fake utterances, which is relevant for audio forensics. Baseline results are presented with analysis. The results show that generalization remains challenging for fake audio detection methods. The CFAD dataset is publicly available.
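A minimal sketch of the noise-adding step described above: scale a noise segment so that it mixes with clean speech at a target SNR. The SNR value and the cropping policy are illustrative assumptions; the dataset uses five SNR levels.

```python
# Sketch: mix noise into clean speech at a target signal-to-noise ratio (dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                    # assume noise is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

noisy = mix_at_snr(np.random.randn(16000), np.random.randn(32000), snr_db=5.0)
```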

Citations: 0