
Speech Communication: Latest Publications

Pre-trained models for detection and severity level classification of dysarthria from speech
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2024-02-01, DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi, Sudarsana Reddy Kadiri, P. Alku
{"title":"Pre-trained models for detection and severity level classification of dysarthria from speech","authors":"Farhad Javanmardi, Sudarsana Reddy Kadiri, P. Alku","doi":"10.1016/j.specom.2024.103047","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103047","url":null,"abstract":"","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139817187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2024-02-01, DOI: 10.1016/j.specom.2024.103041
Simon Stone, Peter Birkholz

German primary diphthongs are conventionally transcribed using the same symbols used for some monophthong vowels. However, if the corresponding vocal tract shapes are used for articulatory synthesis, the results often sound unnatural. Furthermore, there is no clear consensus in the literature on whether diphthongs have monophthong constituents and, if so, which ones. This study therefore analyzed a set of audio recordings from the reference speaker of the state-of-the-art articulatory synthesizer VocalTractLab to identify likely candidates for the monophthong constituents of the German primary diphthongs. We then evaluated these candidates in a listening experiment with naive listeners to determine a naturalness ranking of these candidates and specialized diphthong shapes. The results showed that the German primary diphthongs can indeed be synthesized with no significant loss in naturalness by replacing the specialized diphthong shapes for the initial and final segments with shapes also used for monophthong vowels.
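As a rough illustration of the underlying idea (a hypothetical sketch, not the authors' VocalTractLab setup), a diphthong trajectory can be built from two monophthong target shapes by holding the initial shape and then gliding toward the final one; the parameter vectors and timing below are invented for illustration.

```python
# Hypothetical sketch: build a diphthong trajectory from two monophthong
# vocal tract shapes represented as plain parameter vectors.
import numpy as np

def diphthong_trajectory(shape_initial, shape_final, n_frames=50, transition_start=0.4):
    """Hold the initial monophthong shape, then glide linearly to the final shape."""
    shape_initial = np.asarray(shape_initial, dtype=float)
    shape_final = np.asarray(shape_final, dtype=float)
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        # Piecewise schedule: steady initial segment, then a linear glide.
        alpha = 0.0 if t < transition_start else (t - transition_start) / (1 - transition_start)
        frames.append((1 - alpha) * shape_initial + alpha * shape_final)
    return np.stack(frames)  # (n_frames, n_parameters)

# Toy example: two invented 5-parameter monophthong shapes.
traj = diphthong_trajectory([0.1, 0.5, 0.3, 0.8, 0.2], [0.7, 0.2, 0.6, 0.1, 0.9])
print(traj.shape)
```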

Citations: 0
The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2024-01-13, DOI: 10.1016/j.specom.2024.103038
Ingy Farouk Emara, Nabil Hamdy Shaker

The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL teachers and 70 Egyptian university students towards the L1 (Arabic)-based errors affecting intelligibility and then carried out a data analysis of the ASR of the students’ English speech to find out whether the errors investigated resulted in intelligibility breakdowns in an ASR setting. In terms of the phonological features of non-native speech, the results showed that the teachers gave more weight to pronunciation features of accented speech that did not actually hinder recognition, that the students were mostly oblivious to the L2 errors they made and their impact on intelligibility, and that L2 errors which were not perceived as serious by both teachers and students had negative impacts on ASR accuracy levels. In regard to the prosodic features of non-native speech, it was found that lower speech rates resulted in more accurate speech recognition levels, higher speech intensity led to fewer deletion errors, and voice pitch did not seem to have any impact on ASR accuracy levels. The study, accordingly, recommends training ASR systems with more non-native data to increase their accuracy levels as well as paying more attention to remedying non-native speakers’ L1-based errors that are more likely to impact non-native automatic speech recognition.
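As a sketch of how such a data analysis could be reproduced in principle (assuming the jiwer package for word error rate and scipy for correlation; the utterances, hypotheses, and durations below are invented), one can relate a prosodic measure such as speech rate to per-utterance ASR accuracy:

```python
# Hypothetical sketch: correlate per-utterance speech rate with ASR word error rate.
from jiwer import wer
from scipy.stats import pearsonr

references = ["the weather is nice today",
              "i would like a cup of tea",
              "she studies engineering at the university",
              "please call me back tomorrow morning"]
hypotheses = ["the weather is nice to day",
              "i would like a cup of tea",
              "she study engineering at the university",
              "please call me back tomorrow morning"]
durations_s = [2.1, 1.8, 2.9, 2.4]  # utterance durations in seconds

wers = [wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]
rates = [len(ref.split()) / dur for ref, dur in zip(references, durations_s)]  # words per second

r, p = pearsonr(rates, wers)
print(f"speech rate vs. WER: r={r:.2f}, p={p:.3f}")
```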

Citations: 0
Deep temporal clustering features for speech emotion recognition
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2024-01-02, DOI: 10.1016/j.specom.2023.103027
Wei-Cheng Lin, Carlos Busso

Deep clustering is a popular unsupervised technique for feature representation learning. We recently proposed the chunk-based DeepEmoCluster framework for speech emotion recognition (SER) to adopt the concept of deep clustering as a novel semi-supervised learning (SSL) framework, which achieved improved recognition performances over conventional reconstruction-based approaches. However, the vanilla DeepEmoCluster lacks critical sentence-level temporal information that is useful for SER tasks. This study builds upon the DeepEmoCluster framework, creating a powerful SSL approach that leverages temporal information within a sentence. We propose two sentence-level temporal modeling alternatives using either the temporal-net or the triplet loss function, resulting in a novel temporal-enhanced DeepEmoCluster framework to capture essential temporal information. The key contribution to achieving this goal is the proposed sentence-level uniform sampling strategy, which preserves the original temporal order of the data for the clustering process. An extra network module (e.g., gated recurrent unit) is utilized for the temporal-net option to encode temporal information across the data chunks. Alternatively, we can impose additional temporal constraints by using the triplet loss function while training the DeepEmoCluster framework, which does not increase model complexity. Our experimental results based on the MSP-Podcast corpus demonstrate that the proposed temporal-enhanced framework significantly outperforms the vanilla DeepEmoCluster framework and other existing SSL approaches in regression tasks for the emotional attributes arousal, dominance, and valence. The improvements are observed in fully-supervised learning or SSL implementations. Further analyses validate the effectiveness of the proposed temporal modeling, showing (1) high temporal consistency in the cluster assignment, and (2) well-separated emotional patterns in the generated clusters.
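A minimal sketch of the triplet-loss alternative, under the assumption that chunk embeddings of one sentence are available as a tensor (illustrative PyTorch, not the authors' implementation): neighbouring chunks serve as anchor-positive pairs and temporally distant chunks as negatives.

```python
# Illustrative sketch: a triplet loss over chunk embeddings of one sentence that
# pulls neighbouring chunks together and pushes temporally distant chunks apart.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def temporal_triplet_loss(chunk_embeddings):
    """chunk_embeddings: (T, D) embeddings of T consecutive chunks of one sentence."""
    anchor = chunk_embeddings[:-2]       # chunk t
    positive = chunk_embeddings[1:-1]    # its neighbour, chunk t+1
    # Negatives: chunks roughly half a sentence away from each anchor.
    shift = chunk_embeddings.size(0) // 2
    negative = chunk_embeddings.roll(shifts=shift, dims=0)[:-2]
    return triplet(anchor, positive, negative)

# Toy example: 10 chunks with 128-dimensional embeddings.
loss = temporal_triplet_loss(torch.randn(10, 128))
print(loss.item())
```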

Citations: 0
LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2023-12-24, DOI: 10.1016/j.specom.2023.103028
Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan

Researchers have shown a growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio. We used the U-Net architecture based on residual CBAM to better encode and fuse audio and visual modal information. Additionally, the semantic alignment module extends the receptive field of the generator network to efficiently obtain the spatial and channel information of the visual features, and matches the statistical information of the visual features with the audio latent vector to adjust and inject the audio content information into the visual information. To achieve exact lip synchronization and to generate realistic high-quality images, our approach adopts LPIPS Loss, which simulates human judgment of image quality and reduces the possibility of instability during the training process. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality, as demonstrated by subjective and objective evaluation results.
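For readers unfamiliar with LPIPS, the sketch below shows how a perceptual LPIPS term can be combined with a pixel-level L1 term in a generator reconstruction loss; it assumes the publicly available lpips PyPI package and is a generic illustration rather than the LPIPS-AttnWav2Lip training code (the weights w_lpips and w_l1 are hypothetical).

```python
# Illustrative sketch: combine a perceptual LPIPS term with a pixel-level L1 term
# for generated face frames; inputs are (B, 3, H, W) tensors scaled to [-1, 1].
import torch
import torch.nn.functional as F
import lpips

perceptual = lpips.LPIPS(net="alex")  # LPIPS with an AlexNet backbone

def reconstruction_loss(generated, target, w_lpips=1.0, w_l1=1.0):
    lp = perceptual(generated, target).mean()   # perceptual similarity term
    l1 = F.l1_loss(generated, target)           # pixel-level term
    return w_lpips * lp + w_l1 * l1

# Toy example with random tensors standing in for generated and ground-truth frames.
fake = torch.rand(2, 3, 96, 96) * 2 - 1
real = torch.rand(2, 3, 96, 96) * 2 - 1
print(reconstruction_loss(fake, real).item())
```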

Citations: 0
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2023-12-14, DOI: 10.1016/j.specom.2023.103024
Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li, Jianwu Dang

Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet) that integrates our AMME with CANet. Firstly, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the process of sound-signal transmission to inner ear hair cells and subsequent modulation filtering by neural cells. Secondly, building upon the observed masking effects in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Thirdly, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD. This methodology uses an attention mechanism to assign higher weights to contextual information containing richer and more informative cues. Through extensive experimentation and evaluation, we demonstrated the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.
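A minimal sketch of the contextual-attention idea for VAD (an assumption about the general mechanism, not the AMME-CANet architecture): each frame attends over its context so that frames carrying richer contextual cues receive higher weights before frame-wise speech/non-speech classification.

```python
# Illustrative sketch: frame-wise VAD with self-attention over context frames.
import torch
import torch.nn as nn

class ContextAttentionVAD(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)  # frame-wise speech / non-speech logit

    def forward(self, features):
        """features: (batch, time, feat_dim) frame-level (e.g. modulation-domain) features."""
        context, weights = self.attn(features, features, features)
        return self.classifier(context).squeeze(-1), weights

model = ContextAttentionVAD()
logits, attn_weights = model(torch.randn(2, 100, 64))
print(logits.shape, attn_weights.shape)  # (2, 100), (2, 100, 100)
```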

Citations: 0
Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2023-12-14, DOI: 10.1016/j.specom.2023.103026
Yunqi C. Zhang, Yusuke Hioka, C.T. Justine Hui, Catherine I. Watson

Speech enhancement (SE) is a widely used technology to improve the quality and intelligibility of noisy speech. So far, SE algorithms have been designed and evaluated on native listeners only, but not on non-native listeners, who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed New Zealand English (NZE) listeners and native Mandarin listeners with different immersion conditions in NZE under negative input signal-to-noise ratio (SNR) by conducting a subjective listening test with NZE sentences. The performance of the SE algorithms in terms of speech intelligibility in the three participant groups was investigated. The results showed that the early-immersed group always achieved the highest intelligibility. The late-immersed group outperformed the non-immersed group at higher input SNR conditions, possibly due to increasing familiarity with the NZE accent, whereas this advantage disappeared at the lowest tested input SNR conditions. The SE algorithms tested in this study failed to improve and rather degraded the speech intelligibility, indicating that these SE algorithms may not be able to reduce the perception gap between early-, late- and non-immersed listeners, nor able to improve the speech intelligibility under negative input SNR in general. These findings have implications for the future development of SE algorithms tailored to Mandarin listeners, and for understanding the impact of language immersion on speech perception in noise.
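For context on the evaluation setup, mixing clean speech with noise at a prescribed (possibly negative) input SNR amounts to scaling the noise to the target ratio; the sketch below is a generic illustration with toy signals, not the stimulus-generation code used in the study.

```python
# Generic sketch: mix a clean speech signal with noise at a target SNR in dB
# (negative values make the noise stronger than the speech).
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    noise = noise[:len(speech)]                 # align lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # toy "speech"
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=-5)                  # -5 dB input SNR
print(noisy.shape)
```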

Citations: 0
Back to grammar: Using grammatical error correction to automatically assess L2 speaking proficiency
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2023-12-12, DOI: 10.1016/j.specom.2023.103025
Stefano Bannò, Marco Matassoni

In an interconnected world where English has become the lingua franca of culture, entertainment, business, and academia, the growing demand for learning English as a second language (L2) has led to an increasing interest in automatic approaches for assessing spoken language proficiency. In this regard, mastering grammar is one of the key elements of L2 proficiency.

In this paper, we illustrate an approach to L2 proficiency assessment and feedback based on grammatical features using only publicly available data for training and a small proprietary dataset for testing. Specifically, we implement it in a cascaded fashion, starting from learners’ utterances, investigating disfluency detection, exploring spoken grammatical error correction (GEC), and finally using grammatical features extracted with the spoken GEC module for proficiency assessment.

We compare this grading system to a BERT-based grader and find that the two systems have similar performances when using manual transcriptions, but their combination brings significant improvements to the assessment performance and enhances validity and explainability. In contrast, when using automatic transcriptions, the GEC-based grader obtains better results than the BERT-based grader.

The results obtained are discussed and evaluated with appropriate metrics across the proposed pipeline.
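As a sketch of the final stage of such a cascade, suppose the grammatical features are simple normalized counts of edits between a learner transcript and its GEC-corrected version, fed to a linear regressor; the feature set, data, and regressor below are illustrative assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: derive edit-based grammatical features from (transcript,
# corrected transcript) pairs and fit a regressor to proficiency scores.
import difflib
from sklearn.linear_model import Ridge

def grammatical_features(original, corrected):
    """Count inserted/deleted/replaced words between learner text and its GEC output."""
    ops = difflib.SequenceMatcher(None, original.split(), corrected.split()).get_opcodes()
    counts = {"insert": 0, "delete": 0, "replace": 0}
    for tag, i1, i2, j1, j2 in ops:
        if tag in counts:
            counts[tag] += max(i2 - i1, j2 - j1)
    n_words = max(len(original.split()), 1)
    return [counts["insert"] / n_words, counts["delete"] / n_words, counts["replace"] / n_words]

# Toy data: learner utterances, their GEC corrections, and invented proficiency scores.
pairs = [("she go to school yesterday", "she went to school yesterday", 3.0),
         ("i am agree with this idea", "i agree with this idea", 3.5),
         ("he has finished the work already", "he has finished the work already", 5.0)]
X = [grammatical_features(o, c) for o, c, _ in pairs]
y = [s for _, _, s in pairs]
grader = Ridge(alpha=1.0).fit(X, y)
print(grader.predict(X))
```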

Citations: 0
Speakers’ vocal expression of sexual orientation depends on experimenter gender
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2023-12-04, DOI: 10.1016/j.specom.2023.103023
Sven Kachel, Adrian P. Simpson, Melanie C. Steffens

Since the early days of (phonetic) convergence research, one of the main questions has been which individuals are more likely to adapt their speech to others. Differences between women and men, in particular, have been researched intensively. Using a differential approach as well, we complement the existing literature by focusing on another gender-related characteristic, namely sexual orientation. The present study aims to investigate whether and how women differing in sexual orientation vary in their speaking behavior, especially mean fundamental frequency (f0), in the presence of a female vs. male experimenter. Lesbian (n = 19) and straight female speakers (n = 18) engaged in two interactions each: first with either a female or a male experimenter, and then with the other-gender experimenter (counter-balanced and random assignment to conditions). For each interaction, recordings of read and spontaneous speech were collected. Analyses of read speech demonstrated mirroring of the first experimenter’s mean f0, which persisted even in the presence of the second experimenter. In spontaneous speech, this order effect interacted with exclusiveness of sexual orientation: mirroring was found for participants who reported being exclusively lesbian/straight, not for those who reported being mainly lesbian/straight. We discuss implications for studies on convergence and research practice in general.
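For illustration, mean f0 per recording can be estimated with a standard pitch tracker; the sketch below uses librosa's pYIN implementation on placeholder file names and is an assumption about tooling, not the measurement procedure used in the study.

```python
# Hypothetical sketch: estimate mean fundamental frequency (f0) of a recording
# with librosa's pYIN pitch tracker and compare speaker and experimenter means.
import numpy as np
import librosa

def mean_f0(path):
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C6"), sr=sr)
    return float(np.nanmean(f0[voiced_flag]))  # average over voiced frames only

# Placeholder file names for a participant and the experimenter she interacted with.
participant_f0 = mean_f0("participant_read_speech.wav")
experimenter_f0 = mean_f0("experimenter_read_speech.wav")
print(f"difference in mean f0: {participant_f0 - experimenter_f0:.1f} Hz")
```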

Citations: 0
Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN
IF 3.2, CAS Tier 3 (Computer Science), Q1 (Arts and Humanities), Pub Date: 2023-11-30, DOI: 10.1016/j.specom.2023.103022
Claudio Fernandez-Martín, Adrian Colomer, Claudio Panariello, Valery Naranjo

Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best K generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of K during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.
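A minimal sketch of the Top-K generator update (illustrative PyTorch with a non-saturating loss, not the StarGAN-VC training code): only the K generated samples the discriminator scores as most realistic contribute to the generator loss, and K is typically annealed downward during training.

```python
# Illustrative sketch of a Top-K generator loss: keep only the k fake samples the
# discriminator scores highest (i.e. the best "imitators") for the generator update.
import torch
import torch.nn.functional as F

def top_k_generator_loss(discriminator, fake_batch, k):
    logits = discriminator(fake_batch).view(-1)   # one realness logit per fake sample
    topk_logits, _ = torch.topk(logits, k)        # the k most convincing fakes
    # Non-saturating loss on the selected subset only; k can be annealed from the
    # full batch size down to a fraction of it as training progresses.
    return F.binary_cross_entropy_with_logits(topk_logits, torch.ones_like(topk_logits))

# Toy usage with a stand-in discriminator on 1-second "waveform" batches.
disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16000, 1))
fakes = torch.randn(8, 16000)
print(top_k_generator_loss(disc, fakes, k=4).item())
```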

Citations: 0