Pub Date: 2024-02-01 | DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi, Sudarsana Reddy Kadiri, P. Alku
Title: Pre-trained models for detection and severity level classification of dysarthria from speech
Pub Date: 2024-02-01 | DOI: 10.1016/j.specom.2024.103041
Simon Stone, Peter Birkholz
German primary diphthongs are conventionally transcribed using the same symbols used for some monophthong vowels. However, if the corresponding vocal tract shapes are used for articulatory synthesis, the results often sound unnatural. Furthermore, there is no clear consensus in the literature on whether diphthongs have monophthong constituents and, if so, which ones. This study therefore analyzed a set of audio recordings from the reference speaker of the state-of-the-art articulatory synthesizer VocalTractLab to identify likely candidates for the monophthong constituents of the German primary diphthongs. We then evaluated these candidates in a listening experiment with naive listeners to determine a naturalness ranking of these candidates and specialized diphthong shapes. The results showed that the German primary diphthongs can indeed be synthesized with no significant loss in naturalness by replacing the specialized diphthong shapes for the initial and final segments with shapes also used for monophthong vowels.
Title: Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs
Pub Date: 2024-01-13 | DOI: 10.1016/j.specom.2024.103038
Ingy Farouk Emara, Nabil Hamdy Shaker
The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL teachers and 70 Egyptian university students towards the L1 (Arabic)-based errors affecting intelligibility, and then carried out a data analysis of the ASR of the students’ English speech to find out whether the investigated errors resulted in intelligibility breakdowns in an ASR setting. In terms of the phonological features of non-native speech, the results showed that the teachers gave more weight to pronunciation features of accented speech that did not actually hinder recognition, that the students were mostly oblivious to the L2 errors they made and their impact on intelligibility, and that L2 errors which were not perceived as serious by either teachers or students had negative impacts on ASR accuracy levels. With regard to the prosodic features of non-native speech, it was found that lower speech rates resulted in more accurate speech recognition, higher speech intensity led to fewer deletion errors, and voice pitch did not seem to have any impact on ASR accuracy. The study accordingly recommends training ASR systems with more non-native data to increase their accuracy, as well as paying more attention to remedying the L1-based errors of non-native speakers that are more likely to affect non-native automatic speech recognition.
Title: The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy
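The ASR accuracy levels discussed above are conventionally quantified as word error rate (WER). As a generic illustration (not code from the study), WER can be computed with a word-level Levenshtein distance:

```python
def wer(ref, hyp):
    """Word error rate: minimum word-level edits (substitutions,
    insertions, deletions) to turn hyp into ref, divided by |ref|."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)
```

A substitution-heavy pronunciation error and a deletion error each add one edit, which is how the deletion-error effect reported above would surface in the metric.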
Pub Date: 2024-01-02 | DOI: 10.1016/j.specom.2023.103027
Wei-Cheng Lin, Carlos Busso
Deep clustering is a popular unsupervised technique for feature representation learning. We recently proposed the chunk-based DeepEmoCluster framework for speech emotion recognition (SER), adopting the concept of deep clustering as a novel semi-supervised learning (SSL) framework; it achieved improved recognition performance over conventional reconstruction-based approaches. However, the vanilla DeepEmoCluster lacks critical sentence-level temporal information that is useful for SER tasks. This study builds upon the DeepEmoCluster framework, creating a powerful SSL approach that leverages temporal information within a sentence. We propose two sentence-level temporal modeling alternatives using either the temporal-net or the triplet loss function, resulting in a novel temporal-enhanced DeepEmoCluster framework that captures essential temporal information. The key contribution to achieving this goal is the proposed sentence-level uniform sampling strategy, which preserves the original temporal order of the data for the clustering process. An extra network module (e.g., a gated recurrent unit) is utilized for the temporal-net option to encode temporal information across the data chunks. Alternatively, we can impose additional temporal constraints by using the triplet loss function while training the DeepEmoCluster framework, which does not increase model complexity. Our experimental results on the MSP-Podcast corpus demonstrate that the proposed temporal-enhanced framework significantly outperforms the vanilla DeepEmoCluster framework and other existing SSL approaches in regression tasks for the emotional attributes arousal, dominance, and valence. The improvements are observed in both fully-supervised and SSL implementations. Further analyses validate the effectiveness of the proposed temporal modeling, showing (1) high temporal consistency in the cluster assignments, and (2) well-separated emotional patterns in the generated clusters.
Title: Deep temporal clustering features for speech emotion recognition
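The triplet-loss option imposes temporal structure without adding parameters. The underlying hinge objective, in a minimal generic sketch (our illustration, not the authors' code; in the temporal setting the anchor and positive would be chunks close in time and the negative a distant chunk):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: pull the anchor toward the positive embedding and
    push it away from the negative embedding by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so already well-ordered triplets contribute no gradient.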
Pub Date: 2023-12-24 | DOI: 10.1016/j.specom.2023.103028
Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan
Researchers have shown a growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker from audio. We use a U-Net architecture with residual CBAM to better encode and fuse audio and visual information. Additionally, a semantic alignment module extends the receptive field of the generator network to efficiently obtain the spatial and channel information of the visual features, and matches the statistical information of the visual features with the audio latent vector to adjust and inject the audio content information into the visual information. To achieve exact lip synchronization and generate realistic, high-quality images, our approach adopts an LPIPS loss, which simulates human judgment of image quality and reduces the possibility of instability during training. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality, as demonstrated by subjective and objective evaluation results.
Title: LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild
Pub Date: 2023-12-14 | DOI: 10.1016/j.specom.2023.103024
Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li, Jianwu Dang
Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called the auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet), which integrates our AMME with CANet. First, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the transmission of sound signals to inner-ear hair cells and the subsequent modulation filtering by neural cells. Second, building upon the masking effects observed in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism, resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Third, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD. This methodology uses an attention mechanism to assign higher weights to contextual information containing richer and more informative cues. Through extensive experimentation and evaluation, we demonstrate the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.
Title: Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
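The contextual weighting described above is, at its core, a softmax attention that assigns larger weights to more informative context frames. A minimal sketch under our own naming (not the paper's implementation):

```python
import numpy as np

def attention_pool(frames, query):
    """Scaled dot-product attention over context frames.

    frames: (T, D) matrix of frame features; query: (D,) vector.
    Returns the attention-weighted frame summary and the weights.
    """
    scores = frames @ query / np.sqrt(frames.shape[1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames, weights
```

Frames most similar to the query receive the largest weights, so richer contextual cues dominate the pooled representation.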
Pub Date: 2023-12-14 | DOI: 10.1016/j.specom.2023.103026
Yunqi C. Zhang, Yusuke Hioka, C.T. Justine Hui, Catherine I. Watson
Speech enhancement (SE) is a widely used technology for improving the quality and intelligibility of noisy speech. So far, SE algorithms have been designed and evaluated on native listeners only, not on non-native listeners, who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed New Zealand English (NZE) listeners and on native Mandarin listeners with different immersion conditions in NZE, under negative input signal-to-noise ratios (SNR), by conducting a subjective listening test on NZE sentences. The performance of the SE algorithms in terms of speech intelligibility was investigated in the three participant groups. The results showed that the early-immersed group always achieved the highest intelligibility. The late-immersed group outperformed the non-immersed group at the higher input SNR conditions, possibly due to increasing familiarity with the NZE accent, whereas this advantage disappeared at the lowest tested input SNR condition. The SE algorithms tested in this study failed to improve, and instead degraded, speech intelligibility, indicating that these SE algorithms may not be able to reduce the perception gap between early-, late- and non-immersed listeners, nor improve speech intelligibility under negative input SNR in general. These findings have implications for the future development of SE algorithms tailored to Mandarin listeners, and for understanding the impact of language immersion on speech perception in noise.
Title: Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English
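"Negative input SNR" above means the noise power exceeds the speech power. For reference, the standard definition (a generic sketch, not tied to the study's materials):

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(speech power / noise power).

    Negative values mean the noise is stronger than the speech.
    """
    p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)
```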
Pub Date: 2023-12-12 | DOI: 10.1016/j.specom.2023.103025
Stefano Bannò, Marco Matassoni
In an interconnected world where English has become the lingua franca of culture, entertainment, business, and academia, the growing demand for learning English as a second language (L2) has led to an increasing interest in automatic approaches for assessing spoken language proficiency. In this regard, mastering grammar is one of the key elements of L2 proficiency.
In this paper, we illustrate an approach to L2 proficiency assessment and feedback based on grammatical features using only publicly available data for training and a small proprietary dataset for testing. Specifically, we implement it in a cascaded fashion, starting from learners’ utterances, investigating disfluency detection, exploring spoken grammatical error correction (GEC), and finally using grammatical features extracted with the spoken GEC module for proficiency assessment.
We compare this grading system to a BERT-based grader and find that the two systems have similar performance when using manual transcriptions, but their combination brings significant improvements to the assessment performance and enhances validity and explainability. When using automatic transcriptions, however, the GEC-based grader obtains better results than the BERT-based grader.
The results obtained are discussed and evaluated with appropriate metrics across the proposed pipeline.
Title: Back to grammar: Using grammatical error correction to automatically assess L2 speaking proficiency
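One simple way grammatical features can be derived from a GEC module is to count the word-level edits between a learner's utterance and its corrected form; fewer edits per word suggests higher grammatical accuracy. This is a hypothetical illustration of the idea (the names and the specific feature are ours, not the authors'):

```python
import difflib

def edit_count(learner, corrected):
    """Number of word-level insert/delete/replace operations between
    a learner utterance and its GEC-corrected version."""
    sm = difflib.SequenceMatcher(a=learner.split(), b=corrected.split())
    return sum(1 for tag, *_ in sm.get_opcodes() if tag != "equal")

def error_rate(learner, corrected):
    # Hypothetical proficiency feature: edit operations per word spoken.
    return edit_count(learner, corrected) / max(len(learner.split()), 1)
```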
Pub Date: 2023-12-04 | DOI: 10.1016/j.specom.2023.103023
Sven Kachel, Adrian P. Simpson, Melanie C. Steffens
Since the early days of (phonetic) convergence research, one of the main questions has been which individuals are more likely to adapt their speech to others. Differences between women and men, especially, have been researched intensively. Using a differential approach as well, we complement the existing literature by focusing on another gender-related characteristic, namely sexual orientation. The present study investigates whether and how women differing in sexual orientation vary in their speaking behavior, especially mean fundamental frequency (f0), in the presence of a female vs. a male experimenter. Lesbian (n = 19) and straight female speakers (n = 18) each engaged in two interactions: first with either a female or a male experimenter, and then with an experimenter of the other gender (counter-balanced, with random assignment to conditions). For each interaction, recordings of read and spontaneous speech were collected. Analyses of read speech demonstrated mirroring of the first experimenter’s mean f0, which persisted even in the presence of the second experimenter. In spontaneous speech, this order effect interacted with exclusiveness of sexual orientation: mirroring was found for participants who reported being exclusively lesbian/straight, but not for those who reported being mainly lesbian/straight. We discuss implications for studies on convergence and for research practice in general.
Title: Speakers’ vocal expression of sexual orientation depends on experimenter gender
Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best K generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to introduce the Top-K methodology and how to reduce the value of K during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular (vanilla) training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. These results demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.
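The Top-K generator update described in this abstract can be illustrated in a few lines. The following is a minimal NumPy sketch (not the authors' code), assuming a discriminator that outputs probabilities and a non-saturating generator loss; only the K generated samples the discriminator finds most convincing contribute to the update:

```python
import numpy as np

def top_k_generator_loss(disc_scores, k):
    """Top-K GAN training: keep only the k generated samples the
    discriminator rates as most 'real', and compute the generator's
    non-saturating loss -log D(G(z)) on those samples alone.

    disc_scores: discriminator probabilities D(G(z)) for one batch,
    shape (batch,); higher means a more convincing fake.
    """
    # indices of the k highest-scoring (most convincing) fakes
    top_idx = np.argsort(disc_scores)[-k:]
    top_scores = disc_scores[top_idx]
    # average the non-saturating loss over the top-k samples only
    return -np.mean(np.log(top_scores + 1e-12))

# toy batch of discriminator outputs for 6 generated samples
scores = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.7])
loss_all = -np.mean(np.log(scores + 1e-12))    # vanilla: all samples
loss_topk = top_k_generator_loss(scores, k=3)  # Top-K: best 3 only
```

In the original Top-K formulation for image GANs, K typically starts at the full batch size and is reduced (annealed) over the course of training, which is the schedule question the abstract refers to.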
{"title":"Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN","authors":"Claudio Fernandez-Martín , Adrian Colomer , Claudio Panariello , Valery Naranjo","doi":"10.1016/j.specom.2023.103022","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103022","url":null,"abstract":"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. 
The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138474840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}