
Latest Publications in Speech Communication

Analysis of forced aligner performance on L2 English speech
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-03-01 | DOI: 10.1016/j.specom.2024.103042
Samantha Williams, Paul Foulkes, Vincent Hughes

There is growing interest in how speech technologies perform on L2 speech. Largely omitted from this discussion are tools used in the early data processing steps, such as forced aligners, that can introduce errors and biases. This study adds to the conversation and tests how well a model pre-trained for the alignment of L1 American English speech performs on L2 English speech. We test and discuss the impact of language variety, demographic factors, and segment type on the performance of the forced aligner. We also examine systematic errors encountered.

Forty-five speakers representing nine L2 varieties were selected from the Speech Accent Archive and force-aligned using the Montreal Forced Aligner. The phoneme-level boundary placements were manually corrected in order to assess differences between the automatic and manual alignments. Results show marked variation in performance across language groups and segment types for the two metrics used to assess accuracy: Onset Boundary Displacement, a distance metric between the automatic and manual boundary placements, and Overlap Rate, which indicates to what extent the automatically aligned segment overlaps with the manually aligned segment. The highest accuracy on both measures was obtained for German and French, and the lowest for Russian. The aligner's performance on all varieties was comparable to that on conversational American English and non-standard varieties of English. Furthermore, the percentage of boundary placements within 10 and 20 ms of the corrected boundary was similar to that observed between transcribers. Apart from errors due to variety mismatch, most issues encountered in the alignment were due to problems not exclusive to L2 speech, such as inaccurate orthographic transcriptions, hesitations, specific voice qualities, and background noise.

The results of this study can inform the use of automatic aligners on L2 English speech and provide a baseline of potential errors, offering information to support the development of more robust alignment tools and of automatic systems that use L2 English.
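The two accuracy metrics described in this abstract are straightforward to compute once automatic and corrected boundaries are available. The sketch below assumes segments are represented as (onset, offset) pairs in seconds, and uses one plausible definition of Overlap Rate (intersection divided by the manually aligned segment's duration); the paper's exact normalisation may differ.

```python
# Sketch of the two alignment-accuracy metrics, assuming (onset, offset) tuples in seconds.

def onset_boundary_displacement(auto_onset: float, manual_onset: float) -> float:
    """Absolute distance between the automatic and the manually corrected onset (seconds)."""
    return abs(auto_onset - manual_onset)

def overlap_rate(auto: tuple, manual: tuple) -> float:
    """Share of the manually aligned segment covered by the automatic segment (one possible definition)."""
    overlap = max(0.0, min(auto[1], manual[1]) - max(auto[0], manual[0]))
    duration = manual[1] - manual[0]
    return overlap / duration if duration > 0 else 0.0

def within_tolerance(auto_onset: float, manual_onset: float, tol: float = 0.020) -> bool:
    """True if the automatic boundary lies within `tol` seconds (e.g. 10 or 20 ms) of the corrected one."""
    return onset_boundary_displacement(auto_onset, manual_onset) <= tol

# Example: an automatic segment [0.512, 0.670] s against a corrected segment [0.500, 0.660] s.
print(round(onset_boundary_displacement(0.512, 0.500), 3))      # 0.012 (12 ms)
print(round(overlap_rate((0.512, 0.670), (0.500, 0.660)), 3))   # 0.925
print(within_tolerance(0.512, 0.500, tol=0.010))                # False: outside the 10 ms band
```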

Citations: 0
Speech-driven head motion generation from waveforms
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-03-01 | DOI: 10.1016/j.specom.2024.103056
JinHong Lu, Hiroshi Shimodaira

In the literature, the head motion generation task for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs, as input, plus additional features such as energy and F0. In this paper, we study the direct use of the speech waveform to generate head motion. We claim that creating a task-specific feature from the waveform leads to better overall performance than using standard acoustic features, while completely abandoning the handcrafted feature extraction process makes the approach more effective. However, the difficulty of creating a task-specific feature from the waveform is its staggering quantity of irrelevant information, which implies a potential burden for neural network training. Thus, we apply a canonical-correlation-constrained autoencoder (CCCAE), which compresses the high-dimensional waveform into a low-dimensional embedded feature with minimal reconstruction error while retaining the information that has maximal canonical correlation with head motion. We extend our previous research by including more speakers in our dataset and by adapting a recurrent neural network, to show the feasibility of the proposed feature. Through comparisons between different acoustic features, our proposed feature, WavCCCAE, shows at least a 20% improvement in correlation over the raw waveform and outperforms the popular acoustic feature, MFCC, by at least 5% for all speakers. In the feedforward neural network regression (FNN-regression) system, the WavCCCAE-based system shows comparable performance in objective evaluation. In long short-term memory (LSTM) experiments, LSTM models improve the overall performance on normalised mean square error (NMSE) and CCA metrics and adapt the WavCCCAE feature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. We also re-designed the subjective evaluation, and the results show that participants of the MUSHRA test judged the animations generated by the WavCCCAE-based models to be better than those generated by the other models.
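A minimal sketch of the idea behind a canonical-correlation-constrained autoencoder may help make the abstract concrete: the bottleneck embedding is trained with a reconstruction loss plus a correlation term that ties it to the head-motion target. The layer sizes, the single linear projection used as a correlation surrogate, and the loss weighting below are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal PyTorch sketch of a canonical-correlation-constrained autoencoder (CCCAE):
# an autoencoder on windowed waveform frames whose bottleneck is also pushed, via a
# correlation term, towards the head-motion target. Sizes and weights are illustrative.
import torch
import torch.nn as nn

class CCCAE(nn.Module):
    def __init__(self, in_dim=1024, emb_dim=32, motion_dim=6):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))
        self.proj = nn.Linear(emb_dim, motion_dim, bias=False)  # projection used by the correlation term

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def correlation_loss(a, b, eps=1e-8):
    """Negative mean Pearson correlation over the batch (a simple surrogate for the CCA constraint)."""
    a = (a - a.mean(0)) / (a.std(0) + eps)
    b = (b - b.mean(0)) / (b.std(0) + eps)
    return -(a * b).mean()

model = CCCAE()
frames = torch.randn(16, 1024)        # dummy batch of windowed waveform frames
head_motion = torch.randn(16, 6)      # dummy head-pose targets for the same frames
z, recon = model(frames)
loss = nn.functional.mse_loss(recon, frames) + 0.5 * correlation_loss(model.proj(z), head_motion)
loss.backward()
print(float(loss))
```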

Citations: 0
PLDE: A lightweight pooling layer for spoken language recognition
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-23 | DOI: 10.1016/j.specom.2024.103055
Zimu Li, Yanyan Xu, Dengfeng Ke, Kaile Su

In recent years, the transfer learning method of replacing acoustic features with phonetic features has become a new paradigm for end-to-end spoken language recognition. However, these larger transfer learning models always encode too much redundant information. In this paper, we propose a lightweight language recognition decoder based on a phonetic learnable dictionary encoding (PLDE) layer, which is more suitable for phonetic features and achieves better recognition performance while significantly reducing the number of parameters. The lightweight decoder consists of three main parts: (1) a phonetic learnable dictionary with ghost clusters, which improves the traditional LDE pooling layer and enhances the model’s ability to model noise with ghost clusters; (2) coarse-grained chunk-level pooling, which can highlight the phone sequence and suppress noise around ghost clusters, and hence reduce their influence on the subsequent network; (3) fine-grained chunk-level projection, which enables the discriminative network to obtain more linguistic information and hence improves the model’s modelling ability. These three parts simplify the language recognition decoder into a PLDE pooling layer, reducing the parameter size of the decoder by at least one order of magnitude while achieving better recognition performance. In experiments on the OLR2020 dataset, the proposed method surpasses the current state-of-the-art language recognition system in terms of Cavg, achieving 24.68% and 42.24% improvements on the cross-channel test set and the unknown-noise test set, respectively. Furthermore, experimental results on the OLR2021 dataset also demonstrate the effectiveness of PLDE.
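The sketch below illustrates the dictionary-encoding-with-ghost-clusters mechanism that the PLDE layer builds on: frames are softly assigned to learnable centres, residuals are pooled per centre, and the ghost centres, meant to absorb noise, are discarded from the utterance-level output. All dimensions and the Gaussian-style assignment are illustrative assumptions rather than the paper's exact design.

```python
# Minimal PyTorch sketch of learnable-dictionary-encoding pooling with ghost clusters.
import torch
import torch.nn as nn

class GhostLDEPooling(nn.Module):
    def __init__(self, feat_dim=256, n_clusters=32, n_ghost=2):
        super().__init__()
        self.n_clusters = n_clusters
        total = n_clusters + n_ghost
        self.centres = nn.Parameter(torch.randn(total, feat_dim) * 0.1)   # learnable dictionary
        self.scale = nn.Parameter(torch.ones(total))                      # per-centre assignment sharpness

    def forward(self, x):                                                  # x: (batch, frames, feat_dim)
        dist = ((x.unsqueeze(2) - self.centres) ** 2).sum(-1)             # (batch, frames, total)
        assign = torch.softmax(-self.scale * dist, dim=-1)                # soft assignment of frames to centres
        resid = x.unsqueeze(2) - self.centres                             # residuals to every centre
        pooled = (assign.unsqueeze(-1) * resid).mean(dim=1)               # (batch, total, feat_dim)
        pooled = pooled[:, :self.n_clusters]                              # drop ghost clusters (noise absorbers)
        return pooled.flatten(1)                                          # utterance-level vector

phonetic_feats = torch.randn(4, 200, 256)                                 # dummy frame-level phonetic features
print(GhostLDEPooling()(phonetic_feats).shape)                            # torch.Size([4, 8192])
```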

Citations: 0
Pre-trained models for detection and severity level classification of dysarthria from speech
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-14 | DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi, Sudarsana Reddy Kadiri, Paavo Alku

Automatic detection and severity level classification of dysarthria from speech enables non-invasive and effective diagnosis that helps clinical decisions about medication and therapy of patients. In this work, three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) are studied to extract features to build automatic detection and severity level classification systems for dysarthric speech. The experiments were conducted using two publicly available databases (UA-Speech and TORGO). One machine learning-based model (support vector machine, SVM) and one deep learning-based model (convolutional neural network, CNN) were used as classifiers. In order to compare the performance of the wav2vec2-BASE, wav2vec2-LARGE, and HuBERT features, three popular acoustic feature sets, namely, mel-frequency cepstral coefficients (MFCCs), openSMILE and the extended Geneva minimalistic acoustic parameter set (eGeMAPS), were considered. Experimental results revealed that the features derived from the pre-trained models outperformed the three baseline features. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features. In particular, when compared to the best-performing baseline feature (openSMILE), the HuBERT features showed absolute accuracy improvements in the detection problem that varied between 1.33% (the SVM classifier, the TORGO database) and 2.86% (the SVM classifier, the UA-Speech database). In the severity level classification problem, the HuBERT features showed absolute accuracy improvements that varied between 6.54% (the SVM classifier, the TORGO database) and 10.46% (the SVM classifier, the UA-Speech database) compared to the best-performing baseline feature (eGeMAPS).
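A minimal sketch of the feature-extraction-plus-classifier pipeline described above: utterance-level embeddings are obtained by mean-pooling the frame-level outputs of a pre-trained HuBERT model and fed to an SVM. The checkpoint name, the mean pooling, and the dummy labels are assumptions; the study's exact configuration may differ.

```python
# Sketch: mean-pooled HuBERT embeddings + SVM, standing in for the detection pipeline.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel
from sklearn.svm import SVC

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def utterance_embedding(waveform_16k: np.ndarray) -> np.ndarray:
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state          # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()             # mean-pool over time

# Dummy waveforms and labels standing in for UA-Speech / TORGO utterances.
X = np.stack([utterance_embedding(np.random.randn(16000).astype(np.float32)) for _ in range(8)])
y = np.array([0, 1] * 4)                                     # 0 = healthy, 1 = dysarthric (dummy)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:2]))
```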

Citations: 0
On intrusive speech quality measures and a global SNR based metric
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-14 | DOI: 10.1016/j.specom.2024.103044
Chao Pan, Jingdong Chen, Jacob Benesty

Measuring the quality of noisy speech signals has been an increasingly important problem in the field of speech processing as more and more speech-communication and human-machine-interface systems are deployed in practical applications. In this paper, we study four widely used classical performance measures: signal-to-distortion ratio (SDR), short-time objective intelligibility (STOI), signal-to-noise ratio (SNR), and perceptual evaluation of speech quality (PESQ). By analyzing these performance measures under the same framework and identifying the relationship between their core parameters, we convert these measures into corresponding equivalent SNRs. This conversion enables not only some new insights into different quality measures but also a way to combine these measures into a new metric. In the derivation of the equivalent SNRs, we introduce the widely used masking technique into the computation of correlation coefficients, which is subsequently used to analyze STOI. Furthermore, we propose an attention method to compute the core parameters of PESQ, and also an empirical formula to project the equivalent SNRs onto PESQ scores. Experiments are carried out, and the results justify the properties of the derived quality measures.
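For reference, the sketch below computes the plain global SNR around which the equivalent-SNR view is organised, treating the difference between the processed and clean signals as the noise/distortion component. The paper's mappings from SDR, STOI and PESQ to equivalent SNRs are not reproduced here.

```python
# Global SNR in dB, with the residual (processed - clean) treated as the noise/distortion term.
import numpy as np

def global_snr_db(clean: np.ndarray, processed: np.ndarray) -> float:
    residual = processed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-12))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
processed = clean + 0.1 * rng.standard_normal(16000)    # roughly 20 dB SNR by construction
print(round(global_snr_db(clean, processed), 1))        # close to 20.0
```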

Citations: 0
Deletion and insertion tampering detection for speech authentication based on fluctuating super vector of electrical network frequency
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-12 | DOI: 10.1016/j.specom.2024.103046
Chunyan Zeng, Shuai Kong, Zhifeng Wang, Shixiong Feng, Nan Zhao, Juan Wang

Current digital speech deletion and insertion tampering detection methods mainly employ the extraction of phase and frequency features of the Electrical Network Frequency (ENF). However, there are some problems with the existing approaches, such as the alignment problem for speech samples with different durations, the sparsity of ENF features, and the small number of tampered speech samples available for training, which lead to low accuracy of deletion and insertion tampering detection. Therefore, this paper proposes a tampering detection method for digital speech deletion and insertion based on the ENF Fluctuation Super-vector (ENF-FSV) and deep feature learning representation. By extracting the parameters of the ENF phase and frequency fitting curves, feature alignment and dimensionality reduction are achieved, and the alignment and sparsity problems are avoided while the fluctuation information of phase and frequency is extracted. To address the insufficient number of tampered speech samples for training, an ENF Universal Background Model (ENF-UBM) is built from a large number of untampered speech samples, and the mean vector is updated to extract the ENF-FSV. Considering that a shallow representation of ENF features does not highlight the important ones, we construct an end-to-end deep neural network that uses an attention mechanism to strengthen the focus on abrupt fluctuation information and enhance the representational power of the ENF-FSV features; the deep ENF-FSV features extracted by the Residual Network (ResNet) module are then fed to the designed classification network for tampering detection. The experimental results show that the method in this paper exhibits higher accuracy and better robustness on the Carioca, New Spanish, and ENF High-sampling Group (ENF-HG) databases when compared with state-of-the-art methods.
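The ENF phase and frequency traces from which the fitting-curve parameters are derived can be obtained, in one common approach, by band-pass filtering the recording around the nominal mains frequency and taking the instantaneous phase and frequency of the analytic signal. The sketch below follows that generic recipe; the 50 Hz nominal frequency, bandwidth and filter order are assumptions, and the paper's ENF front-end may differ.

```python
# One common ENF front-end: narrow band-pass around the nominal mains frequency,
# then instantaneous phase/frequency from the analytic (Hilbert) signal.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def enf_phase_and_frequency(audio: np.ndarray, fs: int, nominal: float = 50.0, bw: float = 0.6):
    sos = butter(4, [nominal - bw, nominal + bw], btype="band", fs=fs, output="sos")
    narrow = sosfiltfilt(sos, audio)
    phase = np.unwrap(np.angle(hilbert(narrow)))
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)      # Hz, one sample shorter than phase
    return phase, inst_freq

fs = 800                                                 # a low rate is enough for a 50 Hz band
t = np.arange(0, 10, 1 / fs)
recording = 0.05 * np.sin(2 * np.pi * 50.02 * t) + 0.1 * np.random.randn(t.size)  # hum + noise
phase, freq = enf_phase_and_frequency(recording, fs)
print(round(float(np.median(freq)), 3))                  # close to the embedded 50.02 Hz
```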

Citations: 0
Some properties of mental speech preparation as revealed by self-monitoring
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-09 | DOI: 10.1016/j.specom.2024.103043
Hugo Quené, Sieb G. Nooteboom

The main goal of this paper is to improve our insight into the mental preparation of speech, based on speakers' self-monitoring behavior. To this end we re-analyze the aggregated responses from earlier published experiments eliciting speech sound errors. The re-analyses confirm or show that (1) “early” and “late” detections of elicited speech sound errors can be distinguished, with a time delay on the order of 500 ms; (2) a main cause for some errors being detected “early”, others “late” and others not at all is the size of the phonetic contrast between the error and the target speech sound; (3) repairs of speech sound errors stem from competing (and sometimes active) word candidates. These findings lead to some speculative conclusions regarding the mental preparation of speech. First, there are two successive stages of mental preparation, an “early” and a “late” stage. Second, at the “early” stage of speech preparation, speech sounds are represented as targets in auditory perceptual space, and at the “late” stage as the coordinated motor commands necessary for articulation. Third, repairs of speech sound errors stem from response candidates competing for the same slot as the error form, and some activation is often sustained until after articulation.

Citations: 0
Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-09 | DOI: 10.1016/j.specom.2024.103045
Francesco Sigona, Mirko Grimaldi

In this work, we tested different variants of a Forensic Automatic Speaker Recognition (FASR) system based on Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN). To this end, conditions reflecting those of a real forensic voice comparison case have been taken into consideration according to the forensic_eval_01 evaluation campaign settings. Using this recent neural model as an embedding extraction block, various normalization strategies at the level of embeddings and scores allowed us to observe the variations in system performance in terms of discriminating power, accuracy and precision metrics. Our findings suggest that the ECAPA-TDNN can be successfully used as a base component of a FASR system, managing to surpass the previous state of the art, at least in the context of the considered operating conditions.
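As an illustration of the embedding-extraction block, the sketch below uses a publicly available pre-trained ECAPA-TDNN to embed two recordings and scores the pair with cosine similarity. The SpeechBrain checkpoint name is an assumption, and a forensic system would add the score normalisation and calibration steps discussed in the paper on top of such a raw score.

```python
# Sketch: pre-trained ECAPA-TDNN embeddings scored with cosine similarity.
import torch
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(waveform_16k: torch.Tensor) -> torch.Tensor:
    return encoder.encode_batch(waveform_16k).squeeze()       # 192-dimensional speaker embedding

questioned = torch.randn(1, 16000)     # dummy 1 s clips standing in for the case recordings
reference = torch.randn(1, 16000)
score = torch.nn.functional.cosine_similarity(embed(questioned), embed(reference), dim=0)
print(float(score))                     # raw similarity; calibration/normalisation would follow
```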

Citations: 0
Automatic classification of neurological voice disorders using wavelet scattering features
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-01 | DOI: 10.1016/j.specom.2024.103040
Madhu Keerthana Yagnavajjula, Kiran Reddy Mittapalle, Paavo Alku, Sreenivasa Rao K., Pabitra Mitra

Neurological voice disorders are caused by problems in the nervous system as it interacts with the larynx. In this paper, we propose to use wavelet scattering transform (WST)-based features in automatic classification of neurological voice disorders. As a part of WST, a speech signal is processed in stages with each stage consisting of three operations – convolution, modulus and averaging – to generate low-variance data representations that preserve discriminability across classes while minimizing differences within a class. The proposed WST-based features were extracted from speech signals of patients suffering from either spasmodic dysphonia (SD) or recurrent laryngeal nerve palsy (RLNP) and from speech signals of healthy speakers of the Saarbruecken voice disorder (SVD) database. Two machine learning algorithms (support vector machine (SVM) and feed forward neural network (NN)) were trained separately using the WST-based features, to perform two binary classification tasks (healthy vs. SD and healthy vs. RLNP) and one multi-class classification task (healthy vs. SD vs. RLNP). The results show that WST-based features outperformed state-of-the-art features in all three tasks. Furthermore, the best overall classification performance was achieved by the NN classifier trained using WST-based features.
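A minimal sketch of producing wavelet-scattering features and feeding them to an SVM, in the spirit of the pipeline above, is given below. The Kymatio parameters (J, Q), the time-averaged log-scattering coefficients, and the dummy three-class labels are illustrative assumptions rather than the study's settings.

```python
# Sketch: time-averaged log wavelet-scattering coefficients fed to an SVM.
import numpy as np
from kymatio.numpy import Scattering1D
from sklearn.svm import SVC

T = 2 ** 14                                    # ~1 s at 16 kHz, cropped/padded to a power of two
scattering = Scattering1D(J=8, shape=T, Q=8)

def scattering_features(signal: np.ndarray) -> np.ndarray:
    Sx = scattering(signal.astype(np.float32))         # (n_coefficients, time)
    return np.log1p(np.abs(Sx)).mean(axis=-1)          # time-averaged log coefficients

rng = np.random.default_rng(1)
X = np.stack([scattering_features(rng.standard_normal(T)) for _ in range(6)])
y = np.array([0, 1, 2] * 2)                    # dummy labels: 0 = healthy, 1 = SD, 2 = RLNP
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```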

Citations: 0
AVID: A speech database for machine learning studies on vocal intensity
IF 3.2 | CAS Tier 3 (Computer Science) | Q1 (Arts and Humanities) | Pub Date: 2024-02-01 | DOI: 10.1016/j.specom.2024.103039
Paavo Alku, Manila Kodali, Laura Laaksonen, Sudarsana Reddy Kadiri

Vocal intensity, which is quantified typically with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with speech. However, most of the popular databases that are used in areas such as speech and speaker recognition have been recorded without calibration information by expressing speech on arbitrary amplitude scales. Therefore, information about the vocal intensity of the recorded speech, including SPL, is lost. In the current study, we introduce a new open and calibrated speech/electroglottography (EGG) database named the Aalto Vocal Intensity Database (AVID). AVID includes speech and EGG produced by 50 speakers (25 males, 25 females) who varied their vocal intensity in four categories (soft, normal, loud and very loud). Recordings were conducted using a constant mouth-to-microphone distance and by recording a calibration tone. The speech data was labelled sentence-wise using a total of 19 labels that support the utilisation of the data in machine learning (ML)-based studies of vocal intensity based on supervised learning. In order to demonstrate how the AVID data can be used to study vocal intensity, we investigated one multi-class classification task (classification of speech into soft, normal, loud and very loud intensity classes) and one regression task (prediction of the SPL of speech). In both tasks, we deliberately warped the level of the input speech by normalising the signal to have its maximum amplitude equal to 1.0, that is, we simulated a scenario that is prevalent in current speech databases. The results show that using the spectrogram feature with the support vector machine classifier gave an accuracy of 82% in the multi-class classification of the vocal intensity category. In the prediction of SPL, using the spectrogram feature with the support vector regressor gave a mean absolute error of about 2 dB and a coefficient of determination of 92%. We welcome researchers interested in classification and regression problems to utilise AVID in the study of vocal intensity, and we hope that the current results can serve as baselines for future ML studies on the topic.
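The role of the calibration tone can be illustrated with a short sketch: because the tone's SPL is known (e.g. 94 dB), the SPL of any utterance recorded on the same arbitrary amplitude scale follows from the ratio of RMS values. The per-utterance RMS used below is an assumption about the analysis window; the database may define SPL over different segments.

```python
# Recovering SPL from an arbitrarily scaled recording via the calibration tone.
import numpy as np

def speech_spl_db(speech: np.ndarray, cal_tone: np.ndarray, cal_spl_db: float = 94.0) -> float:
    rms_speech = np.sqrt(np.mean(speech ** 2))
    rms_cal = np.sqrt(np.mean(cal_tone ** 2))
    return cal_spl_db + 20.0 * np.log10(rms_speech / rms_cal)

fs = 16000
t = np.arange(fs) / fs
cal_tone = 0.5 * np.sin(2 * np.pi * 1000 * t)     # calibration tone recorded at a known 94 dB SPL
speech = 0.1 * np.random.randn(3 * fs)            # dummy utterance on the same arbitrary scale
print(round(speech_spl_db(speech, cal_tone), 1))  # about 83 dB for this example
```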

Citations: 0