Exploring Speaker Age Estimation on Different Self-Supervised Learning Models

Duc-Tuan Truong, Tran The Anh, Chng Eng Siong
Published in: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Publication date: 2022-11-07
DOI: 10.23919/APSIPAASC55919.2022.9979878 (https://doi.org/10.23919/APSIPAASC55919.2022.9979878)
Citations: 1

Abstract

Self-supervised learning (SSL) has played an important role in many speech and audio processing tasks. However, there is limited research on adapting SSL models to predict a speaker's age and gender from speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec, on the joint age estimation and gender classification task using the TIMIT corpus. We also study how the choice of hidden encoder layer within these models affects the age estimation result. Furthermore, we evaluate how the age estimation performance of the different SSL models varies under simulated noisy conditions. The simulated noisy speech is created by mixing clean utterances from the TIMIT test set with random noise from the Music and Noise categories of the MUSAN corpus at multiple signal-to-noise ratio (SNR) levels. Our findings confirm that a recent SSL model, WavLM, obtains better and more robust speech representations than the wav2vec 2.0 model used in the current state-of-the-art (SOTA) approach, achieving 3.6% and 11.32% reductions in mean absolute error (MAE) on the clean and 5 dB SNR TIMIT test sets, respectively.
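The noisy-condition setup and the reported metric can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mix_at_snr`, `mean_absolute_error`, `relative_reduction`) are hypothetical, signals are plain lists of samples, and tiling a noise clip that is shorter than the utterance is an assumption of this sketch that the abstract does not specify.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR in dB.

    Returns the mixed signal and the scaled noise that was added.
    """
    if len(noise) < len(clean):
        # Assumption of this sketch: repeat the noise clip if it is too short.
        reps = math.ceil(len(clean) / len(noise))
        noise = list(noise) * reps
    noise = list(noise)[:len(clean)]

    # SNR_dB = 10 * log10(P_clean / P_noise), solved for the noise power.
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))

    scaled = [scale * n for n in noise]
    mixed = [c + n for c, n in zip(clean, scaled)]
    return mixed, scaled


def mean_absolute_error(predicted, actual):
    """MAE: average absolute difference between predicted and true ages."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers `baseline`."""
    return 100.0 * (baseline - improved) / baseline
```

Under these assumptions, a call such as `mix_at_snr(utterance, noise_clip, 5)` would produce the 5 dB condition, and `relative_reduction(baseline_mae, new_mae)` expresses an MAE improvement as a percentage of the baseline, which is how the reductions quoted in the abstract are phrased.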