不同自监督学习模型对说话人年龄估计的探讨

Duc-Tuan Truong, Tran The Anh, Chng Eng Siong
{"title":"不同自监督学习模型对说话人年龄估计的探讨","authors":"Duc-Tuan Truong, Tran The Anh, Chng Eng Siong","doi":"10.23919/APSIPAASC55919.2022.9979878","DOIUrl":null,"url":null,"abstract":"Self-supervised learning (SSL) has played an important role in various tasks in the field of speech and audio processing. However, there is limited research on adapting these SSL models to predict the speaker's age and gender using speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec in the joint age estimation and gender classification task on the TIMIT corpus. Additionally, we also study the effect of using different hidden encoder layers within these models on the age estimation result. Furthermore, we evaluate how the performance of different SSL models varies in predicting the speaker's age under simulated noisy conditions. The simulated noisy speech is created by mixing the clean utterance from the TIMIT test set with random noises from the Music and Noise category of the MUSAN corpus on multiple levels of signal-to-noise ratio (SNR). Our findings confirm that a recent SSL model, namely WavLM can obtain better and more robust speech representation than wav2vec 2.0 SSL model used in the current state-of-the-art (SOTA) approach by achieving a 3.6% and 11.32% mean average error (MAE) reduction on the clean and 5dB SNR TIMIT test set.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Exploring Speaker Age Estimation on Different Self-Supervised Learning Models\",\"authors\":\"Duc-Tuan Truong, Tran The Anh, Chng Eng Siong\",\"doi\":\"10.23919/APSIPAASC55919.2022.9979878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Self-supervised learning (SSL) has played an important role in various tasks in the field of speech and audio processing. However, there is limited research on adapting these SSL models to predict the speaker's age and gender using speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec in the joint age estimation and gender classification task on the TIMIT corpus. Additionally, we also study the effect of using different hidden encoder layers within these models on the age estimation result. Furthermore, we evaluate how the performance of different SSL models varies in predicting the speaker's age under simulated noisy conditions. The simulated noisy speech is created by mixing the clean utterance from the TIMIT test set with random noises from the Music and Noise category of the MUSAN corpus on multiple levels of signal-to-noise ratio (SNR). Our findings confirm that a recent SSL model, namely WavLM can obtain better and more robust speech representation than wav2vec 2.0 SSL model used in the current state-of-the-art (SOTA) approach by achieving a 3.6% and 11.32% mean average error (MAE) reduction on the clean and 5dB SNR TIMIT test set.\",\"PeriodicalId\":382967,\"journal\":{\"name\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/APSIPAASC55919.2022.9979878\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPAASC55919.2022.9979878","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

自监督学习(SSL)在语音和音频处理领域的各种任务中发挥着重要作用。然而,利用这些SSL模型利用语音信号预测说话人的年龄和性别的研究有限。本文研究了PASE+、NPC、wav2vec 2.0、XLSR、HuBERT、WavLM和data2vec 7种SSL模型在TIMIT语料库上的年龄估计和性别联合分类任务。此外,我们还研究了在这些模型中使用不同的隐藏编码器层对年龄估计结果的影响。此外,我们评估了不同SSL模型在模拟噪声条件下预测说话人年龄的性能变化。通过将TIMIT测试集的干净语音与MUSAN语料库中Music and Noise类别的随机噪声在多个信噪比(SNR)水平上混合,生成模拟噪声语音。我们的研究结果证实,最近的SSL模型,即WavLM,可以获得比当前最先进(SOTA)方法中使用的wav2vec 2.0 SSL模型更好、更稳健的语音表示,在干净和5dB信噪比TIMIT测试集上实现3.6%和11.32%的平均误差(MAE)降低。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Exploring Speaker Age Estimation on Different Self-Supervised Learning Models
Self-supervised learning (SSL) has played an important role in various tasks in the field of speech and audio processing. However, there is limited research on adapting these SSL models to predict the speaker's age and gender using speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec in the joint age estimation and gender classification task on the TIMIT corpus. Additionally, we also study the effect of using different hidden encoder layers within these models on the age estimation result. Furthermore, we evaluate how the performance of different SSL models varies in predicting the speaker's age under simulated noisy conditions. The simulated noisy speech is created by mixing the clean utterance from the TIMIT test set with random noises from the Music and Noise category of the MUSAN corpus on multiple levels of signal-to-noise ratio (SNR). Our findings confirm that a recent SSL model, namely WavLM can obtain better and more robust speech representation than wav2vec 2.0 SSL model used in the current state-of-the-art (SOTA) approach by achieving a 3.6% and 11.32% mean average error (MAE) reduction on the clean and 5dB SNR TIMIT test set.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
相关文献
二甲双胍通过HDAC6和FoxO3a转录调控肌肉生长抑制素诱导肌肉萎缩
IF 8.9 1区 医学Journal of Cachexia, Sarcopenia and MusclePub Date : 2021-11-02 DOI: 10.1002/jcsm.12833
Min Ju Kang, Ji Wook Moon, Jung Ok Lee, Ji Hae Kim, Eun Jeong Jung, Su Jin Kim, Joo Yeon Oh, Sang Woo Wu, Pu Reum Lee, Sun Hwa Park, Hyeon Soo Kim
具有疾病敏感单倍型的非亲属供体脐带血移植后的1型糖尿病
IF 3.2 3区 医学Journal of Diabetes InvestigationPub Date : 2022-11-02 DOI: 10.1111/jdi.13939
Kensuke Matsumoto, Taisuke Matsuyama, Ritsu Sumiyoshi, Matsuo Takuji, Tadashi Yamamoto, Ryosuke Shirasaki, Haruko Tashiro
封面:蛋白质组学分析确定IRSp53和fastin是PRV输出和直接细胞-细胞传播的关键
IF 3.4 4区 生物学ProteomicsPub Date : 2019-12-02 DOI: 10.1002/pmic.201970201
Fei-Long Yu, Huan Miao, Jinjin Xia, Fan Jia, Huadong Wang, Fuqiang Xu, Lin Guo
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Multi-class Vehicle Counting System for Multi-view Traffic Videos Optimal Deep Multi-Route Self-Attention for Single Image Super-Resolution Distance Estimation Between Camera and Vehicles from an Image using YOLO and Machine Learning ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks PVGCRA: Prediction Variance Guided Cross Region Domain Adaptation
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1