Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa
{"title":"使用视觉变形器在人机交互中进行个性化语音情感识别","authors":"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa","doi":"arxiv-2409.10687","DOIUrl":null,"url":null,"abstract":"Emotions are an essential element in verbal communication, so understanding\nindividuals' affect during a human-robot interaction (HRI) becomes imperative.\nThis paper investigates the application of vision transformer models, namely\nViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\npipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\ngeneralize the SER models for individual speech characteristics by fine-tuning\nthese models on benchmark datasets and exploiting ensemble methods. For this\npurpose, we collected audio data from different human subjects having\npseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\nViT and BEiT-based models and tested these models on unseen speech samples from\nthe participants. In the results, we show that fine-tuning vision transformers\non benchmark datasets and and then using either these already fine-tuned models\nor ensembling ViT/BEiT models gets us the highest classification accuracies per\nindividual when it comes to identifying four primary emotions from their\nspeech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\nor BEiTs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers\",\"authors\":\"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa\",\"doi\":\"arxiv-2409.10687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotions are an essential element in verbal communication, so understanding\\nindividuals' affect during a human-robot interaction (HRI) becomes imperative.\\nThis paper investigates the application of vision transformer models, namely\\nViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\\npipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\\ngeneralize the SER models for individual speech characteristics by fine-tuning\\nthese models on benchmark datasets and exploiting ensemble methods. For this\\npurpose, we collected audio data from different human subjects having\\npseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\\nViT and BEiT-based models and tested these models on unseen speech samples from\\nthe participants. 
In the results, we show that fine-tuning vision transformers\\non benchmark datasets and and then using either these already fine-tuned models\\nor ensembling ViT/BEiT models gets us the highest classification accuracies per\\nindividual when it comes to identifying four primary emotions from their\\nspeech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\\nor BEiTs.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.10687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Emotions are an essential element of verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformer) and BEiT (BERT Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is on generalizing the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants. The results show that fine-tuning vision transformers on benchmark datasets and then using either these already fine-tuned models or ensembles of ViT/BEiT models yields the highest per-individual classification accuracies in identifying four primary emotions from speech (neutral, happy, sad, and angry), compared to fine-tuning vanilla ViTs or BEiTs.
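
For illustration, below is a minimal sketch (not the authors' released code) of the kind of pipeline the abstract describes: a speech clip is rendered as a mel-spectrogram "image", passed through ViT and BEiT image classifiers fine-tuned for the four emotion classes, and the two models are ensembled by averaging their logits. The checkpoint names, spectrogram settings, and label order are illustrative assumptions.

# Hedged sketch: speech -> mel-spectrogram image -> ViT/BEiT -> 4-class emotion logits,
# ensembled by logit averaging. Checkpoints and preprocessing are assumptions, not the
# authors' exact setup.
import torch
import torchaudio
from transformers import ViTForImageClassification, BeitForImageClassification

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # four primary emotions from the paper

def wav_to_spectrogram_image(path, sample_rate=16000, n_mels=224, frames=224):
    """Turn a speech clip into a 3x224x224 tensor that a vision transformer accepts."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(dim=0, keepdim=True), sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=1024, n_mels=n_mels)(wav)
    mel = torchaudio.transforms.AmplitudeToDB()(mel)
    # Resize the time axis to a fixed width, normalize, and replicate to 3 channels.
    mel = torch.nn.functional.interpolate(mel.unsqueeze(0), size=(n_mels, frames), mode="bilinear").squeeze(0)
    mel = (mel - mel.mean()) / (mel.std() + 1e-6)
    return mel.repeat(3, 1, 1)

# Hypothetical base checkpoints; in practice these would first be fine-tuned on an SER
# benchmark dataset (new 4-way classification heads are attached here).
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=len(EMOTIONS), ignore_mismatched_sizes=True)
beit = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224", num_labels=len(EMOTIONS), ignore_mismatched_sizes=True)
vit.eval()
beit.eval()

@torch.no_grad()
def predict(path):
    x = wav_to_spectrogram_image(path).unsqueeze(0)  # batch of one clip
    logits = (vit(pixel_values=x).logits + beit(pixel_values=x).logits) / 2  # simple ensemble
    return EMOTIONS[logits.argmax(dim=-1).item()]

In this sketch the per-individual personalization would come from which fine-tuned checkpoints (or which ensemble) are selected for a given participant, mirroring the abstract's comparison of already fine-tuned models and ViT/BEiT ensembles against vanilla fine-tuning.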