Investigation of attention mechanism for speech command recognition

IF 3; Zone 4, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Multimedia Tools and Applications; Pub Date: 2024-09-02; DOI: 10.1007/s11042-024-20129-7
Jie Xie, Mingying Zhu, Kai Hu, Jinglan Zhang, Ya Guo
Citations: 0


Investigation of attention mechanism for speech command recognition

Speech command recognition gives smart-home users a convenient way to interact with digital devices, and deep learning has proven effective for the task. However, few studies have systematically examined how attention mechanisms can improve its performance. This study investigates deep learning architectures for speaker-independent speech command recognition. We first compare log-Mel and log-Gammatone spectrograms using VGG-style and VGG-skip-style networks. The best-performing model is then extended with different attention mechanisms: channel-time attention, channel-frequency attention, and channel-time-frequency attention. Finally, a dual CNN with cross-attention is used for speech command classification. Experiments use a self-collected dataset of 12 command classes from 40 participants, all recorded in Mandarin Chinese on a variety of smartphones in diverse settings. The log-Gammatone spectrogram combined with a VGG-skip-style network and cross-attention performs best, achieving accuracy, precision, recall, and F1-score of 94.59%, 95.84%, 94.64%, and 94.57%, respectively.
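
The abstract names the attention variants but does not give their formulation. As a rough illustration only, the NumPy sketch below shows one simple way such modules could be built on a CNN feature map of shape (channels, frequency, time): a mask pooled over frequency for channel-time attention, a mask pooled over time for channel-frequency attention, and scaled dot-product cross-attention between the outputs of two CNN branches. All function names, tensor shapes, and the parameter-free sigmoid masks are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def channel_time_attention(feat):
    """Reweight a (channels, freq, time) map along channel and time.

    Hypothetical parameter-free form: average over frequency, squash with
    a sigmoid, and broadcast the (channels, 1, time) mask over the input.
    """
    mask = sigmoid(feat.mean(axis=1, keepdims=True))  # (C, 1, T)
    return feat * mask


def channel_frequency_attention(feat):
    """Same idea, pooling over time to get a (channels, freq, 1) mask."""
    mask = sigmoid(feat.mean(axis=2, keepdims=True))  # (C, F, 1)
    return feat * mask


def cross_attention(query_seq, context_seq):
    """Scaled dot-product attention from one branch onto the other.

    query_seq: (Tq, D) frames from branch A.
    context_seq: (Tk, D) frames from branch B.
    Returns the (Tq, D) attended features and the (Tq, Tk) weights.
    """
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)  # (Tq, Tk)
    weights = softmax(scores, axis=-1)
    return weights @ context_seq, weights


# Random stand-ins for the feature maps of two CNN branches (assumed shapes).
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 16, 20))  # branch A: (C, F, T)
feat_b = rng.standard_normal((8, 16, 20))  # branch B: (C, F, T)

ct = channel_time_attention(feat_a)
cf = channel_frequency_attention(feat_a)

# Flatten (C, F, T) -> (T, C*F) so that each time frame becomes one token.
seq_a = feat_a.transpose(2, 0, 1).reshape(20, -1)
seq_b = feat_b.transpose(2, 0, 1).reshape(20, -1)
fused, attn = cross_attention(seq_a, seq_b)

print(ct.shape, cf.shape, fused.shape, attn.shape)
```

In a trained network the masks would normally come from learned layers rather than a bare sigmoid of the pooled activations; the sketch only shows which axes each variant pools over and how the two branches exchange information.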

Source journal
Multimedia Tools and Applications (Engineering and Technology; Engineering: Electrical and Electronic)
CiteScore: 7.20
Self-citation rate: 16.70%
Articles published: 2439
Review time: 9.2 months
About the journal: Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools, as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists, and engineers involved in multimedia system research, design, and applications. All papers are peer reviewed. Specific areas of interest include multimedia tools, multimedia applications, and prototype multimedia systems and platforms.