在 ERB 标度上使用 sincNet 从原始语音中自动识别说话者和儿童年龄

IF 2.4 3区 计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2024-04-01 DOI:10.1016/j.specom.2024.103069
Kodali Radha , Mohan Bansal , Ram Bilas Pachori
{"title":"在 ERB 标度上使用 sincNet 从原始语音中自动识别说话者和儿童年龄","authors":"Kodali Radha ,&nbsp;Mohan Bansal ,&nbsp;Ram Bilas Pachori","doi":"10.1016/j.specom.2024.103069","DOIUrl":null,"url":null,"abstract":"<div><p>This paper presents the newly developed non-native children’s English speech (NNCES) corpus to reveal the findings of automatic speaker and age recognition from raw speech. Convolutional neural networks (CNN), which have the ability to learn low-level speech representations, can be fed directly with raw speech signals instead of using traditional hand-crafted features. Moreover, the filters that were learned using standard CNNs appeared to be noisy because they consider all elements of each filter. In contrast, sincNet can be able to generate more meaningful filters simply by replacing the first convolutional layer by a sinc-layer in standard CNNs. The low and high cutoff frequencies of the rectangular band-pass filter are the only parameters that can be learned in sincNet, which has the potential to extract significant speech cues from the speaker, such as pitch and formants. In this work, the sincNet model is significantly changed by switching from baseline Mel scale initializations to equivalent rectangular bandwidth (ERB) initializations, which has the added benefit of allocating additional filters in the lower region of the spectrum. Additionally, it needs to be highlighted that the novel sincNet model is well suited to identify the age of the children. The investigations on both read and spontaneous speech tasks in speaker identification, gender independent &amp; dependent age-group identification of children outperform the baseline models with varying relative improvements in terms of accuracy.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic speaker and age identification of children from raw speech using sincNet over ERB scale\",\"authors\":\"Kodali Radha ,&nbsp;Mohan Bansal ,&nbsp;Ram Bilas Pachori\",\"doi\":\"10.1016/j.specom.2024.103069\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>This paper presents the newly developed non-native children’s English speech (NNCES) corpus to reveal the findings of automatic speaker and age recognition from raw speech. Convolutional neural networks (CNN), which have the ability to learn low-level speech representations, can be fed directly with raw speech signals instead of using traditional hand-crafted features. Moreover, the filters that were learned using standard CNNs appeared to be noisy because they consider all elements of each filter. In contrast, sincNet can be able to generate more meaningful filters simply by replacing the first convolutional layer by a sinc-layer in standard CNNs. The low and high cutoff frequencies of the rectangular band-pass filter are the only parameters that can be learned in sincNet, which has the potential to extract significant speech cues from the speaker, such as pitch and formants. In this work, the sincNet model is significantly changed by switching from baseline Mel scale initializations to equivalent rectangular bandwidth (ERB) initializations, which has the added benefit of allocating additional filters in the lower region of the spectrum. Additionally, it needs to be highlighted that the novel sincNet model is well suited to identify the age of the children. The investigations on both read and spontaneous speech tasks in speaker identification, gender independent &amp; dependent age-group identification of children outperform the baseline models with varying relative improvements in terms of accuracy.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000414\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000414","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

摘要

本文介绍了新开发的非母语儿童英语语音(NNCES)语料库,以揭示从原始语音自动识别说话人和年龄的研究成果。卷积神经网络(CNN)具有学习低级语音表征的能力,可以直接输入原始语音信号,而不是使用传统的手工创建特征。此外,使用标准 CNN 学习的滤波器会出现噪音,因为它们会考虑每个滤波器的所有元素。相比之下,sincNet 只需将第一个卷积层替换为标准 CNN 中的 sinc 层,就能生成更有意义的滤波器。矩形带通滤波器的低频和高频截止频率是 sincNet 中唯一可以学习的参数,它有可能从说话者身上提取重要的语音线索,如音高和声调。在这项工作中,sincNet 模型通过从基线 Mel scale 初始化转换为等效矩形带宽 (ERB) 初始化而发生了显著变化,其额外的好处是在频谱的较低区域分配了额外的滤波器。此外,需要强调的是,新颖的 sincNet 模型非常适合识别儿童的年龄。在朗读和自发语音任务中对说话者的识别、与性别无关的&与年龄组相关的儿童识别等方面的研究结果都优于基线模型,在准确率方面也有不同程度的相对提高。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Automatic speaker and age identification of children from raw speech using sincNet over ERB scale

This paper presents the newly developed non-native children’s English speech (NNCES) corpus to reveal the findings of automatic speaker and age recognition from raw speech. Convolutional neural networks (CNN), which have the ability to learn low-level speech representations, can be fed directly with raw speech signals instead of using traditional hand-crafted features. Moreover, the filters that were learned using standard CNNs appeared to be noisy because they consider all elements of each filter. In contrast, sincNet can be able to generate more meaningful filters simply by replacing the first convolutional layer by a sinc-layer in standard CNNs. The low and high cutoff frequencies of the rectangular band-pass filter are the only parameters that can be learned in sincNet, which has the potential to extract significant speech cues from the speaker, such as pitch and formants. In this work, the sincNet model is significantly changed by switching from baseline Mel scale initializations to equivalent rectangular bandwidth (ERB) initializations, which has the added benefit of allocating additional filters in the lower region of the spectrum. Additionally, it needs to be highlighted that the novel sincNet model is well suited to identify the age of the children. The investigations on both read and spontaneous speech tasks in speaker identification, gender independent & dependent age-group identification of children outperform the baseline models with varying relative improvements in terms of accuracy.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
期刊最新文献
A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments Forms, factors and functions of phonetic convergence: Editorial Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study Zero-shot voice conversion based on feature disentanglement Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1