Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System

Hari Krishna Vydana, Sivanand Achanta, A. Vuppala
{"title":"Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System","authors":"Hari Krishna Vydana, Sivanand Achanta, A. Vuppala","doi":"10.21437/sltu.2018-36","DOIUrl":null,"url":null,"abstract":"Speaker normalization is one of the crucial aspects of an Automatic speech recognition system (ASR). Speaker normalization is employed to reduce the performance drop in ASR due to speaker variabilities. Traditional speaker normalization methods are mostly linear transforms over the input data estimated per speaker, such transforms would be efficient with sufficient data. In practical scenarios, only a single utterance from the test speaker is accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that could efficiently be performed even when single utterance from the unseen speaker is available. In this work, it is hypothesized that by suitably providing information about the speaker’s identity while training an end-to-end neural network, the capability to normalize the speaker variability could be in-corporated into an ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. In this work, the identity of the training speaker is represented in two different ways viz. i) by using a one-hot speaker code, ii) a weighted combination of all the training speakers identities. The unseen speakers from the test set are represented using a weighted combination of training speakers representations. Both the approaches have reduced the word error rate (WER) by 0.6, 1.3% WSJ corpus.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Spoken Language Technologies for Under-resourced Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/sltu.2018-36","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Speaker normalization is one of the crucial aspects of an automatic speech recognition (ASR) system. It is employed to reduce the performance degradation in ASR caused by speaker variability. Traditional speaker normalization methods are mostly linear transforms over the input data estimated per speaker; such transforms are effective only when sufficient data are available. In practical scenarios, only a single utterance from the test speaker is accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that can be performed efficiently even when only a single utterance from an unseen speaker is available. In this work, it is hypothesized that by suitably providing information about the speaker's identity while training an end-to-end neural network, the capability to normalize speaker variability can be incorporated into an ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. The identity of a training speaker is represented in two different ways: i) a one-hot speaker code, and ii) a weighted combination of all training speakers' identities. Unseen speakers from the test set are represented using a weighted combination of the training speakers' representations. The two approaches reduce the word error rate (WER) by 0.6% and 1.3%, respectively, on the WSJ corpus.
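The sketch below illustrates, in PyTorch, how this kind of speaker conditioning could be wired into an end-to-end encoder. It is a minimal assumed implementation, not the authors' exact network: the module and parameter names (`SpeakerConditionedEncoder`, `spk_table`, `query_proj`), the layer sizes, and the softmax weighting over training-speaker embeddings used for unseen speakers are illustrative choices.

```python
# Minimal sketch (assumed architecture): speaker-conditioned acoustic encoder.
# A training speaker is represented by selecting a row of an embedding table
# (equivalent to projecting a one-hot speaker code); an unseen test speaker is
# represented as a softmax-weighted combination of the training-speaker
# embeddings, with weights derived from the utterance itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=64, hidden_dim=320, n_train_speakers=283):
        super().__init__()
        # One learned embedding per training speaker.
        self.spk_table = nn.Embedding(n_train_speakers, spk_dim)
        # Utterance-level query used to weight training-speaker embeddings
        # when the speaker was not seen during training.
        self.query_proj = nn.Linear(feat_dim, spk_dim)
        # Acoustic encoder over frame features concatenated with the speaker vector.
        self.encoder = nn.LSTM(feat_dim + spk_dim, hidden_dim, num_layers=3,
                               batch_first=True, bidirectional=True)

    def speaker_vector(self, feats, spk_id=None):
        """Return a (B, spk_dim) speaker representation.

        feats:  (B, T, feat_dim) acoustic features of the utterance.
        spk_id: (B,) training-speaker indices, or None for unseen speakers.
        """
        if spk_id is not None:
            # Training speaker: one-hot code -> its embedding.
            return self.spk_table(spk_id)
        # Unseen speaker: weighted combination of all training-speaker embeddings.
        query = self.query_proj(feats.mean(dim=1))        # (B, spk_dim)
        scores = query @ self.spk_table.weight.t()        # (B, n_train_speakers)
        weights = F.softmax(scores, dim=-1)
        return weights @ self.spk_table.weight            # (B, spk_dim)

    def forward(self, feats, spk_id=None):
        spk_vec = self.speaker_vector(feats, spk_id)
        spk_vec = spk_vec.unsqueeze(1).expand(-1, feats.size(1), -1)
        enc_out, _ = self.encoder(torch.cat([feats, spk_vec], dim=-1))
        return enc_out  # passed on to the ASR decoder / output layer
```

In this arrangement, `spk_id` is supplied during training, while at test time only the single utterance is needed to form the speaker vector, which is what allows normalization from one utterance of an unseen speaker.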