Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription

Juan M. Perero-Codosero, Javier Antón-Martín, D. Merino, Eduardo López Gonzalo, L. A. H. Gómez
IberSPEECH Conference, published 2018-11-21
DOI: 10.21437/IBERSPEECH.2018-55
Citations: 5

Abstract

Deep Neural Networks (DNNs) are a fundamental part of current ASR. The state of the art comprises hybrid models in which the acoustic models (AM) are designed using neural networks. However, there is increasing interest in developing end-to-end Deep Learning solutions, in which a neural network is trained to predict character/grapheme or sub-word sequences that can be converted directly to words. Though several promising results have been reported for end-to-end ASR systems, it is still not clear whether they are capable of unseating hybrid systems. In this contribution, we evaluate open-source state-of-the-art hybrid and end-to-end Deep Learning ASR under the IberSpeech-RTVE Speech to Text Transcription Challenge. The hybrid ASR is based on Kaldi, while Wav2Letter is the end-to-end framework. Experiments were carried out using 6 hours of the dev1 and dev2 partitions. The lowest WER on the reference TV show (LM-20171107) was 22.23%, obtained by the hybrid system (lowercase format without punctuation). The major limitation of Wav2Letter has been its high computational demand during training (between 6 hours and 1 day per epoch, depending on the training set), which forced us to stop the training process to meet the Challenge deadline. We believe, however, that with more training time it would provide results competitive with the hybrid system.
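The abstract reports results in terms of word error rate (WER), e.g. 22.23% on the reference TV show. As an illustration of the metric, the sketch below computes WER as the word-level Levenshtein edit distance between a reference transcript and a hypothesis, normalized by the reference length. This is a generic textbook implementation for illustration only, not the Challenge's official scoring tool.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,              # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER = 0.25
print(wer("the show starts now", "the show started now"))
```

Note that, as in the paper's reported condition, hypotheses are typically normalized (e.g. lowercased, punctuation removed) before scoring, since casing and punctuation mismatches would otherwise count as word errors.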