KMSAV: Korean multi-speaker spontaneous audiovisual dataset

IF 1.3 4区 计算机科学 Q3 ENGINEERING, ELECTRICAL & ELECTRONIC ETRI Journal Pub Date : 2024-02-28 DOI:10.4218/etrij.2023-0352
Kiyoung Park, Changhan Oh, Sunghee Dong
{"title":"KMSAV: Korean multi-speaker spontaneous audiovisual dataset","authors":"Kiyoung Park,&nbsp;Changhan Oh,&nbsp;Sunghee Dong","doi":"10.4218/etrij.2023-0352","DOIUrl":null,"url":null,"abstract":"<p>Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data supplemented with additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.</p>","PeriodicalId":11901,"journal":{"name":"ETRI Journal","volume":"46 1","pages":"71-81"},"PeriodicalIF":1.3000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.4218/etrij.2023-0352","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ETRI Journal","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.4218/etrij.2023-0352","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data supplemented with additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
KMSAV:韩国多说话者自发视听数据集
语音和视觉识别深度学习的最新进展加速了多模态语音识别的发展,并取得了许多创新成果。我们介绍了一个韩国视听语音识别语料库。该数据集包括约 150 小时的人工转录和注释视听数据,以及另外 2000 小时根据创作共用许可从 YouTube 收集的未转录视频。该数据集可供免费访问,用于无限制的研究目的。除了语料库,我们还提出了一个用于自动语音识别(ASR)和视听语音识别(AVSR)的开源框架。我们利用最先进的 ASR 和 AVSR 技术,通过预训练模型和微调过程进行评估,验证了语料库的有效性。微调后,ASR 和 AVSR 的字符错误率分别为 11.1% 和 18.9%。这种误差差异凸显了 AVSR 技术改进的必要性。我们希望我们的语料库将成为支持 AVSR 改进的重要资源。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ETRI Journal
ETRI Journal 工程技术-电信学
CiteScore
4.00
自引率
7.10%
发文量
98
审稿时长
6.9 months
期刊介绍: ETRI Journal is an international, peer-reviewed multidisciplinary journal published bimonthly in English. The main focus of the journal is to provide an open forum to exchange innovative ideas and technology in the fields of information, telecommunications, and electronics. Key topics of interest include high-performance computing, big data analytics, cloud computing, multimedia technology, communication networks and services, wireless communications and mobile computing, material and component technology, as well as security. With an international editorial committee and experts from around the world as reviewers, ETRI Journal publishes high-quality research papers on the latest and best developments from the global community.
期刊最新文献
Issue Information Free-space quantum key distribution transmitter system using WDM filter for channel integration Metaheuristic optimization scheme for quantum kernel classifiers using entanglement-directed graphs SNN eXpress: Streamlining Low-Power AI-SoC Development With Unsigned Weight Accumulation Spiking Neural Network NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1