Seeing and hearing too: Audio representation for video captioning

Shun-Po Chuang, Chia-Hung Wan, Pang-Chi Huang, Chi-Yu Yang, Hung-yi Lee
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), published 2017-12-01
DOI: 10.1109/ASRU.2017.8268961 (https://doi.org/10.1109/ASRU.2017.8268961)
Citation count: 6

Abstract

Video captioning has been widely researched, but most related work takes into account only visual content when generating descriptions. Auditory content such as human speech or environmental sounds carries rich information for describing scenes, yet it has not been widely explored for video captioning. Here, we experiment with different ways to use the auditory content of videos, and demonstrate improved caption generation as measured by popular evaluation metrics such as BLEU, CIDEr, and METEOR. We also measure the semantic similarity between generated captions and human-provided ground truth using sentence embeddings, and find that good use of multimodal content helps the machine generate captions that are more semantically related to the ground truth. When analyzing the generated sentences, we find some ambiguous situations for which visual-only models yield incorrect results but which are resolved by approaches that take auditory cues into account.
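
The embedding-based comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which sentence embedding model or aggregation rule was used, so everything here is an assumption: `embed` is a toy bag-of-words stand-in for a real sentence encoder, `caption_score` is a hypothetical helper that takes the best match over the reference captions, and the example captions are invented for illustration.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real system would use a learned sentence encoder here.
    """
    return Counter(sentence.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def caption_score(generated, references):
    """Best similarity between a generated caption and any reference caption.

    Assumption: taking the maximum over references is one possible choice,
    not necessarily the aggregation used in the paper.
    """
    g = embed(generated)
    return max(cosine_similarity(g, embed(r)) for r in references)


if __name__ == "__main__":
    # Hypothetical captions, not taken from the paper's data.
    refs = ["a man is playing a guitar", "someone plays guitar on stage"]
    print(caption_score("a man is playing a guitar", refs))
    print(caption_score("a dog runs across the grass", refs))
```

With a learned encoder in place of `embed`, the same cosine comparison would reward captions that paraphrase the ground truth, which exact n-gram overlap metrics can miss.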