Multi sentence description of complex manipulation action videos

IF 2.4 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Artificial Intelligence) | Machine Vision and Applications | Pub Date: 2024-05-09 | DOI: 10.1007/s00138-024-01547-x
Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter
{"title":"Multi sentence description of complex manipulation action videos","authors":"Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter","doi":"10.1007/s00138-024-01547-x","DOIUrl":null,"url":null,"abstract":"<p>Automatic video description necessitates generating natural language statements that encapsulate the actions, events, and objects within a video. An essential human capability in describing videos is to vary the level of detail, a feature that existing automatic video description methods, which typically generate single, fixed-level detail sentences, often overlook. This work delves into video descriptions of manipulation actions, where varying levels of detail are crucial to conveying information about the hierarchical structure of actions, also pertinent to contemporary robot learning techniques. We initially propose two frameworks: a hybrid statistical model and an end-to-end approach. The hybrid method, requiring significantly less data, statistically models uncertainties within video clips. Conversely, the end-to-end method, more data-intensive, establishes a direct link between the visual encoder and the language decoder, bypassing any statistical processing. Furthermore, we introduce an Integrated Method, aiming to amalgamate the benefits of both the hybrid statistical and end-to-end approaches, enhancing the adaptability and depth of video descriptions across different data availability scenarios. All three frameworks utilize LSTM stacks to facilitate description granularity, allowing videos to be depicted through either succinct single sentences or elaborate multi-sentence narratives. Quantitative results demonstrate that these methods produce more realistic descriptions than other competing approaches.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"43 1","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-024-01547-x","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Automatic video description requires generating natural language statements that encapsulate the actions, events, and objects within a video. An essential human capability in describing videos is to vary the level of detail, a feature that existing automatic video description methods, which typically generate single sentences at a fixed level of detail, often overlook. This work delves into video descriptions of manipulation actions, where varying levels of detail are crucial for conveying information about the hierarchical structure of actions, which is also pertinent to contemporary robot learning techniques. We initially propose two frameworks: a hybrid statistical model and an end-to-end approach. The hybrid method, requiring significantly less data, statistically models uncertainties within video clips. Conversely, the end-to-end method, which is more data-intensive, establishes a direct link between the visual encoder and the language decoder, bypassing any statistical processing. Furthermore, we introduce an Integrated Method that aims to combine the benefits of the hybrid statistical and end-to-end approaches, enhancing the adaptability and depth of video descriptions across different data availability scenarios. All three frameworks utilize LSTM stacks to facilitate description granularity, allowing videos to be depicted through either succinct single sentences or elaborate multi-sentence narratives. Quantitative results demonstrate that these methods produce more realistic descriptions than competing approaches.
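As context for the abstract's end-to-end variant, which links a visual encoder directly to a language decoder built from LSTM stacks, the following is a minimal, illustrative sketch of that general encoder-decoder pattern in PyTorch. It is not the authors' implementation: the class name, dimensions, mean-pooled video encoding, and teacher-forced decoding loop are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code): a visual encoder feeding
# a stacked-LSTM language decoder for video captioning.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 num_layers=2):  # num_layers > 1 gives the "LSTM stack"
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)  # project frame features
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim,
                               num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) pre-extracted per-frame CNN features
        # captions:    (B, L) token ids, used here for teacher forcing
        enc = self.encoder(frame_feats).mean(dim=1)     # pooled video code (B, hidden)
        h0 = enc.unsqueeze(0).repeat(self.decoder.num_layers, 1, 1)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(dec_out)                        # (B, L, vocab) logits

model = VideoCaptioner()
feats = torch.randn(4, 16, 2048)         # 4 clips, 16 frames of CNN features each
caps = torch.randint(0, 10000, (4, 12))  # 12-token target sentences
logits = model(feats, caps)              # -> (4, 12, 10000)
```

In such a setup, description granularity could be handled by conditioning the decoder on a desired detail level or by running it once per sentence of a multi-sentence narrative; how the paper realizes this is not specified in the abstract.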

Source journal
Machine Vision and Applications (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 6.30
Self-citation rate: 3.00%
Annual article count: 84
Review time: 8.7 months
Journal description: Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submissions in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision are all within the scope of the journal. Particular emphasis is placed on engineering and technology aspects of image processing and computer vision. The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.
Latest articles from this journal
- A novel key point based ROI segmentation and image captioning using guidance information
- Specular Surface Detection with Deep Static Specular Flow and Highlight
- Removing cloud shadows from ground-based solar imagery
- Underwater image object detection based on multi-scale feature fusion
- Object Recognition Consistency in Regression for Active Detection