{"title":"Influence of Late Fusion of High-Level Features on User Relevance Feedback for Videos","authors":"O. Khan, Jan Zahálka, Björn þór Jónsson","doi":"10.1145/3552467.3554795","DOIUrl":null,"url":null,"abstract":"Content-based media retrieval relies on multimodal data representations. For videos, these representations mainly focus on the textual, visual, and audio modalities. While the modality representations can be used individually, combining their information can improve the overall retrieval experience. For video collections, retrieval focuses on either finding a full length video or specific segment(s) from one or more videos. For the former, the textual metadata along with broad descriptions of the contents are useful. For the latter, visual and audio modality representations are preferable as they represent the contents of specific segments in videos. Interactive learning approaches, such as user relevance feedback, have shown promising results when solving exploration and search tasks in larger collections. When combining modality representations in user relevance feedback, often a form of late modality fusion method is applied. While this generally tends to improve retrieval, its performance for video collections with multiple modality representations of high-level features, is not well known. In this study we analyse the effects of late fusion using high-level features, such as semantic concepts, actions, scenes, and audio. From our experiments on three video datasets, V3C1, Charades, and VGG-Sound, we show that fusion works well, but depending on the task or dataset, excluding one or more modalities can improve results. When it is clear that a modality is better for a task, setting a preference to enhance that modality's influence in the fusion process can also be greatly beneficial. Furthermore, we show that mixing fusion results and results from individual modalities can be better than only performing fusion.","PeriodicalId":168191,"journal":{"name":"Proceedings of the 2nd International Workshop on Interactive Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd International Workshop on Interactive Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3552467.3554795","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Content-based media retrieval relies on multimodal data representations. For videos, these representations mainly cover the textual, visual, and audio modalities. While each modality representation can be used individually, combining their information can improve the overall retrieval experience. For video collections, retrieval focuses on either finding a full-length video or finding specific segment(s) from one or more videos. For the former, textual metadata along with broad descriptions of the contents are useful. For the latter, visual and audio modality representations are preferable, as they describe the contents of specific segments within videos. Interactive learning approaches, such as user relevance feedback, have shown promising results for exploration and search tasks in large collections. When combining modality representations in user relevance feedback, a form of late modality fusion is often applied. While this generally tends to improve retrieval, its performance on video collections with multiple modality representations of high-level features is not well known. In this study, we analyse the effects of late fusion using high-level features, such as semantic concepts, actions, scenes, and audio. From our experiments on three video datasets (V3C1, Charades, and VGG-Sound), we show that fusion works well, but that, depending on the task or dataset, excluding one or more modalities can improve results. When it is clear that a modality is better suited to a task, setting a preference that increases that modality's influence in the fusion process can also be greatly beneficial. Furthermore, we show that mixing fused results with results from individual modalities can outperform fusion alone.
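To make the idea of late fusion with modality preferences concrete, the sketch below combines per-modality relevance scores for video segments with a weighted sum after normalisation. This is a minimal illustration only: the function names, the min-max normalisation, and the example weights are assumptions for exposition, not the exact fusion scheme evaluated in the paper.

```python
# Minimal sketch of score-level late fusion across modality representations.
# Normalisation choice, weights, and all names below are illustrative assumptions.
from typing import Dict, List


def min_max_normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {item: (s - lo) / span for item, s in scores.items()}


def late_fusion(
    modality_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[str]:
    """Combine per-modality scores with a weighted sum and return a ranking.

    modality_scores: modality name -> {item id -> relevance score}
    weights: modality name -> preference weight (higher = more influence)
    """
    fused: Dict[str, float] = {}
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 1.0)
        for item, s in min_max_normalise(scores).items():
            fused[item] = fused.get(item, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical example: semantic-concept and audio scores for three segments,
# with a preference boost for the concept modality.
concepts = {"seg1": 0.9, "seg2": 0.4, "seg3": 0.2}
audio = {"seg1": 0.1, "seg2": 0.8, "seg3": 0.3}
ranking = late_fusion(
    {"concepts": concepts, "audio": audio},
    weights={"concepts": 2.0, "audio": 1.0},
)
print(ranking)
```

A preference for one modality, as discussed in the abstract, corresponds here simply to raising its weight; mixing fused and individual-modality results would interleave `ranking` with the rankings induced by `concepts` or `audio` alone.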