Multimodal deep hierarchical semantic-aligned matrix factorization method for micro-video multi-label classification

Information Processing & Management (IF 6.9; Zone 1, Management Science; JCR Q1, Computer Science, Information Systems). Pub Date: 2024-06-01. DOI: 10.1016/j.ipm.2024.103798

Fugui Fan, Yuting Su, Yun Liu, Peiguang Jing, Kaihua Qu, Yu Liu

Information Processing & Management, vol. 61, no. 5, Article 103798. https://www.sciencedirect.com/science/article/pii/S0306457324001572


Abstract

As one of the typical formats of user-generated content on social media platforms, micro-videos inherently carry multimodal characteristics associated with a group of label concepts. However, existing methods generally train a final multi-label predictor on consensus features aggregated from all modalities, overlooking fine-grained semantic dependencies between the modality and label domains. To address this problem, we present a novel multimodal deep hierarchical semantic-aligned matrix factorization (DHSAMF) method, which bridges the dual-domain semantic discrepancies and the inter-modal heterogeneity gap for multi-label classification of micro-videos. Specifically, we use deep matrix factorization to explore hierarchical modality-specific representations for each modality individually. A series of semantic embeddings is introduced to facilitate latent semantic interactions between modality-specific representations and label features in a layerwise manner. To further improve the representation ability of each modality, we leverage underlying correlation structures among instances to mine intra-modal complementary attributes, and maximize inter-modal alignment by aggregating consensus attributes in an optimal permutation. Experiments on the MTSVRC and VidOR datasets demonstrate that DHSAMF outperforms other state-of-the-art methods by nearly 3% and 4%, respectively, in terms of the AP metric.
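To make the idea of hierarchical representations from deep matrix factorization concrete, the following is a minimal sketch of a generic layer-wise factorization X ≈ Z_1 Z_2 ... Z_m H_m for a single modality. It is not the DHSAMF optimization, which the abstract does not specify: truncated SVD is swapped in as a simple stand-in for each layer, the semantic embeddings, instance correlation structures, and inter-modal alignment are omitted, and the names `layerwise_factorize` and `reconstruct` as well as the toy data are assumptions made for illustration.

```python
import numpy as np


def layerwise_factorize(X, layer_sizes):
    """Greedy deep factorization X ~= Z_1 @ ... @ Z_m @ H_m.

    Illustrative only: each layer uses a truncated SVD, which is an
    assumption of this sketch and not the method from the paper.
    Returns the basis matrices Z_i and the per-layer representations H_i.
    """
    bases, reps = [], []
    current = X
    for k in layer_sizes:
        U, s, Vt = np.linalg.svd(current, full_matrices=False)
        Z = U[:, :k]                 # basis for this layer
        H = s[:k, None] * Vt[:k]     # k-dimensional representation of each instance
        bases.append(Z)
        reps.append(H)
        current = H                  # the next layer factorizes this representation
    return bases, reps


def reconstruct(bases, H):
    """Multiply the deepest representation back through all bases."""
    out = H
    for Z in reversed(bases):
        out = Z @ out
    return out


# Toy modality: 20-dimensional features for 50 instances, with rank 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4)) @ rng.standard_normal((4, 50))

bases, reps = layerwise_factorize(X, layer_sizes=[8, 4])
X_hat = reconstruct(bases, reps[-1])
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Each entry of `reps` is a progressively more compact representation of the same instances. In a setting like the one the abstract describes, it is at these intermediate layers that label information could be coupled to the modality-specific representations.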




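Source journal

Information Processing & Management (Engineering and Technology; Computer Science: Information Systems)

CiteScore: 17.00

Self-citation rate: 11.60%

Articles published: 276

Review time: 39 days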

Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.