Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
{"title":"ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval","authors":"Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li","doi":"arxiv-2408.02978","DOIUrl":null,"url":null,"abstract":"E-commerce is increasingly multimedia-enriched, with products exhibited in a\nbroad-domain manner as images, short videos, or live stream promotions. A\nunified and vectorized cross-domain production representation is essential. Due\nto large intra-product variance and high inter-product similarity in the\nbroad-domain scenario, a visual-only representation is inadequate. While\nAutomatic Speech Recognition (ASR) text derived from the short or live-stream\nvideos is readily accessible, how to de-noise the excessively noisy text for\nmultimodal representation learning is mostly untouched. We propose ASR-enhanced\nMultimodal Product Representation Learning (AMPere). In order to extract\nproduct-specific information from the raw ASR text, AMPere uses an\neasy-to-implement LLM-based ASR text summarizer. The LLM-summarized text,\ntogether with visual data, is then fed into a multi-branch network to generate\ncompact multimodal embeddings. Extensive experiments on a large-scale\ntri-domain dataset verify the effectiveness of AMPere in obtaining a unified\nmultimodal product representation that clearly improves cross-domain product\nretrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.02978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified and vectorized cross-domain product representation is therefore essential. Due to the large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from short or live-stream videos is readily accessible, how to de-noise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
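
The abstract outlines a two-stage pipeline: an LLM condenses the noisy ASR transcript into a product-focused summary, and a multi-branch network fuses that summary with visual features into a single compact embedding used for nearest-neighbor retrieval. The sketch below is a minimal illustration of that flow under stated assumptions, not the paper's implementation: the prompt wording, the `summarize_asr` stub, and the simple average-then-normalize fusion are all placeholders for AMPere's learned components.

```python
import numpy as np

# Hypothetical prompt for the LLM-based ASR summarizer (assumption, not
# the paper's actual prompt).
PROMPT = (
    "The following is a noisy ASR transcript from a product video. "
    "Summarize only the product-specific attributes (category, brand, "
    "color, material), ignoring chit-chat and filler:\n\n{asr_text}"
)

def summarize_asr(asr_text: str) -> str:
    """Stand-in for the LLM summarizer; a real system would send
    PROMPT.format(asr_text=asr_text) to an LLM and return its output."""
    return asr_text  # identity placeholder

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale embeddings to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse(visual_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Toy fusion: average the normalized branch embeddings.
    AMPere instead learns this fusion with a multi-branch network."""
    return l2_normalize(l2_normalize(visual_emb) + l2_normalize(text_emb))

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k cosine-similarity retrieval over L2-normalized product embeddings."""
    sims = gallery @ query
    return np.argsort(-sims)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 128
    gallery = l2_normalize(rng.normal(size=(1000, d)))  # indexed products
    query = fuse(rng.normal(size=d), rng.normal(size=d))
    print(retrieve(query, gallery))  # indices of the 5 nearest products
```

In a full system the query and gallery embeddings would come from products in different domains (image, short video, live stream), so the quality of the fused representation directly determines cross-domain retrieval accuracy.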