Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
{"title":"ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval","authors":"Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li","doi":"arxiv-2408.02978","DOIUrl":null,"url":null,"abstract":"E-commerce is increasingly multimedia-enriched, with products exhibited in a\nbroad-domain manner as images, short videos, or live stream promotions. A\nunified and vectorized cross-domain production representation is essential. Due\nto large intra-product variance and high inter-product similarity in the\nbroad-domain scenario, a visual-only representation is inadequate. While\nAutomatic Speech Recognition (ASR) text derived from the short or live-stream\nvideos is readily accessible, how to de-noise the excessively noisy text for\nmultimodal representation learning is mostly untouched. We propose ASR-enhanced\nMultimodal Product Representation Learning (AMPere). In order to extract\nproduct-specific information from the raw ASR text, AMPere uses an\neasy-to-implement LLM-based ASR text summarizer. The LLM-summarized text,\ntogether with visual data, is then fed into a multi-branch network to generate\ncompact multimodal embeddings. Extensive experiments on a large-scale\ntri-domain dataset verify the effectiveness of AMPere in obtaining a unified\nmultimodal product representation that clearly improves cross-domain product\nretrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.02978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified and vectorized cross-domain product representation is therefore essential. Due to the large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from short or live-stream videos is readily accessible, how to de-noise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
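
The abstract outlines a two-stage pipeline: an LLM condenses the noisy ASR transcript into a product-focused summary, and a multi-branch network fuses that summary with visual features into a single compact embedding used for nearest-neighbor retrieval. The sketch below is a minimal illustration of that flow under stated assumptions, not the paper's implementation: the prompt wording, the `summarize_asr` stub, and the simple average-then-normalize fusion are all placeholders for AMPere's learned components.

```python
import numpy as np

# Hypothetical prompt for the LLM-based ASR summarizer (assumption, not
# the paper's actual prompt).
PROMPT = (
    "The following is a noisy ASR transcript from a product video. "
    "Summarize only the product-specific attributes (category, brand, "
    "color, material), ignoring chit-chat and filler:\n\n{asr_text}"
)

def summarize_asr(asr_text: str) -> str:
    """Stand-in for the LLM summarizer; a real system would send
    PROMPT.format(asr_text=asr_text) to an LLM and return its output."""
    return asr_text  # identity placeholder

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale embeddings to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse(visual_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Toy fusion: average the normalized branch embeddings.
    AMPere instead learns this fusion with a multi-branch network."""
    return l2_normalize(l2_normalize(visual_emb) + l2_normalize(text_emb))

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k cosine-similarity retrieval over L2-normalized product embeddings."""
    sims = gallery @ query
    return np.argsort(-sims)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 128
    gallery = l2_normalize(rng.normal(size=(1000, d)))  # indexed products
    query = fuse(rng.normal(size=d), rng.normal(size=d))
    print(retrieve(query, gallery))  # indices of the 5 nearest products
```

In a full system the query and gallery embeddings would come from products in different domains (image, short video, live stream), so the quality of the fused representation directly determines cross-domain retrieval accuracy.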