Multimodal Quasi-AutoRegression: forecasting the visual popularity of new fashion products

International Journal of Multimedia Information Retrieval · Impact Factor: 3.6 · JCR: Q2 (Computer Science, Artificial Intelligence) · CAS partition: Tier 3 (Computer Science)
Publication date: 2022-04-08 · DOI: 10.48550/arXiv.2204.04014
Stefanos Papadopoulos, C. Koutlis, S. Papadopoulos, Y. Kompatsiaris
{"title":"Multimodal Quasi-AutoRegression: forecasting the visual popularity of new fashion products","authors":"Stefanos Papadopoulos, C. Koutlis, S. Papadopoulos, Y. Kompatsiaris","doi":"10.48550/arXiv.2204.04014","DOIUrl":null,"url":null,"abstract":"Estimating the preferences of consumers is of utmost importance for the fashion industry as appropriately leveraging this information can be beneficial in terms of profit. Trend detection in fashion is a challenging task due to the fast pace of change in the fashion industry. Moreover, forecasting the visual popularity of new garment designs is even more demanding due to lack of historical data. To this end, we propose MuQAR, a Multimodal Quasi-AutoRegressive deep learning architecture that combines two modules: (1) a multimodal multilayer perceptron processing categorical, visual and textual features of the product and (2) a Quasi-AutoRegressive neural network modelling the “target” time series of the product’s attributes along with the “exogenous” time series of all other attributes. We utilize computer vision, image classification and image captioning, for automatically extracting visual features and textual descriptions from the images of new products. Product design in fashion is initially expressed visually and these features represent the products’ unique characteristics without interfering with the creative process of its designers by requiring additional inputs (e.g. manually written texts). We employ the product’s target attributes time series as a proxy of temporal popularity patterns, mitigating the lack of historical data, while exogenous time series help capture trends among interrelated attributes. We perform an extensive ablation analysis on two large-scale image fashion datasets, Mallzee-P and SHIFT15m to assess the adequacy of MuQAR and also use the Amazon Reviews: Home and Kitchen dataset to assess generalization to other domains. A comparative study on the VISUELLE dataset shows that MuQAR is capable of competing and surpassing the domain’s current state of the art by 4.65% and 4.8% in terms of WAPE and MAE, respectively.","PeriodicalId":48501,"journal":{"name":"International Journal of Multimedia Information Retrieval","volume":null,"pages":null},"PeriodicalIF":3.6000,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Multimedia Information Retrieval","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.48550/arXiv.2204.04014","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 7

Abstract

Estimating the preferences of consumers is of utmost importance for the fashion industry, as appropriately leveraging this information can be beneficial in terms of profit. Trend detection in fashion is a challenging task due to the fast pace of change in the industry. Moreover, forecasting the visual popularity of new garment designs is even more demanding due to the lack of historical data. To this end, we propose MuQAR, a Multimodal Quasi-AutoRegressive deep learning architecture that combines two modules: (1) a multimodal multilayer perceptron processing categorical, visual and textual features of the product and (2) a Quasi-AutoRegressive neural network modelling the “target” time series of the product’s attributes along with the “exogenous” time series of all other attributes. We utilize computer vision techniques, namely image classification and image captioning, to automatically extract visual features and textual descriptions from the images of new products. Product design in fashion is initially expressed visually, and these features capture a product’s unique characteristics without interfering with its designers’ creative process by requiring additional inputs (e.g. manually written texts). We employ the time series of the product’s target attributes as a proxy for temporal popularity patterns, mitigating the lack of historical data, while the exogenous time series help capture trends among interrelated attributes. We perform an extensive ablation analysis on two large-scale fashion image datasets, Mallzee-P and SHIFT15m, to assess the adequacy of MuQAR, and also use the Amazon Reviews: Home and Kitchen dataset to assess generalization to other domains. A comparative study on the VISUELLE dataset shows that MuQAR is competitive with the domain’s current state of the art and surpasses it by 4.65% and 4.8% in terms of WAPE and MAE, respectively.
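To make the two-module design concrete, the following is a minimal PyTorch sketch of the architecture as the abstract describes it. The layer sizes, the GRU-based sequence encoder, and the concatenation fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MultimodalMLP(nn.Module):
    """Module (1): fuses categorical, visual and textual product features."""

    def __init__(self, cat_dim, vis_dim, txt_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cat_dim + vis_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )

    def forward(self, cat_x, vis_x, txt_x):
        return self.mlp(torch.cat([cat_x, vis_x, txt_x], dim=-1))


class QuasiAutoRegressive(nn.Module):
    """Module (2): encodes the target attribute's popularity time series
    jointly with the exogenous time series of all other attributes.
    A GRU is an assumed stand-in for the paper's sequence model."""

    def __init__(self, n_target=1, n_exog=10, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_target + n_exog, hidden, batch_first=True)

    def forward(self, target_ts, exog_ts):
        # target_ts: (batch, steps, n_target); exog_ts: (batch, steps, n_exog)
        _, h = self.rnn(torch.cat([target_ts, exog_ts], dim=-1))
        return h[-1]  # last hidden state summarizes the temporal context


class MuQAR(nn.Module):
    def __init__(self, cat_dim, vis_dim, txt_dim, n_exog, hidden=256):
        super().__init__()
        self.multimodal = MultimodalMLP(cat_dim, vis_dim, txt_dim, hidden)
        self.qar = QuasiAutoRegressive(1, n_exog, hidden)
        self.head = nn.Linear(2 * hidden, 1)  # scalar popularity score

    def forward(self, cat_x, vis_x, txt_x, target_ts, exog_ts):
        fused = torch.cat(
            [self.multimodal(cat_x, vis_x, txt_x),
             self.qar(target_ts, exog_ts)],
            dim=-1,
        )
        return self.head(fused)


# Example: 8 products, 52 weekly time steps, 10 exogenous attribute series.
model = MuQAR(cat_dim=32, vis_dim=512, txt_dim=300, n_exog=10)
score = model(torch.randn(8, 32), torch.randn(8, 512), torch.randn(8, 300),
              torch.randn(8, 52, 1), torch.randn(8, 52, 10))  # shape (8, 1)
```

The key idea the sketch illustrates is that static product features (what the garment looks like) and temporal attribute popularity (how such garments have trended) are encoded separately and fused before the final prediction, so a brand-new product with no sales history can still be scored.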
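For reference, the two metrics quoted in the VISUELLE comparison are standard forecasting errors; a minimal NumPy sketch of both (the paper's own evaluation script may differ in detail):

```python
import numpy as np


def wape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Weighted Absolute Percentage Error: total absolute error
    normalized by the total absolute actual value."""
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error: average absolute deviation per forecast."""
    return float(np.abs(y_true - y_pred).mean())
```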
Source journal
CiteScore: 7.80 · Self-citation rate: 5.40% · Articles published per year: 36
Journal introduction: The International Journal of Multimedia Information Retrieval (IJMIR) is a scholarly archival journal publishing original, peer-reviewed research contributions. Its editorial board strives to present the most important research results in areas within the field of multimedia information retrieval. Core areas include exploration, search, and mining in general collections of multimedia consisting of information from the WWW to scientific imaging to personal archives. Comprehensive review and survey papers that offer up new insights, and lay the foundations for further exploratory and experimental work, are also relevant.

Relevant topics include:
- Image and video retrieval: theory, algorithms, and systems
- Social media interaction and retrieval: collaborative filtering, social voting and ranking
- Music and audio retrieval: theory, algorithms, and systems
- Scientific and bio-imaging: MRI, X-ray, ultrasound imaging analysis and retrieval
- Semantic learning: visual concept detection, object recognition, and tag learning
- Exploration of media archives: browsing, experiential computing
- Interfaces: multimedia exploration, visualization, query and retrieval
- Multimedia mining: life logs, WWW media mining, pervasive media analysis
- Interactive search: interactive learning and relevance feedback in multimedia retrieval
- Distributed and high-performance media search: efficient and very large scale search
- Applications: preserving cultural heritage, 3D graphics models, etc.

Editorial policies: the journal aims for a fast decision time (less than 4 months for the initial decision); there are no page charges in IJMIR; papers are published online in advance of print publication. Academic and industrial researchers and practitioners involved with multimedia search, exploration, and mining will find IJMIR to be an essential source for important results in the field.
Latest articles in this journal:
- LSECA: local semantic enhancement and cross aggregation for video-text retrieval
- Mual: enhancing multimodal sentiment analysis with cross-modal attention and difference loss
- Human action recognition using an optical flow-gated recurrent neural network
- Domain-specific image captioning: a comprehensive review
- Multi-knowledge-driven enhanced module for visible-infrared cross-modal person re-identification