{"title":"Multi-Modal Dialog State Tracking for Interactive Fashion Recommendation","authors":"Yaxiong Wu, C. Macdonald, I. Ounis","doi":"10.1145/3523227.3546774","DOIUrl":null,"url":null,"abstract":"Multi-modal interactive recommendation is a type of task that allows users to receive visual recommendations and express natural-language feedback about the recommended items across multiple iterations of interactions. However, such multi-modal dialog sequences (i.e. turns consisting of the system’s visual recommendations and the user’s natural-language feedback) make it challenging to correctly incorporate the users’ preferences across multiple turns. Indeed, the existing formulations of interactive recommender systems suffer from their inability to capture the multi-modal sequential dependencies of textual feedback and visual recommendations because of their use of recurrent neural network-based (i.e., RNN-based) or transformer-based models. To alleviate the multi-modal sequential dependency issue, we propose a novel multi-modal recurrent attention network (MMRAN) model to effectively incorporate the users’ preferences over the long visual dialog sequences of the users’ natural-language feedback and the system’s visual recommendations. Specifically, we leverage a gated recurrent network (GRN) with a feedback gate to separately process the textual and visual representations of natural-language feedback and visual recommendations into hidden states (i.e. representations of the past interactions) for multi-modal sequence combination. In addition, we apply a multi-head attention network (MAN) to refine the hidden states generated by the GRN and to further enhance the model’s ability in dynamic state tracking. Following previous work, we conduct extensive experiments on the Fashion IQ Dresses, Shirts, and Tops & Tees datasets to assess the effectiveness of our proposed model by using a vision-language transformer-based user simulator as a surrogate for real human users. Our results show that our proposed MMRAN model can significantly outperform several existing state-of-the-art baseline models.","PeriodicalId":443279,"journal":{"name":"Proceedings of the 16th ACM Conference on Recommender Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th ACM Conference on Recommender Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3523227.3546774","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Multi-modal interactive recommendation is a task in which users receive visual recommendations and express natural-language feedback about the recommended items across multiple iterations of interaction. However, such multi-modal dialog sequences (i.e. turns consisting of the system’s visual recommendations and the user’s natural-language feedback) make it challenging to correctly incorporate the users’ preferences across multiple turns. Indeed, existing formulations of interactive recommender systems suffer from an inability to capture the multi-modal sequential dependencies between textual feedback and visual recommendations, owing to their reliance on recurrent neural network-based (i.e., RNN-based) or transformer-based models. To alleviate this multi-modal sequential dependency issue, we propose a novel multi-modal recurrent attention network (MMRAN) model to effectively incorporate the users’ preferences over long visual dialog sequences of the users’ natural-language feedback and the system’s visual recommendations. Specifically, we leverage a gated recurrent network (GRN) with a feedback gate to separately process the textual and visual representations of the natural-language feedback and the visual recommendations into hidden states (i.e. representations of the past interactions) for multi-modal sequence combination. In addition, we apply a multi-head attention network (MAN) to refine the hidden states generated by the GRN and to further enhance the model’s ability at dynamic state tracking. Following previous work, we conduct extensive experiments on the Fashion IQ Dresses, Shirts, and Tops & Tees datasets to assess the effectiveness of our proposed model, using a vision-language transformer-based user simulator as a surrogate for real human users. Our results show that the proposed MMRAN model significantly outperforms several existing state-of-the-art baseline models.
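The abstract describes the architecture only at a high level, so the following is a minimal, illustrative PyTorch sketch of the stated idea rather than the authors' implementation: a GRU-style cell with an additional "feedback gate" that blends per-turn textual-feedback and visual-recommendation representations into a hidden state, followed by multi-head attention that refines the sequence of hidden states. All class names (FeedbackGatedCell, MMRANSketch), gating equations, and dimensions are assumptions made for illustration.

```python
# Illustrative sketch of the MMRAN idea from the abstract; not the authors' code.
import torch
import torch.nn as nn


class FeedbackGatedCell(nn.Module):
    """GRU-style cell with a hypothetical feedback gate mixing text and image inputs."""

    def __init__(self, txt_dim: int, img_dim: int, hidden_dim: int):
        super().__init__()
        in_dim = txt_dim + img_dim + hidden_dim
        self.update_gate = nn.Linear(in_dim, hidden_dim)    # z_t: how much to update
        self.feedback_gate = nn.Linear(in_dim, hidden_dim)  # f_t: weighs text vs. image
        self.txt_candidate = nn.Linear(txt_dim + hidden_dim, hidden_dim)
        self.img_candidate = nn.Linear(img_dim + hidden_dim, hidden_dim)

    def forward(self, txt_t, img_t, h_prev):
        joint = torch.cat([txt_t, img_t, h_prev], dim=-1)
        z = torch.sigmoid(self.update_gate(joint))
        f = torch.sigmoid(self.feedback_gate(joint))
        # Candidate states from each modality, blended by the feedback gate.
        h_txt = torch.tanh(self.txt_candidate(torch.cat([txt_t, h_prev], dim=-1)))
        h_img = torch.tanh(self.img_candidate(torch.cat([img_t, h_prev], dim=-1)))
        h_cand = f * h_txt + (1.0 - f) * h_img
        return (1.0 - z) * h_prev + z * h_cand


class MMRANSketch(nn.Module):
    """Unrolls the gated cell over dialog turns, then refines hidden states with attention."""

    def __init__(self, txt_dim=512, img_dim=512, hidden_dim=256, num_heads=4):
        super().__init__()
        self.cell = FeedbackGatedCell(txt_dim, img_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, txt_seq, img_seq):
        # txt_seq, img_seq: (batch, turns, dim) per-turn feedback / recommendation features.
        batch, turns, _ = txt_seq.shape
        h = txt_seq.new_zeros(batch, self.attn.embed_dim)
        states = []
        for t in range(turns):
            h = self.cell(txt_seq[:, t], img_seq[:, t], h)
            states.append(h)
        states = torch.stack(states, dim=1)             # (batch, turns, hidden)
        refined, _ = self.attn(states, states, states)  # self-attention over the turns
        return refined[:, -1]                           # tracked state after the latest turn


if __name__ == "__main__":
    model = MMRANSketch()
    txt = torch.randn(2, 5, 512)   # 5 dialog turns of natural-language feedback features
    img = torch.randn(2, 5, 512)   # 5 turns of visual-recommendation features
    print(model(txt, img).shape)   # torch.Size([2, 256])
```

In this sketch the feedback gate decides, per hidden dimension, how much of the new state should come from the textual feedback versus the visual recommendation at that turn, while the self-attention over all turn-level states stands in for the MAN refinement step described in the abstract.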