Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements

IF 7.2 · CAS Region 4 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · ACM Transactions on Intelligent Systems and Technology · Pub Date: 2024-03-12 · DOI: 10.1145/3645099
Weidong He, Zhi Li, Hao Wang, Tong Xu, Zhefeng Wang, Baoxing Huai, Nicholas Jing Yuan, Enhong Chen

Abstract

Multimodal conversation systems have recently garnered significant attention across industries such as travel and retail. While pioneering works in this field have shown promising performance, they often focus solely on context information at the utterance level, overlooking the context-aware dependencies among multimodal semantic elements such as words and images. Furthermore, the ordinal information of images, which indicates the relevance between the visual context and users’ demands, remains underutilized during the integration of visual content. Additionally, how to effectively utilize the attributes provided by users when searching for desired products remains largely unexplored. To address these challenges, we propose a Position-aware Multimodal diAlogue system with semanTic Elements, abbreviated as PMATE. Specifically, to obtain semantic representations at the element level, we first unfold the multimodal historical utterances and devise a position-aware multimodal element-level encoder. This component considers all images that may be relevant to the current turn and introduces a novel position-aware image selector to choose related images before fusing the information from the two modalities. Finally, we present a knowledge-aware two-stage decoder and an attribute-enhanced image searcher for generating textual responses and selecting image responses, respectively. We extensively evaluate our model on two large-scale multimodal dialog datasets, and the experimental results demonstrate that our approach outperforms several baseline methods.
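The abstract does not specify how the position-aware image selector works internally. As a purely illustrative sketch (the class name, the additive position embedding, and the scoring scheme below are all assumptions, not the PMATE architecture), one plausible shape is to score each candidate image against a pooled dialogue representation while injecting its ordinal position in the history:

```python
import torch
import torch.nn as nn

class PositionAwareImageSelector(nn.Module):
    """Hypothetical sketch, not the paper's method: score candidate
    images against the dialogue context, injecting each image's
    ordinal position in the conversation history."""

    def __init__(self, dim: int, max_images: int = 16):
        super().__init__()
        # One embedding per ordinal slot, encoding where in the
        # history each image appeared.
        self.pos_emb = nn.Embedding(max_images, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, context: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
        # context: (dim,) pooled representation of the dialogue history
        # images:  (n, dim) features of candidate images, in order of appearance
        n = images.size(0)
        positions = self.pos_emb(torch.arange(n))         # (n, dim)
        fused = images + positions                        # inject ordinal information
        logits = self.score(fused * context).squeeze(-1)  # (n,) relevance scores
        return torch.softmax(logits, dim=-1)              # selection distribution

# Toy usage with random features
selector = PositionAwareImageSelector(dim=8)
probs = selector(torch.randn(8), torch.randn(4, 8))      # one probability per image
```

The point of the sketch is only the abstract's stated idea that image order carries relevance information, so the selector conditions on position as well as content before the two modalities are fused.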
