Neuro-Symbolic Reasoning for Multimodal Referring Expression Comprehension in HMI Systems
Aman Jain, Anirudh Reddy Kondapally, Kentaro Yamada, Hitomi Yanaka
New Generation Computing, published 15 February 2024. DOI: 10.1007/s00354-024-00243-8 (https://doi.org/10.1007/s00354-024-00243-8)
Citations: 0
Abstract
Conventional Human–Machine Interaction (HMI) interfaces have predominantly relied on GUIs and voice commands. However, natural human communication also includes non-verbal cues, such as hand gestures like pointing. Recent work on HMI systems has therefore tried to incorporate pointing gestures as an input, making significant progress in recognizing them and integrating them with voice commands. However, existing approaches often treat these input modalities independently, limiting their capacity to handle complex multimodal instructions that require intricate reasoning over language and gestures. In the language and vision domain, on the other hand, multimodal tasks requiring complex reasoning are actively being tackled, but these typically do not include gestures such as pointing. To bridge this gap, we explore one of these challenging multimodal tasks, Referring Expression Comprehension (REC), within multimodal HMI systems that incorporate pointing gestures. We present a virtual setup in which a robot shares an environment with a user and is tasked with identifying objects based on the user’s language and gestural instructions. To address this challenge, we propose a hybrid neuro-symbolic model that combines the versatility of deep learning with the interpretability of symbolic reasoning. Our contributions include a challenging multimodal REC dataset for HMI systems, an interpretable neuro-symbolic model, and an assessment of its ability to generalize its reasoning to unseen environments, complemented by an in-depth qualitative analysis of the model’s inner workings.
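The abstract does not spell out the model's internals, but neuro-symbolic REC pipelines of the kind described are commonly organized as a neural parser that converts the instruction into a symbolic program, which an interpretable executor then runs over the scene while also taking the pointing gesture into account. The sketch below illustrates that general idea only; the class names, the hard-coded program, and the ray-distance heuristic are hypothetical and are not taken from the paper.

```
# Minimal sketch of a neuro-symbolic REC pipeline (hypothetical; not the paper's actual model).
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    name: str
    color: str
    position: np.ndarray  # 3-D coordinates in the shared environment

def distance_to_ray(obj: SceneObject, origin: np.ndarray, direction: np.ndarray) -> float:
    """Perpendicular distance from an object to the user's pointing ray."""
    d = direction / np.linalg.norm(direction)
    v = obj.position - origin
    return float(np.linalg.norm(v - np.dot(v, d) * d))

def parse_to_program(utterance: str) -> list:
    """Stand-in for the neural semantic parser that maps language to symbolic filters.
    A trained model would produce this program; here it is hard-coded for one example."""
    # "the red cup I'm pointing at" -> filter by color, category, then pointing gesture
    return [("filter_color", "red"), ("filter_category", "cup"), ("closest_to_pointing", None)]

def execute(program: list, scene: list, ray_origin: np.ndarray, ray_dir: np.ndarray) -> SceneObject:
    """Symbolic executor: applies each interpretable step to the candidate set."""
    candidates = scene
    for op, arg in program:
        if op == "filter_color":
            candidates = [o for o in candidates if o.color == arg]
        elif op == "filter_category":
            candidates = [o for o in candidates if o.name == arg]
        elif op == "closest_to_pointing":
            candidates = [min(candidates, key=lambda o: distance_to_ray(o, ray_origin, ray_dir))]
    return candidates[0]

scene = [
    SceneObject("cup", "red", np.array([0.4, 0.1, 0.7])),
    SceneObject("cup", "red", np.array([1.2, 0.0, 0.3])),
    SceneObject("book", "blue", np.array([0.5, 0.2, 0.6])),
]
program = parse_to_program("the red cup I'm pointing at")
target = execute(program, scene, ray_origin=np.array([0.0, 1.5, 0.0]), ray_dir=np.array([0.3, -1.0, 0.5]))
print(target)  # the red cup nearest the pointing ray
```

Because every executed step is an explicit symbolic operation, the chain of filters itself serves as an explanation of why a particular object was selected, which is the interpretability benefit the abstract refers to.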
Journal Description:
The journal is specially intended to support the development of new computational and cognitive paradigms stemming from the cross-fertilization of various research fields. These fields include, but are not limited to, programming (logic, constraint, functional, object-oriented), distributed/parallel computing, knowledge-based systems, agent-oriented systems, and cognitive aspects of human embodied knowledge. It also encourages theoretical and/or practical papers concerning all types of learning, knowledge discovery, evolutionary mechanisms, human cognition and learning, and emergent systems that can lead to key technologies enabling us to build more complex and intelligent systems. The editorial board hopes that New Generation Computing will work as a catalyst among active researchers with broad interests by ensuring a smooth publication process.